Cognition: Overview
Cognition, in this codex, is everything that happens between sensors and effectors that is not itself a sensor or effector. It is the work performed inside the hourglass.
Six surfaces are documented in this part: language, vision, multimodal fusion, reasoning & planning, memory, and world models. The seventh surface, embodiment, is documented in Part II because its constraints are sufficiently distinct to warrant separate treatment.
The shared latent
Every cognitive surface reads from and writes to the shared latent specified in The Shape of AGI. Concretely, the latent is a sequence of d-dimensional vectors with d = 4096 and a variable length of up to 8,192 positions. Surfaces communicate by enqueueing tokens into this sequence and consuming tokens from it.
The latent is not natural language. It is a learned alphabet of approximately 65,536 discrete codes, augmented by continuous residuals. A language tokenizer is one encoder among many; an image patch encoder is another; a proprioceptive encoder is a third.
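A concrete sketch may help. Assuming only the figures above (d = 4096, at most 8,192 positions, a codebook of roughly 65,536 discrete codes plus continuous residuals), the latent can be pictured as the structure below; the names LatentToken and SharedLatent are hypothetical, invented for illustration, not part of the codex.

from dataclasses import dataclass, field

import numpy as np

D_MODEL = 4096         # dimensionality of each latent vector
MAX_POSITIONS = 8192   # maximum length of the shared sequence
CODEBOOK_SIZE = 65536  # size of the learned discrete alphabet

@dataclass
class LatentToken:
    code: int             # index into the learned codebook
    residual: np.ndarray  # continuous d-dimensional correction

@dataclass
class SharedLatent:
    tokens: list[LatentToken] = field(default_factory=list)

    def enqueue(self, token: LatentToken) -> None:
        """A surface writes one token into the shared sequence."""
        if not 0 <= token.code < CODEBOOK_SIZE:
            raise ValueError("code outside the learned alphabet")
        if token.residual.shape != (D_MODEL,):
            raise ValueError("residual must be d-dimensional")
        if len(self.tokens) >= MAX_POSITIONS:
            raise OverflowError("latent sequence is full")
        self.tokens.append(token)

    def view(self) -> list[LatentToken]:
        """A surface reads whatever tokens are currently present."""
        return list(self.tokens)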
What surfaces share
All cognitive surfaces share:
- The latent (above).
- A common attention substrate. Every cognitive module is a transformer variant: multi-head attention with rotary position embeddings, SwiGLU feed-forward layers, and RMSNorm. A minimal block is sketched after this list.
- A common parameter sharding scheme. Surfaces can be served independently or as a fused stack.
- A common evaluation protocol. Every surface has an eval suite versioned at the same cadence as its weights.
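The attention substrate admits a compact sketch. The block below follows the standard published definitions of rotary position embeddings (rotate-half form), SwiGLU, and RMSNorm; the head count, hidden width, and causal mask are illustrative assumptions, since the codex fixes only d = 4096.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d))

    def forward(self, x):
        # Normalize by root mean square; no mean subtraction, no bias.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def rope(x):
    # Rotary position embeddings, rotate-half form: rotate feature
    # pairs (i, i + d/2) by an angle proportional to the position.
    b, h, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000.0 ** (torch.arange(half) / half))
    angles = torch.arange(t)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class Block(nn.Module):
    def __init__(self, d: int = 4096, heads: int = 32):  # head count assumed
        super().__init__()
        self.heads, self.d_head = heads, d // heads
        self.norm1, self.norm2 = RMSNorm(d), RMSNorm(d)
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d, bias=False)
        # SwiGLU feed-forward: down(SiLU(gate(x)) * up(x)); the 4x
        # hidden width is an assumption, not specified by the codex.
        self.gate = nn.Linear(d, 4 * d, bias=False)
        self.up = nn.Linear(d, 4 * d, bias=False)
        self.down = nn.Linear(4 * d, d, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        q, k = rope(q), rope(k)
        # Causal masking is an assumption; the mask may vary per surface.
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(a.transpose(1, 2).reshape(b, t, d))
        h = self.norm2(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))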
What surfaces do not share
- Action spaces. Language emits tokens. Vision emits regions or captions. Reasoning emits plans. Embodiment emits continuous controls. Each surface has its own decoder.
- Loss functions. Language is trained on next-token prediction. Vision is trained on a contrastive plus reconstruction objective. Embodiment is trained with flow matching against demonstration trajectories. The three objectives are sketched after this list.
- Latency budgets. Reasoning may take seconds. Inference for embodied control must complete in tens of milliseconds.
- Validator hardware tiers. A validator that serves embodiment may have stricter GPU requirements than one that serves language only. See Validators.
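The divergence in loss functions is easiest to see side by side. The helper names and tensor shapes below are hypothetical; the flow-matching variant shown (a linear interpolation path with a constant-velocity regression target) is one common formulation, not necessarily the one the codex prescribes.

import torch
import torch.nn.functional as F

def language_loss(logits, targets):
    # Next-token prediction: cross-entropy over the vocabulary.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

def vision_loss(img_emb, txt_emb, recon, pixels, temp=0.07):
    # Contrastive (InfoNCE over unit-normalized embeddings of a paired
    # image/text batch) plus pixel reconstruction.
    logits = img_emb @ txt_emb.t() / temp
    labels = torch.arange(logits.size(0))
    return F.cross_entropy(logits, labels) + F.mse_loss(recon, pixels)

def embodiment_loss(model, traj, noise, t):
    # Flow matching against a demonstration trajectory: regress the
    # model's predicted velocity toward the constant velocity of the
    # straight path from noise to the demonstration. t broadcasts,
    # e.g. shape (batch, 1, 1) against (batch, horizon, dof).
    x_t = (1 - t) * noise + t * traj
    return F.mse_loss(model(x_t, t), traj - noise)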
Composition
Surfaces compose. A typical query traverses multiple surfaces in sequence:
user query (text) → language encoder → latent
↓
reasoning over latent
↓
memory recall conditioned on latent
↓
vision encoder consumes image, writes latent
↓
embodiment decoder emits motor plan
This composition is not orchestrated by an external scheduler. Each surface is a transformer block that attends to whatever latent tokens are present and emits its own tokens. A run is complete when all active surfaces produce no new tokens for one step; a sketch of this loop follows.
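Read operationally, that quiescence rule is a fixed-point loop over the surfaces. The Surface.step interface below is hypothetical; the codex specifies only the termination condition.

def run(surfaces, latent):
    """Step every active surface until none emits a new token."""
    while True:
        emitted = 0
        for surface in surfaces:
            # Each surface attends to whatever latent tokens are
            # present and may enqueue new tokens of its own.
            for token in surface.step(latent):
                latent.enqueue(token)
                emitted += 1
        if emitted == 0:  # a quiescent step: the run is complete
            return latent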
What is not in this part
This part does not specify:
- The wire format between surfaces (Part IV, Network).
- The training data for each surface (Part III, Learning).
- The economic incentives for serving each surface (Part V, Protocol).
The remaining chapters of Part I treat each surface in isolation. The reader should keep the hourglass and the shared latent in mind throughout; read separately, the surfaces appear more independent than they are.