OGI

The Shape of AGI

A general intelligence has a shape. This document specifies it.

The seven surfaces

The codex partitions cognition into seven surfaces. The partition is not arbitrary; it reflects the fact that different surfaces have different training signals, different evaluation regimes, and different latency budgets.

  #   Surface        Native modality              Native action
  1   Language       Token streams                Token streams
  2   Vision         Pixel grids, point clouds    Region selections, captions
  3   Audio          Waveforms, spectrograms      Transcriptions, prosody
  4   Reasoning      Mixed                        Plans, proofs, decisions
  5   Memory         Mixed                        Recall, summary, abstention
  6   World models   Mixed                        Rollouts, predictions
  7   Embodiment     Joint states, forces         Continuous control

Each surface is documented separately. The codex does not claim that intelligence is exhausted by these seven, but it claims that no narrower partition is sufficient.

The hourglass

The seven surfaces are not stacked. They share a narrow middle layer, a learned representation space, through which every surface communicates.

   language    vision    audio    reasoning   memory   world    embodiment
       \         |         |         |         |        |          /
        \        |         |         |         |        |         /
         \       |         |         |         |        |        /
                       ─── shared latent ───
         /       |         |         |         |        |        \
        /        |         |         |         |        |         \
       /         |         |         |         |        |          \
    decoders & action heads (per surface)

The shared latent is a sequence of tokens, distinct from natural-language tokens, that any surface can produce or consume. Concretely, language encoders, vision encoders, and proprioceptive encoders all emit into a common embedding space; reasoning and memory operate over this space; surface-specific decoders emit final outputs.
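The hourglass can be sketched in a few lines of code. Everything below is illustrative, not the codex's implementation: the class names, the tiny latent width, and the stand-in encode/decode logic are all hypothetical. The point is only the topology: every cross-surface path goes through one shared latent.

```python
# A minimal sketch of the hourglass: per-surface encoders map native inputs
# into one shared latent space; per-surface decoders map latents back out.
# All names and dimensions here are illustrative, not from the codex.

LATENT_DIM = 4  # width of the shared latent (tiny, for illustration)

class Encoder:
    """Maps a surface's native modality into shared-latent tokens."""
    def __init__(self, name):
        self.name = name
    def encode(self, native_input):
        # Stand-in for a learned encoder: pad/truncate to LATENT_DIM floats.
        vals = [float(x) for x in native_input][:LATENT_DIM]
        return vals + [0.0] * (LATENT_DIM - len(vals))

class Decoder:
    """Maps shared-latent tokens into a surface's native action."""
    def __init__(self, name):
        self.name = name
    def decode(self, latent):
        # Stand-in for a learned decoder: emit a labeled summary.
        return f"{self.name}-action({sum(latent):.1f})"

class Hourglass:
    def __init__(self):
        self.encoders, self.decoders = {}, {}
    def add_surface(self, name):
        self.encoders[name] = Encoder(name)
        self.decoders[name] = Decoder(name)
    def run(self, src, dst, native_input):
        # Every cross-surface path is two hops through the shared latent.
        latent = self.encoders[src].encode(native_input)
        return self.decoders[dst].decode(latent)

hg = Hourglass()
for s in ["language", "vision", "audio", "reasoning",
          "memory", "world", "embodiment"]:
    hg.add_surface(s)

# A vision input decoded as a language action: one pass through the latent.
print(hg.run("vision", "language", [1, 2, 3]))
```

Note that no surface talks to another directly; adding a surface adds one encoder and one decoder, never a pairwise bridge.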

This is the hourglass. It is the load-bearing assumption of the entire codex: surfaces converge on a shared representation and diverge again into surface-specific outputs. Without it, the protocol could not fund "a model"; it would have to fund many.

Why this shape

Three forces select for the hourglass:

Sample efficiency. A surface trained against the shared latent inherits competence from every other surface trained against the same latent. A vision system trained alongside a language system learns to ground entities; a manipulation system trained alongside a vision system learns to plan around occlusions.

Transfer. New surfaces (e.g., olfaction, kinesthesia, programs-as-actions) can be added by training a new encoder and decoder against the existing latent. The cost of adding a surface is bounded.
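The bound can be made concrete with a back-of-envelope parameter count. The dimensions below are made up for illustration; the claim being sketched is only that the marginal cost of a surface is its own encoder and decoder, with zero new parameters in the shared trunk.

```python
# Illustrative sketch: adding a surface costs one new encoder into the
# latent plus one new decoder out of it. The trunk stays frozen.
# All surface names and dimensions here are hypothetical.

def param_count(in_dim, out_dim):
    # Parameters of a single linear layer: weights + biases.
    return in_dim * out_dim + out_dim

LATENT_DIM = 512

surfaces = {  # surface -> (native input dim, native action dim)
    "language":  (32000, 32000),
    "vision":    (3072, 1000),
    "olfaction": (128, 16),   # a hypothetical new surface
}

def cost_of_adding(surface):
    in_dim, out_dim = surfaces[surface]
    encoder = param_count(in_dim, LATENT_DIM)   # new encoder into the latent
    decoder = param_count(LATENT_DIM, out_dim)  # new decoder out of it
    return encoder + decoder                    # trunk contributes 0

print(cost_of_adding("olfaction"))
```

The cost depends only on the new surface's own dimensions and the latent width, not on how many surfaces already exist.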

Verifiability. A network of validators verifying surface-specific outputs is straightforward. A network verifying a single shared representation is borderline impossible. By keeping the shared layer narrow and the surfaces wide, the protocol's verification burden lives where it can be paid.
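The asymmetry can be sketched directly: a validator for a surface is an ordinary predicate over that surface's native outputs, while no comparable predicate exists for the latent. The checks below are toy stand-ins, not the protocol's actual verification rules.

```python
# Sketch of surface-level verification: validators apply surface-specific
# predicates to native outputs; the shared latent itself is never audited.
# Both checks are toy examples, not the protocol's rules.

def check_language(output):
    # e.g. the output must be a non-empty token string
    return isinstance(output, str) and len(output) > 0

def check_embodiment(output):
    # e.g. joint commands must stay inside actuator limits
    return all(-1.0 <= u <= 1.0 for u in output)

validators = {
    "language": check_language,
    "embodiment": check_embodiment,
}

def verify(surface, output):
    return validators[surface](output)

print(verify("language", "pick up the red block"))  # a checkable output
print(verify("embodiment", [0.2, 1.7]))             # out-of-range command
```

There is deliberately no `verify("latent", ...)` entry: a latent token sequence has no ground truth to check it against, which is why the verification burden is kept at the wide ends of the hourglass.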

What this implies for the network

Every section that follows is a consequence of this shape.

  • Cognition (Part I) documents the surfaces above the hourglass.
  • Embodiment (Part II) documents the action side of the hourglass for physical bodies.
  • Learning (Part III) documents how the hourglass is trained, surface by surface and end-to-end.
  • Network (Part IV) documents who runs the hourglass and how their work is checked.
  • Protocol (Part V) documents the asset that funds the hourglass.

If the codex reads, at any point, as a collection of unrelated subsystems, return to this document.