Language
The language surface is responsible for encoding and producing token streams in natural human languages and structured formal languages (code, math, query languages, formal logic). It is the most studied surface in the field and the most thoroughly specified here.
Architecture
The language surface is a decoder-only transformer with the following profile:
| Property | Value |
|---|---|
| Parameters | 7.0B dense (always active) + 32B in sparse experts |
| Layers | 48 |
| Attention heads | 32 |
| Hidden dimension | 4,096 (matches the shared latent) |
| Context window | 131,072 tokens |
| Vocabulary | 256,000 BPE tokens |
| Position encoding | RoPE with NTK-aware scaling |
| FFN | SwiGLU; sparse mixture-of-experts in odd layers |
The sparse-expert path is activated only when the latent indicates a domain shift requiring it; in the steady state, the dense path handles approximately 78% of forward passes. Routing is learned and observable; routing weights are part of the published checkpoint.
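The dense-default routing rule can be pictured in a short sketch. This is a minimal illustration, assuming a learned linear gate, a scalar domain-shift score, and top-2 expert selection; the module names and thresholds are hypothetical, and only the dense-by-default behavior and the learned, observable routing weights come from the profile above.

```python
# Sketch of dense-default MoE routing. The gating rule, shift score, and
# top-2 selection are illustrative assumptions, not the published checkpoint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseDefaultMoE(nn.Module):
    """FFN block that uses the dense path in the steady state and routes to
    sparse experts only when a learned score signals a domain shift."""

    def __init__(self, d_model=4096, d_ff=11008, n_experts=8, shift_threshold=0.5):
        super().__init__()
        self.dense = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                   nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)   # learned, observable routing weights
        self.shift = nn.Linear(d_model, 1)          # scores "domain shift" from the latent
        self.shift_threshold = shift_threshold

    def forward(self, x):                           # x: (tokens, d_model)
        if torch.sigmoid(self.shift(x.mean(dim=0))) < self.shift_threshold:
            return self.dense(x)                    # steady state: dense path (~78% of passes)
        weights = F.softmax(self.gate(x), dim=-1)   # (tokens, n_experts)
        top_w, top_i = weights.topk(2, dim=-1)      # route each token to its top-2 experts
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e
                if mask.any():
                    out[mask] += top_w[mask, k:k + 1] * expert(x[mask])
        return out
```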
Tokenization
A single tokenizer covers natural language, code, and protocol-level formal sublanguages (instruction syntax, embodiment specifications, math). The tokenizer uses byte fallback, so arbitrary bytes can be encoded without loss. Reserved tokens delimit modality boundaries within the shared latent (`<vision>`, `<motor>`, `<plan>`, `<memory>`).
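A minimal encoding sketch follows, assuming a greedy longest-match loop over bytes. The token IDs, the maximum piece length, and the toy vocabulary are illustrative; only the byte-fallback guarantee and the reserved-token set come from this section.

```python
# Sketch of byte-fallback encoding with reserved modality tokens.
# IDs and the greedy longest-match loop are assumptions for illustration.
RESERVED = {"<vision>": 255996, "<motor>": 255997,
            "<plan>": 255998, "<memory>": 255999}

def encode(text, vocab, max_piece=16):
    """Greedy longest-match encoding with byte fallback: any span absent
    from the vocabulary degrades to raw byte tokens (IDs 0-255), so
    arbitrary bytes round-trip without loss."""
    ids, data, i = [], text.encode("utf-8"), 0
    while i < len(data):
        for j in range(min(len(data), i + max_piece), i, -1):
            if data[i:j] in vocab:          # longest known piece wins
                ids.append(vocab[data[i:j]])
                i = j
                break
        else:
            ids.append(data[i])             # byte fallback: raw byte as its own token
            i += 1
    return ids

vocab = {b"cup": 300, b" on": 301}          # toy vocabulary
caption_ids = [RESERVED["<vision>"]] + encode("red cup on a mat", vocab)
```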
Input pathways
The language surface receives input from:
- Direct text. User-supplied or document-derived.
- Vision captions. Emitted by the vision surface and tokenized back into language.
- Memory recall. Surfaced from long-term memory by the memory surface.
- Plans. Structured outputs from the reasoning surface, rendered as tokens.
The surface does not distinguish these origins at the architectural level; they are differentiated only by the special tokens that prefix them.
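A short sketch of how a mixed-origin context might be assembled, assuming hypothetical helper names and special-token IDs; only the convention that origin is marked solely by a reserved prefix token comes from this section.

```python
# Sketch: mixed-origin inputs are concatenated into one token stream,
# distinguished only by the reserved token prefixing each segment.
# Helper names and special-token IDs are illustrative assumptions.
from typing import Callable, List, Optional

SPECIAL = {"<vision>": 255996, "<plan>": 255998, "<memory>": 255999}

def build_context(
    user_text: str,
    encode: Callable[[str], List[int]],     # any tokenizer's encode function
    caption: Optional[str] = None,          # emitted by the vision surface
    recalled: Optional[str] = None,         # surfaced by the memory surface
    plan: Optional[str] = None,             # rendered by the reasoning surface
) -> List[int]:
    ids: List[int] = []
    for tag, text in (("<vision>", caption), ("<memory>", recalled), ("<plan>", plan)):
        if text is not None:
            ids += [SPECIAL[tag]] + encode(text)
    ids += encode(user_text)                # direct text carries no modality prefix
    return ids

# Usage with a stand-in byte-level encoder:
tokens = build_context("what is on the table?", encode=lambda s: list(s.encode()),
                       caption="a red cup on a wooden table")
```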
Output pathways
The surface emits to:
- User output. Detokenized to UTF-8.
- Reasoning surface. As intermediate steps in chain-of-thought.
- Embodiment surface. As high-level instructions to be grounded into motor plans.
- Memory surface. As candidates for long-term retention.
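The four destinations can be named explicitly. The enum and handler below are a hypothetical sketch, since this section specifies only where outputs flow, not an interface; note that only the user path requires detokenization.

```python
# Sketch of output dispatch. Only Sink.USER requires detokenization;
# the other surfaces consume the raw token stream. Names are illustrative.
from enum import Enum, auto

class Sink(Enum):
    USER = auto()        # detokenized to UTF-8
    REASONING = auto()   # intermediate chain-of-thought steps
    EMBODIMENT = auto()  # high-level instructions for motor grounding
    MEMORY = auto()      # candidates for long-term retention

def emit(token_ids, sink, detokenize):
    if sink is Sink.USER:
        return detokenize(token_ids)   # bytes -> UTF-8 string for the user
    return token_ids                   # downstream surfaces take tokens as-is
```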
Training signals
The surface is trained against four objectives, mixed during pretraining and rebalanced during fine-tuning:
- Next-token prediction on the raw web crawl, deduplicated and license-filtered but otherwise unfiltered.
- Span infilling on the code subset and on the structured subset.
- Instruction following on a curated mixture of public and validator-contributed instruction data.
- Direct preference optimization on human and model-judged comparisons.
The pretraining corpus and the fine-tuning mixtures are specified in Pretraining.
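The mixing can be pictured as a weighted sum of per-objective losses. The weights below are placeholders, not the published schedule, which lives in Pretraining; fine-tuning rebalances by swapping in a different mix.

```python
# Sketch of the four-objective mixture. The weights are illustrative
# placeholders; fine-tuning rebalances by swapping in a different mix dict.
PRETRAIN_MIX = {"next_token": 0.70, "span_infill": 0.20,
                "instruction": 0.08, "dpo": 0.02}

def mixed_loss(batch, losses, mix=PRETRAIN_MIX):
    """losses maps objective name -> callable(batch) -> scalar loss."""
    return sum(w * losses[name](batch) for name, w in mix.items())

# e.g. with stand-in losses:
demo = mixed_loss(batch=None, losses={k: (lambda b: 1.0) for k in PRETRAIN_MIX})
```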
Latency budget
The language surface declares the following latency tiers:
| Tier | TTFT | Throughput | Routed to |
|---|---|---|---|
| Interactive | < 250 ms | > 80 tok/s | Default user queries |
| Bulk | < 5 s | > 200 tok/s | Long-context summarization, batch jobs |
| Background | < 60 s | best-effort | Memory consolidation, model self-talk |
Validators advertise the tier they serve. Routing across tiers is automatic; users do not select a tier.
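A routing sketch under assumed classification heuristics: the tier names and targets come from the table, while the thresholds and request fields are hypothetical.

```python
# Sketch of automatic tier routing. The classification heuristics and
# request fields are assumptions; the tier targets come from the table above.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    ttft_ms: float         # time-to-first-token target
    min_tok_per_s: float   # throughput floor (0 = best-effort)

INTERACTIVE = Tier("interactive", 250, 80)
BULK = Tier("bulk", 5_000, 200)
BACKGROUND = Tier("background", 60_000, 0)

def route(request: dict) -> Tier:
    """Users never pick a tier; the router classifies the request."""
    if request.get("origin") in ("memory_consolidation", "self_talk"):
        return BACKGROUND
    if request.get("context_tokens", 0) > 32_000 or request.get("batch"):
        return BULK
    return INTERACTIVE
```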
Verification
Language outputs are verified by the network through three mechanisms:
- Deterministic re-execution. A sampled subset of queries is re-run on a second validator with the same seed; the two outputs must agree under an exact-match criterion.
- Eval-suite sampling. A held-out eval suite is periodically run against the active checkpoint. Performance regressions trigger automatic rollback.
- Cross-surface consistency. Outputs are checked against the latent state they were produced from; outputs incompatible with the surrounding latent are penalized.
See Validators for the economic mechanics of these checks.
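The first mechanism can be sketched directly. The sampling rate and function names below are hypothetical; the same-seed, exact-match rule is from the list above.

```python
# Sketch of deterministic re-execution. The 5% sampling rate and the
# run_on interface are assumptions; the same-seed exact-match rule is not.
import random

def verify_sample(queries, run_on, validators, sample_rate=0.05, seed=1234):
    """Re-run a sampled subset on a second validator with the same seed;
    the two token streams must agree exactly."""
    mismatches = []
    rng = random.Random(seed)
    for q in rng.sample(queries, max(1, int(len(queries) * sample_rate))):
        a = run_on(validators[0], q, seed=seed)
        b = run_on(validators[1], q, seed=seed)
        if a != b:                 # exact match on the full token stream
            mismatches.append(q)
    return mismatches              # non-empty -> trigger the penalty path
```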
What this surface does not do
It does not reason. Multi-step problem solving lives in the reasoning surface, which the language surface can call. It does not remember. Persistent state across sessions lives in the memory surface. It does not act on the world. Physical action lives in the embodiment surface.
A language model that performs all four functions is, in the codex's terminology, an undocumented monolith. The codex deliberately decomposes.