Pretraining
Pretraining produces the base model that downstream stages specialize. Its purpose is to absorb broad statistical regularity from large, mixed-modality corpora before any task-specific signal is applied.
Compute-optimal scaling
Resource allocation between model size and training tokens follows the compute-optimal regime. For a compute budget $C$ in FLOPs, training loss is bounded below by

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is parameter count, $D$ is training token count, and $E$, $A$, $B$, $\alpha$, $\beta$ are fitted constants. The Chinchilla operating point, $\alpha \approx 0.34$, $\beta \approx 0.28$, implies the compute-optimal allocation

$$N_{\mathrm{opt}} \propto C^{\beta/(\alpha+\beta)} \approx C^{0.46}, \qquad D_{\mathrm{opt}} \propto C^{\alpha/(\alpha+\beta)} \approx C^{0.54},$$
i.e. compute should be split approximately evenly between scaling parameters and scaling data. Earlier work recommended a parameter-heavy split that is empirically suboptimal at fixed compute.
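As a concrete check of the allocation rule, here is a minimal sketch assuming the parametric loss above and the common $C \approx 6ND$ FLOP approximation; the exponents use the published Chinchilla fits, and the proportionality constants are omitted:

```python
# Sketch: compute-optimal (N, D) split under the Chinchilla parametric loss.
# Assumes C ~= 6 * N * D and the published fits alpha=0.34, beta=0.28.

ALPHA, BETA = 0.34, 0.28

def compute_optimal_split(C: float) -> tuple[float, float]:
    """Return (N_opt, D_opt) up to fitted constant factors, for C FLOPs."""
    a = BETA / (ALPHA + BETA)   # ~0.45: exponent for parameters
    b = ALPHA / (ALPHA + BETA)  # ~0.55: exponent for tokens
    N_opt = (C / 6) ** a        # proportionality constants omitted
    D_opt = (C / 6) ** b        # note (C/6)**a * (C/6)**b = C/6, so 6*N*D = C
    return N_opt, D_opt

# Example: a 1e24-FLOP budget.
N, D = compute_optimal_split(1e24)
print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens (up to fitted constants)")
```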
The codex's base model targets the compute-optimal point for the available training budget per phase, with two deliberate adjustments (sketched in code after the list):
- Inference-cost weighting. Because the model is served continuously, a smaller-than-optimal model is preferable when inference cost dominates training cost over the deployment lifetime. The codex's models are typically 20–40% smaller than the strict compute-optimal size.
- Modality-aware token counting. Multi-modal tokens are not interchangeable. Pretraining accounts for token "exchange rates" measured empirically: an image patch token contributes approximately 0.3–0.6× as much downstream signal as a language token, depending on task.
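A sketch of how the two adjustments could compose with a compute-optimal baseline; the 30% shrink factor and the 0.45 image-patch exchange rate are illustrative midpoints of the ranges above, not published codex constants:

```python
# Sketch: apply both adjustments to a compute-optimal baseline.
# The shrink factor and exchange rates below are illustrative assumptions.

def adjusted_model_size(n_optimal: float, shrink: float = 0.30) -> float:
    """Inference-cost weighting: serve a model 20-40% below the optimum."""
    return n_optimal * (1.0 - shrink)

def effective_tokens(counts: dict[str, float]) -> float:
    """Modality-aware token counting via empirical exchange rates."""
    rates = {"text": 1.0, "image_patch": 0.45}  # 0.3-0.6x; midpoint assumed
    return sum(rates.get(modality, 1.0) * n for modality, n in counts.items())

print(adjusted_model_size(70e9))                              # ~49B params
print(effective_tokens({"text": 1e12, "image_patch": 4e11}))  # ~1.18e12 tokens
```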
Corpus
The pretraining corpus is a mixture of public and validator-contributed sources:
| Source | Share | Modalities |
|---|---|---|
| Filtered web crawl | 32% | text, image |
| Code | 14% | text |
| Books, papers, technical | 8% | text |
| Video | 18% | video, audio, captions |
| Embodied demonstrations | 12% | video, proprioception, action |
| Synthetic + curated | 16% | mixed |
The exact mixture shifts per pretraining run and is published with each checkpoint as a manifest hash. License compliance is enforced at the manifest level: every source carries a license tag and incompatible sources are excluded by the build pipeline.
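A minimal sketch of manifest-level license enforcement, assuming a JSON manifest whose entries carry `name`, `license`, and `modalities` fields; the schema and allowlist are illustrative, not the codex's published format:

```python
# Sketch: build-pipeline license filter over a corpus manifest.
# Field names and the allowlist are illustrative assumptions.
import hashlib, json

ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "mit", "apache-2.0", "public-domain"}

def build_manifest(sources: list[dict]) -> tuple[list[dict], str]:
    """Drop license-incompatible sources, then hash the surviving manifest."""
    kept = [s for s in sources if s["license"] in ALLOWED_LICENSES]
    blob = json.dumps(kept, sort_keys=True).encode()
    return kept, hashlib.sha256(blob).hexdigest()

sources = [
    {"name": "web-crawl-shard-04", "license": "cc-by", "modalities": ["text", "image"]},
    {"name": "scraped-forum-dump", "license": "unknown", "modalities": ["text"]},
]
kept, manifest_hash = build_manifest(sources)
print(len(kept), manifest_hash)  # 1 source survives; hash published with checkpoint
```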
The embodied demonstration share is the codex's unique contribution. No centralized lab has access to a comparable distributed source of embodied data, and the share grows monotonically as the validator network expands.
Objectives
Pretraining minimizes a weighted sum of objectives:

$$\mathcal{L} = \lambda_{\text{text}}\mathcal{L}_{\text{text}} + \lambda_{\text{image}}\mathcal{L}_{\text{image}} + \lambda_{\text{align}}\mathcal{L}_{\text{align}} + \lambda_{\text{video}}\mathcal{L}_{\text{video}} + \lambda_{\text{action}}\mathcal{L}_{\text{action}},$$

with:
- $\mathcal{L}_{\text{text}}$: next-token cross-entropy on text + code.
- $\mathcal{L}_{\text{image}}$: masked-patch reconstruction + contrastive.
- $\mathcal{L}_{\text{align}}$: cross-modal contrastive between vision and language latents.
- $\mathcal{L}_{\text{video}}$: next-chunk latent prediction conditioned on a learned null action.
- $\mathcal{L}_{\text{action}}$: flow-matching on embodied demonstration trajectories.
The loss weights $\lambda_i$ are tuned per phase to balance gradient norms across objectives; a Lagrangian-style auto-tuner adjusts them on the fly so that no single objective dominates.
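One way to realize such an auto-tuner is a multiplicative update that pushes each objective's weighted gradient norm toward the mean; this sketch shows the general technique, not the codex's exact rule:

```python
# Sketch: gradient-norm-balancing weight update (Lagrangian-style).
# The target (mean weighted norm) and step size eta are illustrative.

def retune_weights(weights: dict[str, float],
                   grad_norms: dict[str, float],
                   eta: float = 0.01) -> dict[str, float]:
    """Nudge each objective's weighted grad norm toward the mean share."""
    weighted = {k: weights[k] * grad_norms[k] for k in weights}
    target = sum(weighted.values()) / len(weighted)
    new = {k: w * (target / max(weighted[k], 1e-8)) ** eta
           for k, w in weights.items()}
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}  # renormalize to sum to 1

weights = {"text": 0.2, "image": 0.2, "align": 0.2, "video": 0.2, "action": 0.2}
grad_norms = {"text": 1.0, "image": 4.0, "align": 0.5, "video": 2.0, "action": 0.2}
print(retune_weights(weights, grad_norms))  # dominant objectives get damped
```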
Optimization
Pretraining uses AdamW with the standard cosine schedule:

$$\eta(t) = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right)$$

after a linear warmup over the first 2% of steps, where $t$ is the post-warmup step and $T$ the total decay horizon. Gradients are clipped to a fixed global-norm threshold. Batch size scales with compute, capped at the critical batch size for the data distribution.
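A sketch of the optimizer setup in PyTorch; the peak and floor learning rates, total step count, and clipping threshold are placeholders, while the 2% warmup fraction comes from the text:

```python
# Sketch: AdamW with linear warmup into cosine decay (PyTorch).
# lr_max, lr_min, and total_steps are illustrative placeholders.
import math
import torch

model = torch.nn.Linear(1024, 1024)      # stand-in for the base model
total_steps = 100_000
warmup_steps = int(0.02 * total_steps)   # first 2% of steps, per the text
lr_max, lr_min = 3e-4, 3e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=lr_max, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:              # linear warmup to lr_max
        return step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * t))   # decays 1 -> 0
    return (lr_min + (lr_max - lr_min) * cosine) / lr_max

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One training step (loss computation elided):
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # threshold assumed
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```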
Distributed training
Pretraining at scale is sharded with 3D parallelism: data parallel across replicas, tensor parallel within nodes, pipeline parallel across stages. The codex uses a custom topology adapter that auto-discovers the validator cluster's interconnect and chooses sharding to maximize HBM bandwidth utilization.
This is the most network-intensive computation in the system and is reserved for validators in the highest hardware tier; see Validators.
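If the topology adapter's output is modeled as a `(dp, tp, pp)` factorization of the discovered cluster, a toy version might look like the following; the intra-node/inter-node heuristic is a common default, assumed here rather than taken from the adapter itself:

```python
# Sketch: choose a 3D-parallel layout (dp, tp, pp) for a discovered cluster.
# Heuristic: tensor parallel within a node (fast links), pipeline parallel
# across nodes, data parallel over the remainder -- a common default, assumed.

def choose_layout(num_nodes: int, gpus_per_node: int,
                  pipeline_stages: int) -> tuple[int, int, int]:
    world = num_nodes * gpus_per_node
    tp = gpus_per_node                    # tensor parallel: intra-node
    pp = min(pipeline_stages, num_nodes)  # pipeline parallel: across nodes
    assert world % (tp * pp) == 0, "world size must factor into tp * pp * dp"
    dp = world // (tp * pp)               # data parallel: whatever remains
    return dp, tp, pp

print(choose_layout(num_nodes=16, gpus_per_node=8, pipeline_stages=4))
# -> (4, 8, 4): 4-way data, 8-way tensor, 4-way pipeline over 128 GPUs
```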
Checkpoints and lineage
Every pretraining run produces a sequence of checkpoints, each carrying:
- A content hash of the weights.
- A manifest hash of the training data.
- A hash chain back to the starting checkpoint (the run's lineage).
- An eval-suite score vector against the standard benchmarks.
Lineage is verifiable: any third party can re-derive the checkpoint hash from the lineage and a small set of intermediate proofs.
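A minimal sketch of lineage re-derivation, assuming each link commits to the previous hash plus that step's weight and manifest hashes; the exact commitment format is an assumption:

```python
# Sketch: derive and verify a checkpoint lineage hash chain.
# The concatenation order inside each link is an illustrative assumption.
import hashlib

def link_hash(prev: str, weights_hash: str, manifest_hash: str) -> str:
    return hashlib.sha256(f"{prev}|{weights_hash}|{manifest_hash}".encode()).hexdigest()

def derive_lineage(start: str, steps: list[tuple[str, str]]) -> str:
    """Fold (weights_hash, manifest_hash) pairs into the final checkpoint hash."""
    h = start
    for weights_hash, manifest_hash in steps:
        h = link_hash(h, weights_hash, manifest_hash)
    return h

# A third party re-derives the final hash from the published lineage:
start = hashlib.sha256(b"genesis-checkpoint").hexdigest()
steps = [("w1", "m1"), ("w2", "m2"), ("w3", "m3")]  # placeholder digests
claimed = derive_lineage(start, steps)
assert derive_lineage(start, steps) == claimed  # verification = re-derivation
```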