Pretraining

Pretraining produces the base model that downstream stages specialize. Its purpose is to absorb broad statistical regularity from large, mixed-modality corpora before any task-specific signal is applied.

Compute-optimal scaling

Resource allocation between model size and training tokens follows the compute-optimal regime. For compute budget $C$ in FLOPs, training loss $\mathcal{L}$ is modeled by the parametric fit

\mathcal{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

where $N$ is parameter count, $D$ is training token count, and $(E, A, B, \alpha, \beta)$ are fitted constants. The Chinchilla operating point, $\alpha \approx 0.34$, $\beta \approx 0.28$, implies the compute-optimal allocation

N^* \propto C^{0.5},\qquad D^* \propto C^{0.5}

i.e. compute should be split approximately evenly between scaling parameters and scaling data. Earlier work recommended a parameter-heavy split that is empirically suboptimal at fixed compute.
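As a concrete illustration, the sketch below converts a FLOP budget into an $(N^*, D^*)$ split using the exponents above together with the common $C \approx 6ND$ approximation. The proportionality constant k is a placeholder chosen to give Chinchilla-like numbers, not a fitted value from the codex.

```python
# Minimal sketch of the compute-optimal split, assuming C ~ 6*N*D and the
# C^0.5 exponents quoted above. The constant k is illustrative only.

def compute_optimal_allocation(c_flops: float, k: float = 0.087) -> tuple[float, float]:
    """Return (parameters, tokens) for a training budget in FLOPs.

    N* = k * C^0.5 and D* = C / (6 * N*): an even split in log space, so
    doubling compute scales parameters and tokens by ~sqrt(2) each.
    """
    n_star = k * c_flops ** 0.5          # parameter count
    d_star = c_flops / (6.0 * n_star)    # training tokens, from C ~ 6*N*D
    return n_star, d_star

if __name__ == "__main__":
    n, d = compute_optimal_allocation(1e24)
    print(f"N* ~ {n:.3e} params, D* ~ {d:.3e} tokens")
```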

The codex's base model targets the compute-optimal point for the available training budget per phase, with two deliberate adjustments:

  1. Inference-cost weighting. Because the model is served continuously, a smaller-than-optimal model trained on correspondingly more tokens is preferable when inference cost dominates training cost over the deployment lifetime. The codex's models are typically 20–40% smaller than the strict compute-optimal point.
  2. Modality-aware token counting. Multi-modal tokens are not interchangeable. Pretraining accounts for token "exchange rates" measured empirically: an image patch token contributes approximately 0.3–0.6× as much downstream signal as a language token, depending on task (see the sketch below).
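A minimal sketch of that token accounting, assuming a flat per-modality exchange-rate table: only the image-patch range is taken from the text above; the other rates are hypothetical placeholders, not values published with any checkpoint.

```python
# Illustrative modality-aware token counting. Only the image-patch rate
# falls in the 0.3-0.6x band cited above; the rest are assumed.

EXCHANGE_RATES = {
    "text": 1.0,
    "image_patch": 0.45,   # within the quoted 0.3-0.6x band
    "video_frame": 0.35,   # assumed
    "action": 0.6,         # assumed
}

def effective_tokens(counts: dict[str, int]) -> float:
    """Convert raw per-modality token counts into language-token equivalents."""
    return sum(EXCHANGE_RATES.get(modality, 1.0) * n for modality, n in counts.items())

# Example: 1T text tokens + 500B image patch tokens ~ 1.225T effective tokens.
print(effective_tokens({"text": 1_000_000_000_000, "image_patch": 500_000_000_000}))
```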

Corpus

The pretraining corpus is a mixture of public and validator-contributed sources:

Source                      Share   Modalities
Filtered web crawl          32%     text, image
Code                        14%     text
Books, papers, technical    8%      text
Video                       18%     video, audio, captions
Embodied demonstrations     12%     video, proprioception, action
Synthetic + curated         16%     mixed

The exact mixture shifts per pretraining run and is published with each checkpoint as a manifest hash. License compliance is enforced at the manifest level: every source carries a license tag and incompatible sources are excluded by the build pipeline.
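A sketch of how manifest-level license enforcement could look, with assumed field names and an assumed permitted-license set; the codex's actual manifest schema is not specified in this section.

```python
# Sketch of manifest-level license filtering and hashing. Field names
# ("name", "license", "share") and the permitted-license set are assumptions.

import hashlib
import json

PERMITTED_LICENSES = {"cc-by", "cc-by-sa", "mit", "apache-2.0", "public-domain"}  # assumed

def build_manifest(sources: list[dict]) -> tuple[list[dict], str]:
    """Drop sources with incompatible license tags, then hash the result.

    Returns the filtered source list and its manifest hash, i.e. the value
    published alongside each checkpoint.
    """
    kept = [s for s in sources if s.get("license") in PERMITTED_LICENSES]
    canonical = json.dumps(kept, sort_keys=True).encode("utf-8")
    return kept, hashlib.sha256(canonical).hexdigest()

sources = [
    {"name": "filtered-web-crawl", "license": "cc-by", "share": 0.32},
    {"name": "proprietary-dump", "license": "all-rights-reserved", "share": 0.05},
]
kept, manifest_hash = build_manifest(sources)   # second source is excluded
```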

The embodied demonstration share is the codex's unique contribution. No centralized lab has access to a comparable distributed source of embodied data, and the share grows monotonically as the validator network expands.

Objectives

Pretraining minimizes a weighted sum of objectives:

\mathcal{L}_{\mathrm{pre}} = \lambda_1 \mathcal{L}_{\mathrm{LM}} + \lambda_2 \mathcal{L}_{\mathrm{vision}} + \lambda_3 \mathcal{L}_{\mathrm{align}} + \lambda_4 \mathcal{L}_{\mathrm{world}} + \lambda_5 \mathcal{L}_{\mathrm{embod}}

with:

  • $\mathcal{L}_{\mathrm{LM}}$: next-token cross-entropy on text + code.
  • $\mathcal{L}_{\mathrm{vision}}$: masked-patch reconstruction + contrastive.
  • $\mathcal{L}_{\mathrm{align}}$: cross-modal contrastive between vision and language latents.
  • $\mathcal{L}_{\mathrm{world}}$: next-chunk latent prediction conditioned on a learned null action.
  • $\mathcal{L}_{\mathrm{embod}}$: flow-matching on embodied demonstration trajectories.

The loss weights $\lambda_i$ are tuned per phase to balance gradient norms across objectives; a Lagrangian-style auto-tuner adjusts them on the fly so that no single objective dominates.
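One way to realize that balancing is a multiplicative gradient-norm-equalizing update, sketched below; the update rule and step size are assumptions for illustration, not the codex's auto-tuner.

```python
# Minimal sketch of gradient-norm balancing for the loss weights, in the
# spirit of the auto-tuner described above. Update rule and step size are
# illustrative, not the codex's.

def rebalance_weights(weights: list[float],
                      grad_norms: list[float],
                      step: float = 0.01) -> list[float]:
    """Nudge each weight so that weighted gradient norms equalize.

    Objectives whose weighted gradient norm exceeds the mean are damped,
    those below the mean are boosted; weights are renormalized to sum to 1
    so no single objective can dominate the total gradient.
    """
    contrib = [w * g for w, g in zip(weights, grad_norms)]
    mean = sum(contrib) / len(contrib)
    updated = [w * (1.0 - step * (c - mean) / (mean + 1e-8))
               for w, c in zip(weights, contrib)]
    total = sum(updated)
    return [w / total for w in updated]

# Example: the LM objective's gradients dominate, so its weight drifts down.
lambdas = [0.4, 0.2, 0.15, 0.15, 0.1]
norms   = [12.0, 3.0, 2.5, 4.0, 1.5]
lambdas = rebalance_weights(lambdas, norms)
```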

Optimization

Pretraining uses AdamW with the standard cosine schedule:

\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)

after a linear warmup over the first 2% of steps. Gradients are clipped at $\|\nabla\|_2 = 1.0$. Batch size scales with compute, capped at the critical batch size for the data distribution.
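The schedule maps directly to code. In the sketch below only the 2% warmup fraction comes from the text; the peak and minimum learning rates are placeholder values.

```python
# Warmup-then-cosine schedule as defined above: linear warmup over the first
# 2% of steps, then cosine decay from eta_max to eta_min. The eta values
# below are placeholders, not the codex's settings.

import math

def learning_rate(t: int, total_steps: int,
                  eta_max: float = 3e-4, eta_min: float = 3e-5,
                  warmup_frac: float = 0.02) -> float:
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps          # linear warmup to eta_max
    # cosine decay starts where warmup ends
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))
```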

Distributed training

Pretraining at scale is sharded with 3D parallelism: data parallel across replicas, tensor parallel within nodes, pipeline parallel across stages. The codex uses a custom topology adapter that auto-discovers the validator cluster's interconnect and chooses sharding to maximize HBM bandwidth utilization.
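A toy factoring of the three parallelism degrees is sketched below; it only encodes the heuristic that tensor parallelism stays inside a node while pipeline parallelism spans node groups, and it does not model the bandwidth measurements the topology adapter actually uses.

```python
# Toy sketch of choosing 3D-parallel degrees. The codex's topology adapter
# derives this from measured interconnect bandwidth, which is not modeled here.

def choose_sharding(world_size: int, gpus_per_node: int,
                    pipeline_stages: int) -> tuple[int, int, int]:
    """Return (data, tensor, pipeline) parallel degrees.

    Tensor parallelism is kept within a node (fastest links), pipeline
    parallelism spans node groups, and whatever remains becomes data
    parallelism across replicas.
    """
    tensor = gpus_per_node
    pipeline = pipeline_stages
    assert world_size % (tensor * pipeline) == 0, "degrees must divide world size"
    data = world_size // (tensor * pipeline)
    return data, tensor, pipeline

# Example: 512 GPUs, 8 per node, 8 pipeline stages -> 8-way data parallelism.
print(choose_sharding(512, 8, 8))
```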

This is the most network-intensive computation in the system and is reserved for validators in the highest hardware tier; see Validators.

Checkpoints and lineage

Every pretraining run produces a sequence of checkpoints, each carrying:

  • A content hash of the weights.
  • A manifest hash of the training data.
  • A hash chain back to the starting checkpoint (the run's lineage).
  • An eval-suite score vector against the standard benchmarks.

Lineage is verifiable: any third party can re-derive the checkpoint hash from the lineage and a small set of intermediate proofs.
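A minimal sketch of such a hash chain, with assumed field names: each link commits to the weights hash, the data manifest hash, and the previous link, so a verifier can re-derive the final checkpoint hash by replaying the chain. The real proof format is not specified in this section.

```python
# Sketch of a checkpoint lineage hash chain. Entry field names are assumed.

import hashlib

def chain_step(prev_hash: str, weights_hash: str, manifest_hash: str) -> str:
    """Hash of one checkpoint's lineage entry, committed to the previous link."""
    payload = f"{prev_hash}:{weights_hash}:{manifest_hash}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def verify_lineage(start_hash: str, entries: list[dict], claimed: str) -> bool:
    """Re-derive the final checkpoint hash from the starting point and entries."""
    h = start_hash
    for e in entries:
        h = chain_step(h, e["weights_hash"], e["manifest_hash"])
    return h == claimed
```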