Pretraining

Pretraining produces the base model that downstream stages specialize. Its purpose is to absorb broad statistical regularity from large, mixed-modality corpora before any task-specific signal is applied.

Compute-optimal scaling

Resource allocation between model size and training tokens follows the compute-optimal regime. For compute budget $C$ in FLOPs, training loss $\mathcal{L}$ is modeled by the parametric fit

\mathcal{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

where $N$ is parameter count, $D$ is training token count, and $(E, A, B, \alpha, \beta)$ are fitted constants. The Chinchilla operating point, $\alpha \approx 0.34$, $\beta \approx 0.28$, implies the compute-optimal allocation

N^* \propto C^{0.5},\qquad D^* \propto C^{0.5}

i.e. compute should be split approximately evenly between scaling parameters and scaling data. Earlier work recommended a parameter-heavy split that is empirically suboptimal at fixed compute.
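As a concrete illustration, the sketch below converts a FLOP budget into an $(N^*, D^*)$ split using the exponents above together with the common $C \approx 6ND$ approximation. The proportionality constant k is a placeholder chosen to give Chinchilla-like numbers, not a fitted value from the codex.

```python
# Minimal sketch of the compute-optimal split, assuming C ~ 6*N*D and the
# C^0.5 exponents quoted above. The constant k is illustrative only.

def compute_optimal_allocation(c_flops: float, k: float = 0.087) -> tuple[float, float]:
    """Return (parameters, tokens) for a training budget in FLOPs.

    N* = k * C^0.5 and D* = C / (6 * N*): an even split in log space, so
    doubling compute scales parameters and tokens by ~sqrt(2) each.
    """
    n_star = k * c_flops ** 0.5          # parameter count
    d_star = c_flops / (6.0 * n_star)    # training tokens, from C ~ 6*N*D
    return n_star, d_star

if __name__ == "__main__":
    n, d = compute_optimal_allocation(1e24)
    print(f"N* ~ {n:.3e} params, D* ~ {d:.3e} tokens")
```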

The codex's base model targets the compute-optimal point for the available training budget per phase, with two deliberate adjustments:

  1. Inference-cost weighting. Because the model is served continuously, a smaller-than-optimal model trained on correspondingly more tokens is preferable when inference cost dominates training cost over the deployment lifetime. The codex's models are typically 20–40% smaller than the strict compute-optimal point.
  2. Modality-aware token counting. Multi-modal tokens are not interchangeable. Pretraining accounts for token "exchange rates" measured empirically: an image patch token contributes approximately 0.3–0.6× as much downstream signal as a language token, depending on task (see the sketch below).
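A minimal sketch of that token accounting, assuming a flat per-modality exchange-rate table: only the image-patch range is taken from the text above; the other rates are hypothetical placeholders, not values published with any checkpoint.

```python
# Illustrative modality-aware token counting. Only the image-patch rate
# falls in the 0.3-0.6x band cited above; the rest are assumed.

EXCHANGE_RATES = {
    "text": 1.0,
    "image_patch": 0.45,   # within the quoted 0.3-0.6x band
    "video_frame": 0.35,   # assumed
    "action": 0.6,         # assumed
}

def effective_tokens(counts: dict[str, int]) -> float:
    """Convert raw per-modality token counts into language-token equivalents."""
    return sum(EXCHANGE_RATES.get(modality, 1.0) * n for modality, n in counts.items())

# Example: 1T text tokens + 500B image patch tokens ~ 1.225T effective tokens.
print(effective_tokens({"text": 1_000_000_000_000, "image_patch": 500_000_000_000}))
```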

Corpus

The pretraining corpus is a mixture of public and validator-contributed sources:

Source                      Share   Modalities
Filtered web crawl          32%     text, image
Code                        14%     text
Books, papers, technical    8%      text
Video                       18%     video, audio, captions
Embodied demonstrations     12%     video, proprioception, action
Synthetic + curated         16%     mixed

The exact mixture shifts per pretraining run and is published with each checkpoint as a manifest hash. License compliance is enforced at the manifest level: every source carries a license tag and incompatible sources are excluded by the build pipeline.
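A sketch of how manifest-level license enforcement could look, with assumed field names and an assumed permitted-license set; the codex's actual manifest schema is not specified in this section.

```python
# Sketch of manifest-level license filtering and hashing. Field names
# ("name", "license", "share") and the permitted-license set are assumptions.

import hashlib
import json

PERMITTED_LICENSES = {"cc-by", "cc-by-sa", "mit", "apache-2.0", "public-domain"}  # assumed

def build_manifest(sources: list[dict]) -> tuple[list[dict], str]:
    """Drop sources with incompatible license tags, then hash the result.

    Returns the filtered source list and its manifest hash, i.e. the value
    published alongside each checkpoint.
    """
    kept = [s for s in sources if s.get("license") in PERMITTED_LICENSES]
    canonical = json.dumps(kept, sort_keys=True).encode("utf-8")
    return kept, hashlib.sha256(canonical).hexdigest()

sources = [
    {"name": "filtered-web-crawl", "license": "cc-by", "share": 0.32},
    {"name": "proprietary-dump", "license": "all-rights-reserved", "share": 0.05},
]
kept, manifest_hash = build_manifest(sources)   # second source is excluded
```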

The embodied demonstration share is the codex's unique contribution. No centralized lab has access to a comparable distributed source of embodied data, and the share grows monotonically as the validator network expands.

Objectives

Pretraining minimizes a weighted sum of objectives:

\mathcal{L}_{\mathrm{pre}} = \lambda_1 \mathcal{L}_{\mathrm{LM}} + \lambda_2 \mathcal{L}_{\mathrm{vision}} + \lambda_3 \mathcal{L}_{\mathrm{align}} + \lambda_4 \mathcal{L}_{\mathrm{world}} + \lambda_5 \mathcal{L}_{\mathrm{embod}}

with:

  • $\mathcal{L}_{\mathrm{LM}}$: next-token cross-entropy on text + code.
  • $\mathcal{L}_{\mathrm{vision}}$: masked-patch reconstruction + contrastive.
  • $\mathcal{L}_{\mathrm{align}}$: cross-modal contrastive between vision and language latents.
  • $\mathcal{L}_{\mathrm{world}}$: next-chunk latent prediction conditioned on a learned null action.
  • $\mathcal{L}_{\mathrm{embod}}$: flow-matching on embodied demonstration trajectories.

The loss weights $\lambda_i$ are tuned per phase to balance gradient norms across objectives; a Lagrangian-style auto-tuner adjusts them on the fly so that no single objective dominates.
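One way to realize that balancing is a multiplicative gradient-norm-equalizing update, sketched below; the update rule and step size are assumptions for illustration, not the codex's auto-tuner.

```python
# Minimal sketch of gradient-norm balancing for the loss weights, in the
# spirit of the auto-tuner described above. Update rule and step size are
# illustrative, not the codex's.

def rebalance_weights(weights: list[float],
                      grad_norms: list[float],
                      step: float = 0.01) -> list[float]:
    """Nudge each weight so that weighted gradient norms equalize.

    Objectives whose weighted gradient norm exceeds the mean are damped,
    those below the mean are boosted; weights are renormalized to sum to 1
    so no single objective can dominate the total gradient.
    """
    contrib = [w * g for w, g in zip(weights, grad_norms)]
    mean = sum(contrib) / len(contrib)
    updated = [w * (1.0 - step * (c - mean) / (mean + 1e-8))
               for w, c in zip(weights, contrib)]
    total = sum(updated)
    return [w / total for w in updated]

# Example: the LM objective's gradients dominate, so its weight drifts down.
lambdas = [0.4, 0.2, 0.15, 0.15, 0.1]
norms   = [12.0, 3.0, 2.5, 4.0, 1.5]
lambdas = rebalance_weights(lambdas, norms)
```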

Optimization

Pretraining uses AdamW with the standard cosine schedule:

\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)

after a linear warmup over the first 2% of steps. Gradients are clipped at $\|\nabla\|_2 = 1.0$. Batch size scales with compute, capped at the critical batch size for the data distribution.
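The schedule maps directly to code. In the sketch below only the 2% warmup fraction comes from the text; the peak and minimum learning rates are placeholder values.

```python
# Warmup-then-cosine schedule as defined above: linear warmup over the first
# 2% of steps, then cosine decay from eta_max to eta_min. The eta values
# below are placeholders, not the codex's settings.

import math

def learning_rate(t: int, total_steps: int,
                  eta_max: float = 3e-4, eta_min: float = 3e-5,
                  warmup_frac: float = 0.02) -> float:
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps          # linear warmup to eta_max
    # cosine decay starts where warmup ends
    progress = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))
```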

Distributed training

Pretraining at scale is sharded with 3D parallelism: data parallel across replicas, tensor parallel within nodes, pipeline parallel across stages. The codex uses a custom topology adapter that auto-discovers the validator cluster's interconnect and chooses sharding to maximize HBM bandwidth utilization.
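A toy factoring of the three parallelism degrees is sketched below; it only encodes the heuristic that tensor parallelism stays inside a node while pipeline parallelism spans node groups, and it does not model the bandwidth measurements the topology adapter actually uses.

```python
# Toy sketch of choosing 3D-parallel degrees. The codex's topology adapter
# derives this from measured interconnect bandwidth, which is not modeled here.

def choose_sharding(world_size: int, gpus_per_node: int,
                    pipeline_stages: int) -> tuple[int, int, int]:
    """Return (data, tensor, pipeline) parallel degrees.

    Tensor parallelism is kept within a node (fastest links), pipeline
    parallelism spans node groups, and whatever remains becomes data
    parallelism across replicas.
    """
    tensor = gpus_per_node
    pipeline = pipeline_stages
    assert world_size % (tensor * pipeline) == 0, "degrees must divide world size"
    data = world_size // (tensor * pipeline)
    return data, tensor, pipeline

# Example: 512 GPUs, 8 per node, 8 pipeline stages -> 8-way data parallelism.
print(choose_sharding(512, 8, 8))
```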

This is the most network-intensive computation in the system and is reserved for validators in the highest hardware tier; see Validators.

Checkpoints and lineage

Every pretraining run produces a sequence of checkpoints, each carrying:

  • A content hash of the weights.
  • A manifest hash of the training data.
  • A hash chain back to the starting checkpoint (the run's lineage).
  • An eval-suite score vector against the standard benchmarks.

Lineage is verifiable: any third party can re-derive the checkpoint hash from the lineage and a small set of intermediate proofs.
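A minimal sketch of such a hash chain, with assumed field names: each link commits to the weights hash, the data manifest hash, and the previous link, so a verifier can re-derive the final checkpoint hash by replaying the chain. The real proof format is not specified in this section.

```python
# Sketch of a checkpoint lineage hash chain. Entry field names are assumed.

import hashlib

def chain_step(prev_hash: str, weights_hash: str, manifest_hash: str) -> str:
    """Hash of one checkpoint's lineage entry, committed to the previous link."""
    payload = f"{prev_hash}:{weights_hash}:{manifest_hash}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def verify_lineage(start_hash: str, entries: list[dict], claimed: str) -> bool:
    """Re-derive the final checkpoint hash from the starting point and entries."""
    h = start_hash
    for e in entries:
        h = chain_step(h, e["weights_hash"], e["manifest_hash"])
    return h == claimed
```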