Alignment

Alignment is the set of techniques and constraints that shape model behavior toward intended objectives, where "intended" is a governance variable rather than a foundational assumption.

The alignment object

The codex defines alignment as the constraint that, for any input x in the deployment distribution and any output y produced by the model,

y \in \mathcal{A}(x)

where \mathcal{A}(x) is the acceptance set, the set of outputs consistent with the policy's declared values and constraints. The acceptance set is not a fixed function; it is a published artifact updated by governance.

Alignment is therefore not a single training procedure. It is a triple: (declared values, training procedure, deployment-time enforcement). The codex specifies each.
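
As a minimal sketch (not a codex-prescribed API), the triple can be represented as a data structure whose enforcement component is a predicate over the acceptance set; all names below are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative types only; the codex does not prescribe this API.
AcceptanceSet = Callable[[str, str], bool]  # (input x, output y) -> is y in A(x)?

@dataclass
class AlignmentSpec:
    """The alignment triple: declared values, training procedure, enforcement."""
    declared_values: dict        # published, governance-controlled artifact
    training_procedure: str      # e.g. "SFT -> DPO -> constitutional self-critique"
    enforce: AcceptanceSet       # deployment-time check that y is in A(x)

def is_aligned_output(spec: AlignmentSpec, x: str, y: str) -> bool:
    # The alignment object: the output must fall inside the acceptance set A(x).
    return spec.enforce(x, y)
```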

Declared values

Values are declared as a structured document of constraints and preferences:

  • Constraints are inviolable: the model refuses to act outside them, even at the cost of task failure. Example: "Do not produce content optimized for biological weapons synthesis."
  • Preferences are soft: the model prefers outputs within them but may produce others under sufficient countervailing reason. Example: "Prefer outputs that decline rather than fabricate when uncertain."

The values document is on-chain. Changes require a constitutional-class governance proposal.
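
A minimal sketch of how such a values document might be represented, assuming the constraint/preference split above; the schema and field names are illustrative, not the on-chain format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Constraint:
    """Inviolable: the model refuses rather than act outside it, even at task-failure cost."""
    text: str

@dataclass(frozen=True)
class Preference:
    """Soft: preferred, but overridable under sufficient countervailing reason."""
    text: str

@dataclass
class ValuesDocument:
    version: int
    constraints: List[Constraint] = field(default_factory=list)
    preferences: List[Preference] = field(default_factory=list)

# The two examples from the list above.
values = ValuesDocument(
    version=1,
    constraints=[Constraint("Do not produce content optimized for biological weapons synthesis.")],
    preferences=[Preference("Prefer outputs that decline rather than fabricate when uncertain.")],
)
```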

Training procedures

Alignment training proceeds in three stages.

Stage 1: Supervised fine-tuning (SFT). The base model is fine-tuned on demonstrations of aligned behavior. Loss is standard cross-entropy:

\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{align}}}\Big[\log \pi_\theta(y \mid x)\Big].
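
A minimal PyTorch sketch of this loss, assuming next-token logits and padded label sequences (the tensor shapes and pad_id convention are assumptions, not codex specifications):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Token-level cross-entropy over demonstration targets (L_SFT).

    logits:     (batch, seq_len, vocab) from pi_theta on the demonstration y
    target_ids: (batch, seq_len) next-token labels; pad_id positions are ignored
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )
```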

Stage 2: Reward-model-free preference optimization (DPO). The codex uses direct preference optimization rather than RLHF-with-reward-model. Given preference pairs (x, y_w, y_l):

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right].

DPO avoids training a separate reward model and the policy-drift instabilities of RLHF, at the cost of requiring preference data that better covers the policy's actual output distribution. Both DPO and RLHF have been validated for the codex's regime; DPO is preferred for its operational simplicity and its tractable theoretical guarantees.
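
A minimal PyTorch sketch of the DPO objective, assuming sequence-level log-probabilities have already been computed for the policy and the frozen reference model (the function signature is illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct preference optimization loss over sequence-level log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log sigma(beta * log-ratio(y_w) - beta * log-ratio(y_l)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```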

Stage 3: Constitutional self-critique. The model evaluates its own outputs against the values document and, when self-critique flags a violation, regenerates. The critique is applied at training time as an auxiliary loss and at inference time as a configurable filter.
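
A minimal sketch of the inference-time filter, assuming a model object that exposes generate and critique hooks; these interfaces are assumptions for illustration, not codex APIs:

```python
def generate_with_self_critique(model, values_doc, prompt, max_attempts: int = 3):
    """Inference-time form of Stage 3: regenerate while self-critique flags a violation.

    model.generate and model.critique are assumed interfaces, not codex APIs;
    critique(output, values_doc) returns True when the output violates the values document.
    """
    output = model.generate(prompt)
    for _ in range(max_attempts - 1):
        if not model.critique(output, values_doc):
            return output                     # no violation flagged: emit as-is
        output = model.generate(prompt)       # resample after a flagged violation
    return output  # final attempt; downstream deployment checks may still refuse it
```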

Deployment-time enforcement

Training alone cannot guarantee alignment. The codex layers three deployment-time checks:

  1. Pre-emission filter. Output candidates are scored against a small (≤ 1B param) alignment classifier; candidates below threshold are resampled or refused.
  2. Surface-specific constraints. Each surface enforces its own constraints (e.g., the embodiment surface refuses motor commands that exceed declared force limits).
  3. Audit logging. A configurable fraction of outputs are logged with their full latent state for post-hoc review.

These checks are observable, configurable, and themselves subject to governance.
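
As a sketch of check 1, the pre-emission filter might look like the following, where classifier_score and resample stand in for the alignment classifier and the model's sampler (both assumed interfaces, not codex APIs):

```python
def pre_emission_filter(candidates, classifier_score, resample,
                        threshold: float, max_resamples: int = 2):
    """Deployment check 1: score candidates with the alignment classifier,
    resample when none clears the threshold, refuse if resampling is exhausted.

    classifier_score(candidate) -> float and resample() -> list of candidates
    are illustrative callables.
    """
    for _ in range(max_resamples + 1):
        passing = [c for c in candidates if classifier_score(c) >= threshold]
        if passing:
            return max(passing, key=classifier_score)  # emit the best passing candidate
        candidates = resample()  # draw fresh candidates from the model
    return None  # refusal: no candidate cleared the alignment threshold
```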

Why DPO over RLHF

Two reasons.

Stability. RLHF optimizes a policy against a learned reward model. As the policy improves, it can exploit reward-model errors (reward hacking). DPO trains the policy directly against the preference data, removing the intermediate optimization target.

Decentralization. A reward model is a piece of infrastructure that must be hosted, served, and kept in sync. DPO requires only the preference dataset, which is content-addressed.
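
Content addressing can be as simple as hashing a canonical serialization of the preference pairs; the sketch below assumes (x, y_w, y_l) string triples and SHA-256, neither of which is specified by the codex:

```python
import hashlib
import json

def content_address(preference_pairs) -> str:
    """Identify the preference dataset by the hash of a canonical serialization,
    so there is no reward-model server to host, serve, or keep in sync.

    preference_pairs: iterable of (x, y_w, y_l) string triples (illustrative format).
    """
    canonical = json.dumps(sorted(preference_pairs), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```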

Verification

Alignment claims are verifiable:

  • Constraint coverage. A frozen suite of red-team prompts is run against every alignment-trained checkpoint; the failure rate is published.
  • Preference adherence. A held-out preference set measures DPO's preference satisfaction rate.
  • Refusal calibration. A test set with known-acceptable and known-unacceptable prompts measures the rate of incorrect refusal and incorrect compliance separately.

A checkpoint that regresses on any of these is not promoted to the active routing table.
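
A sketch of the promotion gate implied by these checks, with illustrative metric names; the codex publishes the numbers but does not prescribe this code:

```python
def may_promote(candidate: dict, active: dict) -> bool:
    """A checkpoint that regresses on any published metric is not promoted
    to the active routing table. Metric names are illustrative.
    Lower is better for failure and error rates; higher is better for satisfaction.
    """
    return (
        candidate["constraint_failure_rate"] <= active["constraint_failure_rate"]
        and candidate["preference_satisfaction_rate"] >= active["preference_satisfaction_rate"]
        and candidate["incorrect_refusal_rate"] <= active["incorrect_refusal_rate"]
        and candidate["incorrect_compliance_rate"] <= active["incorrect_compliance_rate"]
    )
```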

Limits

The codex does not claim that alignment is solved by these techniques. It claims that they are the current best-available techniques and that the network is engineered to integrate stronger techniques as they are developed. Specifically:

  • Alignment of capabilities the model does not yet have (e.g., long-horizon planning at superhuman scale) is unaddressed by SFT + DPO + constitutional methods. The codex's response is conservative: deployment is restricted to capabilities for which alignment has been empirically validated.
  • Alignment under distribution shift remains an open problem.

These limits are stated in the values document and referenced by the deployment-time enforcement layer.