Theoretical Foundations
This appendix collects the formal derivations that the main chapters reference but do not derive. The proofs are sketches sufficient for verification, not full treatments; full constructions are within the codex's body of work and reproducible from the definitions given.
A.1 Flow matching as a generative objective
Claim. Training a neural network $v_\theta(x_t, t)$ to predict the conditional velocity field $x_1 - x_0$ along the interpolation $x_t = (1 - t)\,x_0 + t\,x_1$, with $x_0 \sim p_0$ (the noise prior) and $x_1 \sim p_{\mathrm{data}}$, under the loss

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big[\,\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\,\big]$$

yields a model whose ODE trajectories $\dot{x}_t = v_\theta(x_t, t)$ transport the noise prior to the data distribution.
Sketch. Define the marginal velocity $u_t(x) = \mathbb{E}[\,x_1 - x_0 \mid x_t = x\,]$. Solving $\dot{x}_t = u_t(x_t)$ from $x_0 \sim p_0$ at $t = 0$ produces $x_1 \sim p_{\mathrm{data}}$ at $t = 1$ by construction. The flow-matching loss is the conditional-expectation regression target for $u_t$; minimizing it drives $v_\theta \to u_t$ in the $L^2$ sense, and the resulting ODE is the desired transport.
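A minimal training-loss sketch of this regression, assuming the linear interpolation path above and a velocity network exposed as a callable `v_theta(x, t)` (a hypothetical name, not the codex's implementation):

```python
import torch

def flow_matching_loss(v_theta, x1):
    """Conditional flow-matching regression along x_t = (1 - t) x0 + t x1:
    the network is trained to predict the conditional velocity x1 - x0."""
    x0 = torch.randn_like(x1)                      # sample from the noise prior
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over feature dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the conditional path
    target = x1 - x0                               # conditional velocity
    return ((v_theta(xt, t) - target) ** 2).mean()
```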
Consequence for the codex. The action head of the embodiment surface is a flow-matching model. The ODE solver at inference is Euler with four steps; the residual error is bounded by the step size and the Lipschitz constant of $v_\theta$, both of which are observable in training logs.
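The four-step Euler transport at inference reduces to the following sketch; `v_theta` is again the learned velocity network, and the step count is the one stated above:

```python
import torch

def euler_sample(v_theta, x0, num_steps=4):
    """Integrate dx/dt = v_theta(x, t) from t = 0 (noise) to t = 1 (data)
    with fixed-step Euler; the error scales with the step size and the
    Lipschitz constant of v_theta."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * v_theta(x, t)
    return x
```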
A.2 Scaling-law compute optimum
Claim. Under a parametric loss model

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with parameter count $N$, training tokens $D$, and compute $C = \kappa N D$ for a fixed FLOP-per-token constant $\kappa$, the compute-optimal allocation satisfies

$$\frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}}, \qquad \text{equivalently} \qquad N_{\mathrm{opt}} \propto C^{\beta/(\alpha+\beta)}, \quad D_{\mathrm{opt}} \propto C^{\alpha/(\alpha+\beta)}.$$

For $\alpha = \beta$, this reduces to $N_{\mathrm{opt}} \propto C^{1/2}$, $D_{\mathrm{opt}} \propto C^{1/2}$.
Sketch. Lagrange multipliers on $L(N, D)$ subject to $C = \kappa N D$ for the fixed FLOP-per-token constant $\kappa$. Setting $\partial L / \partial N = 0$ and $\partial L / \partial D = 0$ under the constraint, dividing the two first-order conditions, and rearranging produces the displayed condition.
Consequence for the codex. Pretraining allocates compute near the optimum, adjusted downward in parameter count to reflect the codex's commitment to long inference lifetimes; the lifetime adjustment is derived in the deployment-cost annex.
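A closed-form sketch of the allocation rule, assuming the loss model and $C = \kappa N D$ accounting above; the coefficient values passed in are illustrative placeholders, not the codex's fitted constants:

```python
def compute_optimal_allocation(C, A, B, alpha, beta, kappa=6.0):
    """Solve alpha*A/N^alpha = beta*B/D^beta subject to C = kappa*N*D.
    Substituting D = C/(kappa*N) gives N^(alpha+beta) = (alpha*A/(beta*B)) * (C/kappa)^beta."""
    N = ((alpha * A / (beta * B)) * (C / kappa) ** beta) ** (1.0 / (alpha + beta))
    D = C / (kappa * N)
    return N, D

# For alpha == beta the exponents collapse to 1/2: N and D both scale as C ** 0.5.
```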
A.3 PPO clipped surrogate, monotonic improvement
Claim. The clipped surrogate

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

provides a lower bound on the trust-region objective of TRPO, and maximizing it produces approximately monotonic policy improvement.
Sketch. For $\hat{A}_t > 0$, the surrogate is bounded above by $(1+\epsilon)\hat{A}_t$, removing the incentive to push the importance ratio beyond $1+\epsilon$. For $\hat{A}_t < 0$, the surrogate is symmetrically bounded above by $(1-\epsilon)\hat{A}_t$, removing the incentive to push the ratio below $1-\epsilon$. The result is a first-order approximation to the trust-region objective with a soft clipping penalty replacing the explicit KL constraint; approximate monotonic improvement carries over from the trust-region setting to PPO under this bound.
Consequence for the codex. Locomotion and residual-RL training in Manipulation use PPO with this clipped surrogate.
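A minimal sketch of the clipped surrogate as a loss term, assuming per-transition log-probabilities and advantage estimates are already computed; the default `eps=0.2` is illustrative, not the codex's setting:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate -L^CLIP over a batch of transitions."""
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # minimize the negative surrogate
```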
A.4 DPO equivalence to RLHF
Claim. The RLHF objective

$$\max_{\pi}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$

has a closed-form optimal policy

$$\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big).$$

Inverting this for $r(x, y)$, substituting into the Bradley-Terry preference model, and cancelling the partition function $Z(x)$ yields the DPO loss

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$

with no explicit reward model.
Sketch. Solving the KL-regularized maximization yields the exponentially tilted policy above. Bradley-Terry preferences $p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$ then yield the DPO loss after the rewards are expressed in terms of the optimal-policy log-ratios, under which $Z(x)$ cancels.
Consequence for the codex. Alignment training in Alignment uses DPO as the dominant preference-tuning loss.
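A minimal sketch of the DPO loss, assuming summed token log-probabilities of the chosen ($y_w$) and rejected ($y_l$) completions under the policy and the frozen reference; `beta=0.1` is illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * (policy log-ratio - reference log-ratio)), averaged over preference pairs."""
    policy_logratio = policy_logp_w - policy_logp_l
    ref_logratio = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```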
A.5 Honesty condition for cryptoeconomic validation
Claim. Suppose a validator can choose to cheat (produce a false output) at gain $g$. Suppose cheating is detected with probability $p$ and detection produces a slash of $S$. Suppose the validator's discount factor is $\delta$ and per-period legitimate income is $w$. Then, when detection also terminates participation, the validator's strict best response is honest iff

$$g < p\,S + \frac{p\,\delta\, w}{1 - \delta}.$$
Sketch. Compare two strategies. Honest: receive $w$ forever, value $\frac{w}{1-\delta}$. Cheat once then return to honest: receive $w + g$ now, suffer expected slash $pS$ now, and (if slashing terminates participation) lose the future stream $\delta\,\frac{w}{1-\delta}$ with probability $p$. The honest strategy strictly dominates iff

$$\frac{w}{1-\delta} > w + g - p S + (1 - p)\,\delta\,\frac{w}{1-\delta},$$

which simplifies under termination to the displayed condition. Without termination, the condition reduces to $g < pS$.
Consequence for the codex. Validator stake and audit probability are jointly set so the displayed condition holds with margin against the largest plausible $g$ for each surface. See Validators.
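A small sketch of that margin check, assuming termination on detection and the symbols defined in the claim; the function name is hypothetical:

```python
def honesty_margin(g, p, S, w, delta):
    """Slack in g < p*S + p*delta*w/(1 - delta); positive slack means
    honesty is a strict best response under termination on detection."""
    deterrent = p * S + p * delta * w / (1.0 - delta)
    return deterrent - g

# Stake S and audit probability p are tuned so the margin stays positive
# against the largest plausible cheating gain g.
```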
A.6 Catastrophic forgetting bound under elastic weight consolidation
Claim. Under the quadratic penalty

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2}\sum_i F_i\,\big(\theta_i - \theta^{*}_{A,i}\big)^2$$

with $F$ the diagonal of the empirical Fisher information at $\theta^{*}_A$, the expected post-update loss on task $A$ is bounded, to second order, by

$$\mathcal{L}_A(\theta) \le \mathcal{L}_A(\theta^{*}_A) + \frac{1}{\lambda}\Big(\mathcal{L}_B(\theta^{*}_A) - \mathcal{L}_B(\theta)\Big).$$
Sketch. Second-order Taylor expansion of $\mathcal{L}_A$ around $\theta^{*}_A$: the gradient vanishes at the optimum and the Hessian is approximated by the Fisher information $F$ for a probabilistic model under regularity conditions, so $\mathcal{L}_A(\theta) \approx \mathcal{L}_A(\theta^{*}_A) + \frac{1}{2}\sum_i F_i(\theta_i - \theta^{*}_{A,i})^2$. At the minimizer of the penalized objective, the penalty term is bounded by the improvement on task $B$, i.e. $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta^{*}_{A,i})^2 \le \mathcal{L}_B(\theta^{*}_A) - \mathcal{L}_B(\theta)$. Substituting this bound on the Fisher-weighted displacement into the Taylor expansion yields the displayed bound.
Consequence for the codex. EWC is the dominant continual-learning regularizer in Continual Learning. The bound makes the trade-off explicit: increasing $\lambda$ tightens the bound on task-$A$ forgetting at the cost of a looser fit to the new task $B$.
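A minimal sketch of the penalty term, assuming the diagonal Fisher and the task-$A$ optimum are stored as dicts keyed by parameter name; the names and calling convention are illustrative:

```python
import torch

def ewc_penalty(model, fisher_diag, theta_star, lam):
    """Quadratic EWC penalty (lam / 2) * sum_i F_i * (theta_i - theta*_A,i)^2."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (param - theta_star[name]) ** 2).sum()
    return 0.5 * lam * penalty
```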
A.7 Attention complexity
Standard result. Multi-head attention on a sequence of length $n$ with model dimension $d$ has time complexity $O(n^2 d)$ and memory complexity $O(n^2)$ for the attention weights.
Consequence for the codex. The fixed-count cognition-token bottleneck in VLA Architecture reduces the cross-stream attention from $O(n^2)$ in the latent length $n$ to $O(nk)$, where $k$ is the cognition-token count. For $k \ll n$, this is approximately an $n/k$ reduction in attention FLOPs.
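The reduction factor is just the ratio of attention-score terms; a one-line sketch, with the example values in the comment chosen for illustration only:

```python
def cross_attention_flop_ratio(n, k):
    """Full n-by-n cross-stream attention cost over the n-by-k bottleneck,
    ignoring the shared model-dimension factor d."""
    return (n * n) / (n * k)   # = n / k

# Illustration only: n = 1024 latent tokens and k = 64 cognition tokens
# gives roughly a 16x reduction in cross-stream attention FLOPs.
```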