
Theoretical Foundations

This appendix collects the formal derivations that the main chapters reference but do not derive. The proofs are sketches sufficient for verification rather than full treatments; the full constructions appear in the codex's body of work and are reproducible from the definitions given.

A.1 Flow matching as a generative objective

Claim. Training a neural network $\hat{v}_\theta(x_t, t)$ to predict the conditional velocity field

$$v_t(x \mid x_1) = \frac{d}{dt}\big[(1-t)\,x_0 + t\,x_1\big] = x_1 - x_0$$

under the loss

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\big\|\hat{v}_\theta(x_t, t) - (x_1 - x_0)\big\|^2$$

yields a model whose ODE trajectories transport the noise prior to the data distribution.

Sketch. Define the marginal velocity $u_t(x) = \mathbb{E}[v_t(x \mid x_1) \mid x_t = x]$. This field satisfies the continuity equation for the marginal path $p_t$ induced by the interpolant, so solving $dx/dt = u_t(x)$ from $x_0 \sim p_0$ at $t=0$ produces $x_1 \sim p_1$ at $t=1$. The flow-matching loss is the conditional-expectation regression objective for $u_t$; minimizing it drives $\hat{v}_\theta \to u_t$ in the $L^2$ sense, and the resulting ODE is the desired transport.
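As a concrete reading of the regression target, a minimal training-loss sketch follows; the `velocity_model` callable, the Gaussian prior, and the tensor shapes are illustrative assumptions rather than the codex's implementation.

```python
import torch

def flow_matching_loss(velocity_model, x1):
    """Conditional flow-matching loss for a batch of data samples x1 of shape (batch, dim).

    Draws x0 from a standard Gaussian prior and t uniformly on [0, 1]; the
    regression target is the constant conditional velocity x1 - x0.
    """
    x0 = torch.randn_like(x1)          # noise prior p0
    t = torch.rand(x1.shape[0], 1)     # one t per sample, broadcast over dim
    x_t = (1.0 - t) * x0 + t * x1      # linear interpolant
    target = x1 - x0                   # v_t(x | x1)
    pred = velocity_model(x_t, t)
    return ((pred - target) ** 2).mean()
```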

Consequence for the codex. The action head of the embodiment surface is a flow-matching model. The ODE solver at inference is Euler with four steps; the residual error is bounded by the step size and the Lipschitz constant of $\hat{v}_\theta$, both of which are observable in training logs.
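A sketch of the four-step Euler integration described above; the call signature of the learned field and the uniform step size are assumptions.

```python
import torch

@torch.no_grad()
def sample_euler(velocity_model, x0, num_steps=4):
    """Integrate dx/dt = v_theta(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_model(x, t)   # local error scales with dt^2 and the Lipschitz constant
    return x                                # approximate sample from p1
```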

A.2 Scaling-law compute optimum

Claim. Under a parametric loss model

$$\mathcal{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

with $\alpha, \beta > 0$ and compute $C \propto N D$, the compute-optimal allocation satisfies

$$\frac{\alpha A}{(N^*)^{\alpha}} = \frac{\beta B}{(D^*)^{\beta}}, \qquad \text{equivalently} \qquad \frac{(N^*)^{\alpha}}{(D^*)^{\beta}} = \frac{\alpha A}{\beta B}.$$

For $\alpha \approx \beta$, this reduces to $N^* \propto C^{1/2}$, $D^* \propto C^{1/2}$.

Sketch. Apply Lagrange multipliers to $\min \mathcal{L}(N, D)$ subject to $N D = C/k$ for a fixed FLOP-per-token constant $k$. Setting $\partial_N \mathcal{L} = \mu D$ and $\partial_D \mathcal{L} = \mu N$ and dividing gives $\alpha A N^{-\alpha-1} / (\beta B D^{-\beta-1}) = D/N$, which rearranges to the displayed condition.
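A numeric sketch of the allocation follows, using the closed form obtained by substituting $N = (C/k)/D$ into the condition; the constants $A$, $B$, $\alpha$, $\beta$, and $k$ below are illustrative assumptions, not the codex's fitted values.

```python
def compute_optimal_allocation(C, A, B, alpha, beta, k=6.0):
    """Split a FLOP budget C between parameters N and tokens D.

    Solves alpha*A/N**alpha == beta*B/D**beta subject to k*N*D == C,
    giving D* ~ C**(alpha/(alpha+beta)) and N* ~ C**(beta/(alpha+beta)).
    """
    budget = C / k                                   # N * D
    D = ((beta * B / (alpha * A)) ** (1.0 / (alpha + beta))
         * budget ** (alpha / (alpha + beta)))
    N = budget / D
    return N, D

# With alpha close to beta, quadrupling C roughly doubles both N* and D*.
print(compute_optimal_allocation(1e21, A=400.0, B=400.0, alpha=0.34, beta=0.28))
print(compute_optimal_allocation(4e21, A=400.0, B=400.0, alpha=0.34, beta=0.28))
```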

Consequence for the codex. Pretraining allocates compute near the optimum, adjusted downward in parameter count to reflect the codex's commitment to long inference lifetimes; the lifetime adjustment is derived in the deployment-cost annex.

A.3 PPO clipped surrogate and monotonic improvement

Claim. The clipped surrogate

$$\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

provides a lower bound on the trust-region objective of TRPO, and maximizing it produces approximately monotonic policy improvement.

Sketch. For $\hat{A}_t > 0$, the surrogate is bounded above by $(1+\epsilon)\,\hat{A}_t$, removing the incentive to push the importance ratio beyond $1+\epsilon$. For $\hat{A}_t < 0$, the surrogate is symmetrically bounded above by $(1-\epsilon)\,\hat{A}_t$, removing the incentive to push the ratio below $1-\epsilon$. The result is a first-order approximation to the trust-region objective with a soft clipping penalty replacing the explicit KL constraint; approximately monotonic improvement carries over from the trust-region setting under this bound.
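A minimal sketch of the surrogate as a descent loss (negated because optimizers minimize); the tensor names and batching are assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negative clipped surrogate, averaged over a batch of timesteps."""
    ratio = torch.exp(logp_new - logp_old)                                # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                          # maximize L^CLIP
```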

Consequence for the codex. Locomotion and residual-RL training in Manipulation use PPO with $\epsilon = 0.2$.

A.4 DPO equivalence to RLHF

Claim. The RLHF objective

$$\max_\pi\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\ \big\|\ \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

has a closed-form optimal policy

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right).$$

Inverting this for $r$, substituting into the Bradley–Terry preference model, and noting that the partition function $Z(x)$ cancels in the preference difference yields the DPO loss

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

with no explicit reward model.

Sketch. Solving the KL-regularized maximization yields the exponentially tilted policy. Bradley–Terry preferences $P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$ then yield the DPO loss once the rewards are expressed in terms of the optimal-policy log-ratios.
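A minimal sketch of the resulting loss from precomputed sequence log-probabilities; the argument names and the value of $\beta$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from log-probs of chosen (w) and rejected (l) responses."""
    chosen_logratio = logp_w - ref_logp_w       # log pi_theta / pi_ref for y_w
    rejected_logratio = logp_l - ref_logp_l     # log pi_theta / pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```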

Consequence for the codex. Alignment training in Alignment uses DPO as the dominant preference-tuning loss.

A.5 Honesty condition for cryptoeconomic validation

Claim. Suppose a validator can choose to cheat (produce a false output) at gain $G > 0$. Suppose cheating is detected with probability $p \in (0, 1]$ and detection produces a slash of $S > 0$. Suppose the validator's discount factor is $\delta \in (0, 1)$ and per-period legitimate income is $W > 0$. Then honesty is the validator's strict best response iff

$$p\left(S + \frac{\delta}{1-\delta}\, W \cdot \mathbb{1}[\text{slash terminates participation}]\right) > G.$$

Sketch. Compare two strategies. Honest: receives $W$ every period, for value $W/(1-\delta)$. Cheat once then return to honest: receives $W + G$ this period, pays $S$ with probability $p$, and (if slashing terminates participation) forfeits the continuation value $\delta W/(1-\delta)$ when caught. Under termination, the honest strategy strictly dominates iff

$$\frac{W}{1-\delta} > W + G - pS + (1 - p)\,\frac{\delta W}{1-\delta}$$

which simplifies to $p\big(S + \delta W/(1-\delta)\big) > G$, the displayed condition. Without termination, the continuation value is retained whether or not the cheat is detected, and the condition reduces to $pS > G$.
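A one-function check of the condition; the parameter values in the example are illustrative assumptions, not the codex's validator settings.

```python
def honesty_holds(p, S, G, W, delta, slash_terminates=True):
    """Check p * (S + delta * W / (1 - delta)) > G; reduces to p*S > G without termination."""
    continuation_loss = delta * W / (1.0 - delta) if slash_terminates else 0.0
    return p * (S + continuation_loss) > G

# Illustrative values only.
print(honesty_holds(p=0.05, S=1_000.0, G=40.0, W=10.0, delta=0.95))  # True
```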

Consequence for the codex. Validator stake $S$ and audit probability $p$ are jointly set so the displayed condition holds with margin against the largest plausible $G$ for each surface. See Validators.

A.6 Catastrophic forgetting bound under elastic weight consolidation

Claim. Under the quadratic penalty

$$\mathcal{L}_{B|A}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2}\sum_i F_i\, (\theta_i - \theta_i^A)^2$$

with $F_i$ the diagonal of the empirical Fisher information at $\theta^A$, the expected post-update loss on task $A$ is bounded by

$$\mathbb{E}\big[\mathcal{L}_A(\theta_{B|A})\big] \leq \mathcal{L}_A(\theta^A) + \frac{1}{2\lambda}\, \mathrm{tr}\big(F^{-1}\, \nabla\mathcal{L}_B\, \nabla\mathcal{L}_B^\top\big) + O\big(\|\theta_{B|A} - \theta^A\|^3\big).$$

Sketch. Take a second-order Taylor expansion of $\mathcal{L}_A$ around $\theta^A$: the gradient vanishes at the optimum, and under regularity conditions the Hessian of a probabilistic model is approximated by the Fisher information. Substituting the regularized update and bounding its displacement by the regularization strength yields the displayed bound.
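A minimal sketch of the penalty term applied during task-B training; the dictionaries `fisher_diag` and `theta_A` (keyed by parameter name) are assumed to be captured at the end of task A.

```python
import torch

def ewc_penalty(model, fisher_diag, theta_A, lam):
    """Quadratic EWC penalty (lambda / 2) * sum_i F_i * (theta_i - theta_i^A)**2."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (param - theta_A[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Task-B objective: total_loss = task_b_loss + ewc_penalty(model, fisher_diag, theta_A, lam)
```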

Consequence for the codex. EWC is the dominant continual-learning regularizer in Continual Learning. The bound makes the trade-off explicit: increasing $\lambda$ tightens the bound at the cost of $\mathcal{L}_B$.

A.7 Attention complexity

Standard result. Multi-head attention on a sequence of length $N$ with model dimension $d$ has time complexity $O(N^2 d)$ and memory complexity $O(N^2 + N d)$.

Consequence for the codex. The fixed-count cognition-token bottleneck in VLA Architecture reduces the cross-stream attention from $O(N^2)$ in the latent length to $O(KN)$, where $K$ is the cognition-token count. For $K = 64$ and $N \sim 4000$, this is approximately a $30\times$ reduction in attention FLOPs.
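A back-of-the-envelope check of the stated factor; counting the cross-stream attention in both directions ($N \to K$ and $K \to N$) is an assumption made here to reproduce the $\sim 30\times$ figure.

```python
N, K = 4000, 64
full_attention = N * N               # O(N^2) pairwise scores
bottleneck_attention = 2 * K * N     # both cross-stream directions through K cognition tokens
print(full_attention / bottleneck_attention)   # about 31x fewer attention scores
```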