
Theoretical Foundations

This appendix collects the formal derivations that the main chapters reference but do not derive. The proofs are sketches sufficient for verification rather than full treatments; the full constructions appear in the codex's body of work and are reproducible from the definitions given.

A.1 Flow matching as a generative objective

Claim. Training a neural network $\hat{v}_\theta(x_t, t)$ to predict the conditional velocity field

$$v_t(x \mid x_1) = \frac{d}{dt}\big[(1-t)\,x_0 + t\,x_1\big] = x_1 - x_0$$

under the loss

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\big\|\hat{v}_\theta(x_t, t) - (x_1 - x_0)\big\|^2$$

yields a model whose ODE trajectories transport the noise prior to the data distribution.

Sketch. Define the marginal velocity $u_t(x) = \mathbb{E}[v_t(x \mid x_1) \mid x_t = x]$. This field satisfies the continuity equation for the marginal path $p_t$ induced by the interpolant, so solving $dx/dt = u_t(x)$ from $x_0 \sim p_0$ at $t=0$ produces $x_1 \sim p_1$ at $t=1$. The flow-matching loss is the conditional-expectation regression objective for $u_t$; minimizing it drives $\hat{v}_\theta \to u_t$ in the $L^2$ sense, and the resulting ODE is the desired transport.
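As a concrete reading of the regression target, a minimal training-loss sketch follows; the `velocity_model` callable, the Gaussian prior, and the tensor shapes are illustrative assumptions rather than the codex's implementation.

```python
import torch

def flow_matching_loss(velocity_model, x1):
    """Conditional flow-matching loss for a batch of data samples x1 of shape (batch, dim).

    Draws x0 from a standard Gaussian prior and t uniformly on [0, 1]; the
    regression target is the constant conditional velocity x1 - x0.
    """
    x0 = torch.randn_like(x1)          # noise prior p0
    t = torch.rand(x1.shape[0], 1)     # one t per sample, broadcast over dim
    x_t = (1.0 - t) * x0 + t * x1      # linear interpolant
    target = x1 - x0                   # v_t(x | x1)
    pred = velocity_model(x_t, t)
    return ((pred - target) ** 2).mean()
```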

Consequence for the codex. The action head of the embodiment surface is a flow-matching model. The ODE solver at inference is Euler with four steps; the residual error is bounded by the step size and the Lipschitz constant of $\hat{v}_\theta$, both of which are observable in training logs.
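A sketch of the four-step Euler integration described above; the call signature of the learned field and the uniform step size are assumptions.

```python
import torch

@torch.no_grad()
def sample_euler(velocity_model, x0, num_steps=4):
    """Integrate dx/dt = v_theta(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_model(x, t)   # local error scales with dt^2 and the Lipschitz constant
    return x                                # approximate sample from p1
```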

A.2 Scaling-law compute optimum

Claim. Under a parametric loss model

$$\mathcal{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

with $\alpha, \beta > 0$ and compute $C \propto N D$, the compute-optimal allocation satisfies

$$\frac{\alpha A}{(N^*)^{\alpha}} = \frac{\beta B}{(D^*)^{\beta}}, \qquad \text{equivalently} \qquad \frac{(N^*)^{\alpha}}{(D^*)^{\beta}} = \frac{\alpha A}{\beta B}.$$

For $\alpha \approx \beta$, this reduces to $N^* \propto C^{1/2}$, $D^* \propto C^{1/2}$.

Sketch. Apply Lagrange multipliers to $\min \mathcal{L}(N, D)$ subject to $N D = C/k$ for a fixed FLOP-per-token constant $k$. Setting $\partial_N \mathcal{L} = \mu D$ and $\partial_D \mathcal{L} = \mu N$ and dividing gives $\alpha A N^{-\alpha-1} / (\beta B D^{-\beta-1}) = D/N$, which rearranges to the displayed condition.
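A numeric sketch of the allocation follows, using the closed form obtained by substituting $N = (C/k)/D$ into the condition; the constants $A$, $B$, $\alpha$, $\beta$, and $k$ below are illustrative assumptions, not the codex's fitted values.

```python
def compute_optimal_allocation(C, A, B, alpha, beta, k=6.0):
    """Split a FLOP budget C between parameters N and tokens D.

    Solves alpha*A/N**alpha == beta*B/D**beta subject to k*N*D == C,
    giving D* ~ C**(alpha/(alpha+beta)) and N* ~ C**(beta/(alpha+beta)).
    """
    budget = C / k                                   # N * D
    D = ((beta * B / (alpha * A)) ** (1.0 / (alpha + beta))
         * budget ** (alpha / (alpha + beta)))
    N = budget / D
    return N, D

# With alpha close to beta, quadrupling C roughly doubles both N* and D*.
print(compute_optimal_allocation(1e21, A=400.0, B=400.0, alpha=0.34, beta=0.28))
print(compute_optimal_allocation(4e21, A=400.0, B=400.0, alpha=0.34, beta=0.28))
```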

Consequence for the codex. Pretraining allocates compute near the optimum, adjusted downward in parameter count to reflect the codex's commitment to long inference lifetimes; the lifetime adjustment is derived in the deployment-cost annex.

A.3 PPO clipped surrogate and monotonic improvement

Claim. The clipped surrogate

$$\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

provides a lower bound on the trust-region objective of TRPO, and maximizing it produces approximately monotonic policy improvement.

Sketch. For $\hat{A}_t > 0$, the surrogate is bounded above by $(1+\epsilon)\,\hat{A}_t$, removing the incentive to push the importance ratio beyond $1+\epsilon$. For $\hat{A}_t < 0$, the surrogate is symmetrically bounded above by $(1-\epsilon)\,\hat{A}_t$, removing the incentive to push the ratio below $1-\epsilon$. The result is a first-order approximation to the trust-region objective with a soft clipping penalty replacing the explicit KL constraint; approximately monotonic improvement carries over from the trust-region setting under this bound.
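A minimal sketch of the surrogate as a descent loss (negated because optimizers minimize); the tensor names and batching are assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negative clipped surrogate, averaged over a batch of timesteps."""
    ratio = torch.exp(logp_new - logp_old)                                # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                          # maximize L^CLIP
```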

Consequence for the codex. Locomotion and residual-RL training in Manipulation use PPO with $\epsilon = 0.2$.

A.4 DPO equivalence to RLHF

Claim. The RLHF objective

$$\max_\pi\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\ \big\|\ \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

has a closed-form optimal policy

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right).$$

Inverting this for $r$, substituting into the Bradley–Terry preference model, and noting that the partition function $Z(x)$ cancels in the preference difference yields the DPO loss

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

with no explicit reward model.

Sketch. Solving the KL-regularized maximization yields the exponentially tilted policy. Bradley–Terry preferences $P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$ then yield the DPO loss once the rewards are expressed in terms of the optimal-policy log-ratios.
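A minimal sketch of the resulting loss from precomputed sequence log-probabilities; the argument names and the value of $\beta$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from log-probs of chosen (w) and rejected (l) responses."""
    chosen_logratio = logp_w - ref_logp_w       # log pi_theta / pi_ref for y_w
    rejected_logratio = logp_l - ref_logp_l     # log pi_theta / pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```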

Consequence for the codex. Alignment training in Alignment uses DPO as the dominant preference-tuning loss.

A.5 Honesty condition for cryptoeconomic validation

Claim. Suppose a validator can choose to cheat (produce a false output) at gain $G > 0$. Suppose cheating is detected with probability $p \in (0, 1]$ and detection produces a slash of $S > 0$. Suppose the validator's discount factor is $\delta \in (0, 1)$ and per-period legitimate income is $W > 0$. Then honesty is the validator's strict best response iff

$$p\left(S + \frac{\delta}{1-\delta}\, W \cdot \mathbb{1}[\text{slash terminates participation}]\right) > G.$$

Sketch. Compare two strategies. Honest: receives $W$ every period, for value $W/(1-\delta)$. Cheat once then return to honest: receives $W + G$ this period, pays $S$ with probability $p$, and (if slashing terminates participation) forfeits the continuation value $\delta W/(1-\delta)$ when caught. Under termination, the honest strategy strictly dominates iff

$$\frac{W}{1-\delta} > W + G - pS + (1 - p)\,\frac{\delta W}{1-\delta}$$

which simplifies to $p\big(S + \delta W/(1-\delta)\big) > G$, the displayed condition. Without termination, the continuation value is retained whether or not the cheat is detected, and the condition reduces to $pS > G$.
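A one-function check of the condition; the parameter values in the example are illustrative assumptions, not the codex's validator settings.

```python
def honesty_holds(p, S, G, W, delta, slash_terminates=True):
    """Check p * (S + delta * W / (1 - delta)) > G; reduces to p*S > G without termination."""
    continuation_loss = delta * W / (1.0 - delta) if slash_terminates else 0.0
    return p * (S + continuation_loss) > G

# Illustrative values only.
print(honesty_holds(p=0.05, S=1_000.0, G=40.0, W=10.0, delta=0.95))  # True
```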

Consequence for the codex. Validator stake $S$ and audit probability $p$ are jointly set so the displayed condition holds with margin against the largest plausible $G$ for each surface. See Validators.

A.6 Catastrophic forgetting bound under elastic weight consolidation

Claim. Under the quadratic penalty

$$\mathcal{L}_{B|A}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2}\sum_i F_i\, (\theta_i - \theta_i^A)^2$$

with $F_i$ the diagonal of the empirical Fisher information at $\theta^A$, the expected post-update loss on task $A$ is bounded by

$$\mathbb{E}\big[\mathcal{L}_A(\theta_{B|A})\big] \leq \mathcal{L}_A(\theta^A) + \frac{1}{2\lambda}\, \mathrm{tr}\big(F^{-1}\, \nabla\mathcal{L}_B\, \nabla\mathcal{L}_B^\top\big) + O\big(\|\theta_{B|A} - \theta^A\|^3\big).$$

Sketch. Take a second-order Taylor expansion of $\mathcal{L}_A$ around $\theta^A$: the gradient vanishes at the optimum, and under regularity conditions the Hessian of a probabilistic model is approximated by the Fisher information. Substituting the regularized update and bounding its displacement by the regularization strength yields the displayed bound.
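A minimal sketch of the penalty term applied during task-B training; the dictionaries `fisher_diag` and `theta_A` (keyed by parameter name) are assumed to be captured at the end of task A.

```python
import torch

def ewc_penalty(model, fisher_diag, theta_A, lam):
    """Quadratic EWC penalty (lambda / 2) * sum_i F_i * (theta_i - theta_i^A)**2."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (param - theta_A[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Task-B objective: total_loss = task_b_loss + ewc_penalty(model, fisher_diag, theta_A, lam)
```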

Consequence for the codex. EWC is the dominant continual-learning regularizer in Continual Learning. The bound makes the trade-off explicit: increasing $\lambda$ tightens the bound at the cost of $\mathcal{L}_B$.

A.7 Attention complexity

Standard result. Multi-head attention on a sequence of length $N$ with model dimension $d$ has time complexity $O(N^2 d)$ and memory complexity $O(N^2 + N d)$.

Consequence for the codex. The fixed-count cognition-token bottleneck in VLA Architecture reduces the cross-stream attention from $O(N^2)$ in the latent length to $O(KN)$, where $K$ is the cognition-token count. For $K = 64$ and $N \sim 4000$, this is approximately a $30\times$ reduction in attention FLOPs.
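A back-of-the-envelope check of the stated factor; counting the cross-stream attention in both directions ($N \to K$ and $K \to N$) is an assumption made here to reproduce the $\sim 30\times$ figure.

```python
N, K = 4000, 64
full_attention = N * N               # O(N^2) pairwise scores
bottleneck_attention = 2 * K * N     # both cross-stream directions through K cognition tokens
print(full_attention / bottleneck_attention)   # about 31x fewer attention scores
```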