OGI

Reinforcement Learning & Self-Play

Reinforcement learning is applied after pretraining and supervised fine-tuning. It is the mechanism by which surface behavior is shaped toward outcomes rather than mere statistical fit to demonstration data.

Policy gradient

The base objective is to maximize expected return:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigg[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\bigg].

Its gradient is

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigg[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\bigg]

with advantage estimates \hat{A}_t = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t).
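In automatic-differentiation frameworks this estimator is usually implemented as a surrogate loss whose gradient matches the expression above. A minimal sketch in PyTorch, assuming per-step log-probabilities and advantages are already collected as tensors:

```python
import torch

def policy_gradient_loss(logp: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Score-function surrogate: minimizing this with gradient descent recovers
    an ascent step on E[ sum_t grad log pi_theta(a_t | s_t) * A_t ].

    logp:       log pi_theta(a_t | s_t) for each sampled step, shape [N]
    advantages: advantage estimates A_t, shape [N]
    """
    # Detach the advantages so gradients flow only through log pi_theta.
    return -(logp * advantages.detach()).mean()
```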

Variance reduction

Naive policy gradient has variance that grows with trajectory length. The codex uses generalized advantage estimation:

\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}

with TD residuals \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). Typical (\gamma, \lambda) = (0.99, 0.95).
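The infinite sum telescopes into the backward recursion \hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}, which is how GAE is computed in practice. A minimal sketch, assuming a single un-truncated episode with a bootstrap value appended (termination masking omitted):

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized advantage estimation via A_t = delta_t + gamma*lam*A_{t+1}.

    rewards: shape [T]
    values:  shape [T + 1], value estimates including the bootstrap V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```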

Clipped surrogate objective

Updates are stabilized with PPO's clipped objective:

\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big]

with r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t) and \epsilon = 0.2.

PPO is the codex's default; for embodied tasks with continuous actions it produces stable updates at the batch sizes typical of the network's distributed-rollout regime.
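As a loss term for a standard minimizing optimizer, the clipped objective looks roughly like the following sketch (tensor names and shapes are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negated clipped surrogate L^CLIP.

    logp_new:   log pi_theta(a_t | s_t) under the current policy, shape [N]
    logp_old:   log pi_theta_old(a_t | s_t), detached rollout-time values, shape [N]
    advantages: A_t estimates, e.g. from GAE, shape [N]
    """
    ratio = torch.exp(logp_new - logp_old)  # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # min() keeps the pessimistic bound; negation turns ascent into descent.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```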

Self-play

For tasks with adversarial or multi-agent structure, self-play is used. The policy plays against a frozen or stochastically sampled prior version of itself; the outcomes drive the policy gradient as above. The key theoretical result is that in two-player zero-sum games, self-play with counterfactual-regret-style updates converges, in average strategy, to a Nash equilibrium.
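A sketch of the opponent-sampling loop; env.rollout(policy, opponent) is a hypothetical environment API, and the pool-management details are illustrative rather than the codex's actual scheme:

```python
import copy
import random

def self_play_step(policy, opponent_pool, env, step, snapshot_every=10_000):
    """Roll out the current policy against a frozen prior version of itself."""
    opponent = random.choice(opponent_pool)      # stochastically sampled prior snapshot
    trajectory = env.rollout(policy, opponent)   # hypothetical environment API
    if step % snapshot_every == 0:
        opponent_pool.append(copy.deepcopy(policy))  # freeze current weights into the pool
    return trajectory  # outcomes feed the policy gradient as above
```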

For OGI's purposes, self-play is most relevant in:

  • Code generation against a held-out test oracle (a degenerate self-play in which the "opponent" is the evaluator).
  • Embodied multi-agent simulation (e.g., collaborative manipulation, adversarial perturbation training).
  • Plan refinement (one model produces a plan, another critiques it; the critic's reward is whether its critique improved the plan's outcome).

Reward models

For tasks lacking a programmatic reward signal, a learned reward model R_\psi is used. The reward model is trained on preference comparisons:

\mathcal{L}_{\mathrm{RM}}(\psi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log\sigma\big(R_\psi(x, y_w) - R_\psi(x, y_l)\big)\Big]

where y_w is the preferred output and y_l the dispreferred one. The trained reward model substitutes for r(s, a) in the policy gradient.
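This is the Bradley-Terry negative log-likelihood on score differences. A minimal sketch, assuming the reward model has already scored both completions in a batch:

```python
import torch.nn.functional as F

def reward_model_loss(scores_w, scores_l):
    """Preference loss -log sigma(R(x, y_w) - R(x, y_l)), averaged over a batch.

    scores_w: R_psi(x, y_w) for the preferred outputs, shape [N]
    scores_l: R_psi(x, y_l) for the dispreferred outputs, shape [N]
    """
    # logsigmoid is numerically stabler than log(sigmoid(.)).
    return -F.logsigmoid(scores_w - scores_l).mean()
```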

Reward models drift: as the policy improves, it discovers responses outside the reward model's training distribution and may receive spurious high rewards. The codex addresses this with periodic reward-model retraining on policy outputs and with explicit KL penalties to the supervised-fine-tuned reference policy.

Outcome-based RL for embodiment

Embodied tasks with programmatic outcome signals (object placed correctly, terrain crossed without falling) use the outcome directly as reward. The reward is sparse, typically binary and delivered at episode termination, and is densified through:

  • Auxiliary shaped rewards during early training, annealed to zero (see the sketch after this list).
  • Sub-task milestones detected by the Reasoning surface.
  • World-model-imagined rollouts that score partial trajectories.
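A sketch of the first mechanism, annealed shaping, under an assumed linear schedule:

```python
def densified_reward(outcome, shaping, step, anneal_steps=1_000_000):
    """Sparse outcome reward plus an auxiliary shaping term annealed to zero.

    outcome: sparse terminal reward (e.g. 0/1 success)
    shaping: dense auxiliary signal (e.g. negative distance-to-goal)
    """
    weight = max(0.0, 1.0 - step / anneal_steps)  # linear anneal, 1 -> 0
    return outcome + weight * shaping
```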

KL constraints

All RL stages constrain the policy to remain close to a reference policy:

\max_\theta\ J(\pi_\theta) - \beta \cdot \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big).

The reference policy is typically the supervised-fine-tuned base. The KL penalty serves two functions: it prevents reward hacking, and it preserves the broad competence acquired in pretraining.
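In practice the KL term is often folded into the reward rather than optimized as a separate constraint. A minimal sketch using the per-sample estimator log pi_theta - log pi_ref, whose expectation under pi_theta is the KL; the coefficient beta is an assumed value:

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward shaped by a KL penalty toward the reference (SFT) policy.

    reward:      task or reward-model score for the sampled action/sequence
    logp_policy: log pi_theta(a | s) under the current policy
    logp_ref:    log pi_ref(a | s) under the frozen reference policy
    """
    kl_sample = logp_policy - logp_ref  # single-sample KL estimate
    return reward - beta * kl_sample
```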

What RL does not do in this codex

It does not replace imitation. The flow-matching imitation objective from Manipulation is the dominant training signal; RL is a residual on top. The codex deliberately rejects pure-from-scratch RL at scale because the sample complexity is prohibitive even with the network's distributed rollouts.