OGI

Reinforcement Learning & Self-Play

Reinforcement learning is applied after pretraining and supervised fine-tuning. It is the mechanism by which surface behavior is shaped toward outcomes rather than mere statistical fit to demonstration data.

Policy gradient

The base objective is to maximize expected return:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigg[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\bigg].

Its gradient is

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigg[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\bigg]

with advantage estimates \hat{A}_t = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t).
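In automatic-differentiation frameworks this estimator is usually implemented as a surrogate loss whose gradient matches the expression above. A minimal sketch in PyTorch, assuming per-step log-probabilities and advantages are already collected as tensors:

```python
import torch

def policy_gradient_loss(logp: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Score-function surrogate: minimizing this with gradient descent recovers
    an ascent step on E[ sum_t grad log pi_theta(a_t | s_t) * A_t ].

    logp:       log pi_theta(a_t | s_t) for each sampled step, shape [N]
    advantages: advantage estimates A_t, shape [N]
    """
    # Detach the advantages so gradients flow only through log pi_theta.
    return -(logp * advantages.detach()).mean()
```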

Variance reduction

Naive policy gradient has variance that grows with trajectory length. The codex uses generalized advantage estimation:

\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}

with TD residuals \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). Typical (\gamma, \lambda) = (0.99, 0.95).
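The infinite sum telescopes into the backward recursion \hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}, which is how GAE is computed in practice. A minimal sketch, assuming a single un-truncated episode with a bootstrap value appended (termination masking omitted):

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized advantage estimation via A_t = delta_t + gamma*lam*A_{t+1}.

    rewards: shape [T]
    values:  shape [T + 1], value estimates including the bootstrap V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```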

Clipped surrogate objective

Updates are stabilized with PPO's clipped objective:

\mathcal{L}^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big]

with r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t) and \epsilon = 0.2.

PPO is the codex's default; for embodied tasks with continuous actions it produces stable updates at the batch sizes typical of the network's distributed-rollout regime.
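As a loss term for a standard minimizing optimizer, the clipped objective looks roughly like the following sketch (tensor names and shapes are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negated clipped surrogate L^CLIP.

    logp_new:   log pi_theta(a_t | s_t) under the current policy, shape [N]
    logp_old:   log pi_theta_old(a_t | s_t), detached rollout-time values, shape [N]
    advantages: A_t estimates, e.g. from GAE, shape [N]
    """
    ratio = torch.exp(logp_new - logp_old)  # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # min() keeps the pessimistic bound; negation turns ascent into descent.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```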

Self-play

For tasks with adversarial or multi-agent structure, self-play is used. The policy plays against a frozen or stochastically sampled prior version of itself; the outcomes drive the policy gradient as above. The key theoretical result is that in two-player zero-sum games, self-play with counterfactual-regret-style updates converges, in average strategy, to a Nash equilibrium.
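A sketch of the opponent-sampling loop; env.rollout(policy, opponent) is a hypothetical environment API, and the pool-management details are illustrative rather than the codex's actual scheme:

```python
import copy
import random

def self_play_step(policy, opponent_pool, env, step, snapshot_every=10_000):
    """Roll out the current policy against a frozen prior version of itself."""
    opponent = random.choice(opponent_pool)      # stochastically sampled prior snapshot
    trajectory = env.rollout(policy, opponent)   # hypothetical environment API
    if step % snapshot_every == 0:
        opponent_pool.append(copy.deepcopy(policy))  # freeze current weights into the pool
    return trajectory  # outcomes feed the policy gradient as above
```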

For OGI's purposes, self-play is most relevant in:

  • Code generation against a held-out test oracle (a degenerate self-play in which the "opponent" is the evaluator).
  • Embodied multi-agent simulation (e.g., collaborative manipulation, adversarial perturbation training).
  • Plan refinement (one model produces a plan, another critiques it; the critic's reward is whether its critique improved the plan's outcome).

Reward models

For tasks lacking a programmatic reward signal, a learned reward model R_\psi is used. The reward model is trained on preference comparisons:

\mathcal{L}_{\mathrm{RM}}(\psi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log\sigma\big(R_\psi(x, y_w) - R_\psi(x, y_l)\big)\Big]

where y_w is the preferred output and y_l the dispreferred one. The trained reward model substitutes for r(s, a) in the policy gradient.
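This is the Bradley-Terry negative log-likelihood on score differences. A minimal sketch, assuming the reward model has already scored both completions in a batch:

```python
import torch.nn.functional as F

def reward_model_loss(scores_w, scores_l):
    """Preference loss -log sigma(R(x, y_w) - R(x, y_l)), averaged over a batch.

    scores_w: R_psi(x, y_w) for the preferred outputs, shape [N]
    scores_l: R_psi(x, y_l) for the dispreferred outputs, shape [N]
    """
    # logsigmoid is numerically stabler than log(sigmoid(.)).
    return -F.logsigmoid(scores_w - scores_l).mean()
```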

Reward models drift: as the policy improves, it discovers responses outside the reward model's training distribution and may receive spurious high rewards. The codex addresses this with periodic reward-model retraining on policy outputs and with explicit KL penalties to the supervised-fine-tuned reference policy.

Outcome-based RL for embodiment

Embodied tasks with programmatic outcome signals (object placed correctly, terrain crossed without falling) use the outcome directly as reward. The reward is sparse, typically binary and delivered at episode termination, and is densified through:

  • Auxiliary shaped rewards during early training, annealed to zero (see the sketch after this list).
  • Sub-task milestones detected by the Reasoning surface.
  • World-model-imagined rollouts that score partial trajectories.
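A sketch of the first mechanism, annealed shaping, under an assumed linear schedule:

```python
def densified_reward(outcome, shaping, step, anneal_steps=1_000_000):
    """Sparse outcome reward plus an auxiliary shaping term annealed to zero.

    outcome: sparse terminal reward (e.g. 0/1 success)
    shaping: dense auxiliary signal (e.g. negative distance-to-goal)
    """
    weight = max(0.0, 1.0 - step / anneal_steps)  # linear anneal, 1 -> 0
    return outcome + weight * shaping
```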

KL constraints

All RL stages constrain the policy to remain close to a reference policy:

\max_\theta\ J(\pi_\theta) - \beta \cdot \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big).

The reference policy is typically the supervised-fine-tuned base. The KL penalty serves two functions: it prevents reward hacking, and it preserves the broad competence acquired in pretraining.
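In practice the KL term is often folded into the reward rather than optimized as a separate constraint. A minimal sketch using the per-sample estimator log pi_theta - log pi_ref, whose expectation under pi_theta is the KL; the coefficient beta is an assumed value:

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward shaped by a KL penalty toward the reference (SFT) policy.

    reward:      task or reward-model score for the sampled action/sequence
    logp_policy: log pi_theta(a | s) under the current policy
    logp_ref:    log pi_ref(a | s) under the frozen reference policy
    """
    kl_sample = logp_policy - logp_ref  # single-sample KL estimate
    return reward - beta * kl_sample
```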

What RL does not do in this codex

It does not replace imitation. The flow-matching imitation objective from Manipulation is the dominant training signal; RL is a residual on top. The codex deliberately rejects pure-from-scratch RL at scale because the sample complexity is prohibitive even with the network's distributed rollouts.