Reinforcement Learning & Self-Play
Reinforcement learning is applied after pretraining and supervised fine-tuning. It is the mechanism by which surface behavior is shaped toward outcomes rather than mere statistical fit to demonstration data.
Policy gradient
The base objective is to maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \gamma^{t} r_t\right].$$

Its gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right],$$

with advantage estimates $\hat{A}_t$.
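As a minimal sketch in PyTorch (tensor names are illustrative), the estimator can be written as a loss whose gradient matches the expression above, with advantages treated as fixed targets:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Score-function estimator: the gradient of this loss matches
    -E[ sum_t grad log pi(a_t|s_t) * A_t ], so minimizing it ascends J(theta)."""
    # Advantages are targets, not functions of theta here, so detach them.
    return -(log_probs * advantages.detach()).mean()
```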
Variance reduction
Naive policy gradient has variance that grows with trajectory length. The codex uses generalized advantage estimation:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l},$$

with TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Typical values are $\gamma \approx 0.99$ and $\lambda \approx 0.95$.
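A reference sketch of the backward recursion, using the typical values above as defaults (the tensor shapes are assumptions for illustration):

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion for GAE over one rollout.

    rewards, dones: float tensors of shape [T]; values: shape [T + 1]
    (the extra entry bootstraps the state after the rollout).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
    return advantages
```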
Clipped surrogate objective
Updates are stabilized with PPO's clipped objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\,\hat{A}_t\right)\right],$$

with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and clipping parameter $\epsilon$, commonly around $0.2$.
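A minimal sketch of the clipped surrogate as a loss term (the names and the $\epsilon$ default are illustrative, not the codex's exact implementation):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate: pessimistic minimum of the clipped and unclipped terms."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximizing L^CLIP is implemented as minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```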
PPO is the codex's default; for embodied tasks with continuous actions it produces stable updates at the batch sizes typical of the network's distributed-rollout regime.
Self-play
For tasks with adversarial or multi-agent structure, self-play is used. The policy plays against a frozen or stochastically sampled prior version of itself; outcomes drive the policy gradient as above. The key theoretical result is that in two-player zero-sum games, self-play with counterfactual-regret-style updates converges (in average strategy) to a Nash equilibrium.
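A minimal sketch of the opponent-sampling loop; `env.play_episode` and the `policy` object are hypothetical stand-ins for the codex's actual episode runner and policy interface:

```python
import copy
import random

def self_play_rollouts(policy, opponent_pool, env, episodes=64):
    """One self-play iteration: the live policy plays a frozen prior version of
    itself sampled from the pool; episode outcomes become the returns fed to
    the policy-gradient update above."""
    opponent = random.choice(opponent_pool)   # stochastically sampled prior checkpoint
    return [
        env.play_episode(protagonist=policy, antagonist=opponent)
        for _ in range(episodes)
    ]

def refresh_opponent_pool(opponent_pool, policy, max_size=10):
    """Periodically freeze a copy of the current policy into the opponent pool."""
    opponent_pool.append(copy.deepcopy(policy))
    return opponent_pool[-max_size:]
```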
For OGI's purposes, self-play is most relevant in:
- Code generation against a held-out test oracle (a degenerate form of self-play in which the "opponent" is the evaluator).
- Embodied multi-agent simulation (e.g., collaborative manipulation, adversarial perturbation training).
- Plan refinement (one model produces a plan, another critiques it; the critic's reward is whether its critique improved the plan's outcome).
Reward models
For tasks lacking a programmatic reward signal, a learned reward model $r_\phi$ is used. The reward model is trained on preference comparisons:

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right],$$

where $y_w$ is the preferred output and $y_l$ the dispreferred. The trained reward model $r_\phi$ then substitutes for the environment reward in the policy gradient.
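A minimal sketch of this comparison loss in PyTorch, assuming a `reward_model(prompt, response)` callable that returns a scalar score per pair:

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry preference loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    r_chosen = reward_model(prompts, chosen)      # r_phi(x, y_w)
    r_rejected = reward_model(prompts, rejected)  # r_phi(x, y_l)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```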
Reward models drift: as the policy improves, it discovers responses outside the reward model's training distribution and may receive spuriously high rewards. The codex addresses this with periodic reward-model retraining on policy outputs and with explicit KL penalties to the supervised-fine-tuned reference policy.
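One way the retraining half of that mitigation could be organized; `policy.generate`, `label_fn`, and `retrain_fn` are hypothetical stand-ins for the codex's sampling, preference-collection, and reward-model-training procedures:

```python
def refresh_reward_model(policy, reward_model, prompts, label_fn, retrain_fn):
    """One drift-mitigation cycle: sample from the current policy, collect fresh
    preference labels on those samples, and retrain the reward model so its
    training distribution tracks what the policy now produces."""
    samples = [policy.generate(p) for p in prompts]  # on-policy outputs
    preferences = label_fn(prompts, samples)         # new comparisons on those outputs
    return retrain_fn(reward_model, preferences)     # updated reward model
```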
Outcome-based RL for embodiment
Embodied tasks with programmatic outcome signals (object placed correctly, terrain crossed without falling) use the outcome directly as reward. The reward is sparse, typically a binary signal at episode termination, and is densified through:
- Auxiliary shaped rewards during early training, annealed to zero (see the sketch after this list).
- Sub-task milestones detected by the Reasoning surface.
- World-model-imagined rollouts that score partial trajectories.
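A minimal sketch of the first mechanism; the linear schedule and step count are illustrative assumptions, not the codex's actual annealing schedule:

```python
def densified_reward(outcome_reward, shaped_reward, step, anneal_steps=100_000):
    """Outcome reward plus an auxiliary shaped term whose weight decays
    linearly to zero over anneal_steps updates."""
    weight = max(0.0, 1.0 - step / anneal_steps)
    return outcome_reward + weight * shaped_reward
```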
KL constraints
All RL stages constrain the policy to remain close to a reference policy:

$$\max_\theta\ \mathbb{E}_{\pi_\theta}\left[r\right] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big).$$
The reference policy is typically the supervised-fine-tuned base. The KL penalty serves two functions: it prevents reward hacking, and it preserves the broad competence acquired in pretraining.
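In practice this constraint is often applied as a per-step penalty subtracted from the reward; a minimal sketch, with beta as an assumed coefficient:

```python
def kl_penalized_reward(reward, policy_log_probs, ref_log_probs, beta=0.05):
    """Reward with a per-step KL penalty toward the frozen reference policy:
    r_t - beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t))."""
    kl_estimate = policy_log_probs - ref_log_probs   # single-sample KL estimate
    return reward - beta * kl_estimate
```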
What RL does not do in this codex
It does not replace imitation. The flow-matching imitation objective from Manipulation is the dominant training signal; RL is a residual on top. The codex deliberately rejects pure-from-scratch RL at scale because the sample complexity is prohibitive even with the network's distributed rollouts.