Locomotion
Locomotion is the embodiment regime in which the agent's body translates through the environment. It includes bipedal, quadrupedal, wheeled, and aerial bases. Locomotion is treated separately from manipulation because its dynamics, control frequencies, and failure modes are structurally distinct.
Control hierarchy
Locomotion is controlled at three frequencies, each implemented by a different module:
| Layer | Frequency | Output | Trained by |
|---|---|---|---|
| High-level | 5 Hz | Subgoal velocity / heading | Imitation + reasoning |
| Mid-level | 50 Hz | Joint target positions | Imitation + RL |
| Low-level | 500–1000 Hz | Joint torques | Optimal control or learned residual |
The high-level layer is the VLA architecture described in VLA Architecture. The mid-level layer is a smaller policy distilled from it and specialized for the embodiment. The low-level layer is a classical controller, typically a PD controller with feedforward gravity compensation, or a learned residual on top of an analytical model.
This separation reflects a hard engineering constraint: a single end-to-end network cannot meet the latency budget of a 1 kHz low-level loop while also performing the deliberate reasoning of a 5 Hz planning loop. Composition is the only viable solution.
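The three-rate composition can be sketched as nested loops clocked off a single low-level tick. All module bodies below are illustrative stubs, not the codex's actual interfaces; a 500 Hz low-level rate and a 12-DoF embodiment are assumptions for the example.

```python
# Sketch of the three-rate control hierarchy (illustrative stubs, not a real robot API).
# Low-level runs at 500 Hz; mid-level every 10 ticks (50 Hz); high-level every 100 ticks (5 Hz).

MID_DIV = 10    # 500 Hz / 10  = 50 Hz
HIGH_DIV = 100  # 500 Hz / 100 = 5 Hz

def high_level_plan(tick):
    """VLA layer: returns a subgoal velocity / heading command."""
    return {"vx": 0.5, "yaw_rate": 0.0}

def mid_level_policy(subgoal, tick):
    """Distilled policy: maps the subgoal to joint target positions."""
    return [0.0] * 12  # e.g. a 12-DoF quadruped

def low_level_pd(targets, tick):
    """PD controller with feedforward gravity compensation: joint torques."""
    kp, kd = 40.0, 1.0
    q, qd = [0.0] * 12, [0.0] * 12   # measured joint positions / velocities (stubbed)
    g_comp = [0.1] * 12              # feedforward gravity term (stubbed)
    return [kp * (t - qi) - kd * qdi + g
            for t, qi, qdi, g in zip(targets, q, qd, g_comp)]

subgoal, targets = None, None
torques_log = []
for tick in range(1000):                             # 2 s of control at 500 Hz
    if tick % HIGH_DIV == 0:
        subgoal = high_level_plan(tick)              # 5 Hz
    if tick % MID_DIV == 0:
        targets = mid_level_policy(subgoal, tick)    # 50 Hz
    torques_log.append(low_level_pd(targets, tick))  # 500 Hz
```

The key property the sketch shows is that the slow layers only refresh their outputs; the fast loop never blocks on them.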
Mid-level policy
The mid-level locomotion policy is a recurrent transformer trained with reinforcement learning in simulation, using proximal policy optimization with a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\right]$$

with importance ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and advantage estimates $\hat{A}_t$ from generalized advantage estimation.
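A minimal numpy sketch of the clipped surrogate loss and the GAE recursion, assuming arrays of per-step log-probabilities, rewards, and value predictions (all names illustrative):

```python
# Minimal PPO clipped-surrogate loss and GAE sketch (numpy; names illustrative).
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)               # importance ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))   # minimize the negative objective

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory."""
    values = np.append(values, last_value)
    adv, out = 0.0, []
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        adv = delta + gamma * lam * adv                         # discounted sum
        out.append(adv)
    return np.array(out[::-1])
```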
The reward composition for bipedal walking is illustrative:

$$r \;=\; w_{\text{vel}}\, r_{\text{vel}} \;+\; w_{\text{stab}}\, r_{\text{stab}} \;-\; w_{\tau}\, \lVert \tau \rVert^{2} \;-\; w_{\text{slip}}\, c_{\text{slip}} \;-\; w_{\text{post}}\, c_{\text{post}}$$

with terms for velocity tracking, stability, torque cost, slip penalty, and posture regularization. The coefficients $w_i$ are domain-specific and published per checkpoint.
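A toy version of this reward composition, with illustrative coefficients (not the checkpoint-published values) and exponential shaping for the tracking and stability terms:

```python
# Toy bipedal-walking reward (coefficients and shaping are illustrative assumptions).
import math

def walking_reward(v, v_cmd, roll, pitch, torques, slip_speed, q, q_nominal,
                   w=(1.0, 0.5, 1e-4, 0.3, 0.1)):
    w_vel, w_stab, w_tau, w_slip, w_post = w
    r_vel = math.exp(-abs(v - v_cmd))                  # velocity tracking
    r_stab = math.exp(-(roll**2 + pitch**2))           # base stability
    c_tau = sum(t * t for t in torques)                # torque cost
    c_slip = slip_speed                                # foot slip penalty
    c_post = sum((qi - qn) ** 2 for qi, qn in zip(q, q_nominal))  # posture reg.
    return (w_vel * r_vel + w_stab * r_stab
            - w_tau * c_tau - w_slip * c_slip - w_post * c_post)
```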
Teacher-student distillation
Locomotion policies are trained as teachers in simulation with privileged observations (ground-truth body state, terrain heightmap) and distilled into students with realistic observations (proprioception + IMU + downward camera). The distillation loss is

$$L_{\text{distill}} \;=\; \mathbb{E}\!\left[\,\lVert \pi_{\text{student}}(o) - \pi_{\text{teacher}}(s_{\text{priv}}) \rVert^{2}\,\right]$$

where $s_{\text{priv}}$ is the privileged simulator state and $o$ is the student's realistic observation of the same timestep.
This is the standard teacher-student pipeline and is responsible for the headline gains in legged locomotion since 2020.
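A minimal sketch of one distillation step, assuming a stub teacher on privileged state and a linear-tanh student on realistic observation features (all shapes, names, and the hand-derived gradient are illustrative assumptions):

```python
# Teacher-student distillation sketch: MSE between student and teacher actions
# (toy linear-tanh student; all shapes and names illustrative).
import numpy as np

rng = np.random.default_rng(0)

def teacher(priv_state):
    """Privileged teacher: ground-truth body state + heightmap (stubbed)."""
    return np.tanh(priv_state[:12])

def student(obs, W):
    """Realistic-observation student: proprioception + IMU + camera features."""
    return np.tanh(obs @ W)

def distill_step(W, obs, priv_state, lr=1e-2):
    """One gradient step on the MSE distillation loss; returns (W, loss)."""
    a_t = teacher(priv_state)
    a_s = student(obs, W)
    err = a_s - a_t
    # gradient of mean((tanh(obs @ W) - a_t)^2) with respect to W
    grad = np.outer(obs, err * (1 - a_s**2)) * (2 / err.size)
    return W - lr * grad, float(np.mean(err**2))

W = np.zeros((32, 12))
obs = rng.normal(size=32)    # one student observation (stub features)
priv = rng.normal(size=48)   # matching privileged state
W, loss0 = distill_step(W, obs, priv)
for _ in range(200):
    W, loss = distill_step(W, obs, priv)
```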
Stability constraints
Locomotion policies cannot be trained purely on outcome reward: the signal is too sparse and the failure mode (falling) is catastrophic, so the codex specifies safety as a hard constraint on the policy:

$$\max_{\theta}\; J(\theta) \quad \text{subject to} \quad \mathbb{E}_{\pi_\theta}\!\left[P(\text{fall})\right] \le \delta,$$
implemented as a Lagrangian-constrained policy gradient. The constraint is enforced both at training time (penalty proportional to predicted fall probability) and at deployment time (a separate fall-prediction model triggers a fallback controller before the policy can commit the catastrophic action).
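The two enforcement points can be sketched as a scalar dual-variable update (wrapping the policy-gradient step at training time) and a deployment-time guard; thresholds and learning rates here are illustrative:

```python
# Lagrangian dual update for the fall-probability constraint, plus the
# deployment-time fallback guard (scalar toy version; thresholds illustrative).

def lagrangian_update(lmbda, fall_prob, delta=0.01, lr_dual=0.1):
    """Raise the penalty multiplier when E[P(fall)] exceeds delta;
    decay it toward zero when the policy is safely inside the constraint."""
    lmbda = lmbda + lr_dual * (fall_prob - delta)
    return max(lmbda, 0.0)   # dual variable stays non-negative

def safe_action(policy_action, fallback_action, fall_predictor, obs, threshold=0.5):
    """Separate fall-prediction model gates the policy's action at deployment."""
    return fallback_action if fall_predictor(obs) > threshold else policy_action
```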
Terrain generalization
Real-world locomotion confronts terrain absent from any training distribution. The codex's response is curriculum randomization: simulated terrains are sampled from a parameterized family (slope, friction, deformability, obstacle density) with difficulty scaled to the agent's current success rate. This curriculum is what produces the empirically robust policies of recent legged locomotion work.
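A sketch of the success-rate-driven curriculum over the parameterized terrain family; the parameter ranges and promotion thresholds are illustrative assumptions:

```python
# Curriculum randomization sketch: terrain parameters scale with a difficulty
# level driven by the agent's success rate (ranges and thresholds illustrative).
import random

def sample_terrain(difficulty, rng=random):
    """difficulty in [0, 1] scales each terrain parameter's sampling range."""
    return {
        "slope":            rng.uniform(0.0, 0.4 * difficulty),        # radians
        "friction":         rng.uniform(1.0 - 0.7 * difficulty, 1.0),
        "deformability":    rng.uniform(0.0, difficulty),
        "obstacle_density": rng.uniform(0.0, 0.5 * difficulty),
    }

def update_difficulty(difficulty, success_rate, step=0.05,
                      promote_at=0.8, demote_at=0.4):
    """Promote when the agent is succeeding, demote when it is failing."""
    if success_rate > promote_at:
        difficulty += step
    elif success_rate < demote_at:
        difficulty -= step
    return min(max(difficulty, 0.0), 1.0)
```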
Coupling with manipulation
A humanoid agent must locomote and manipulate concurrently. The codex's solution is loose coupling: locomotion and manipulation are separate policies invoked over a shared body model, with a small coordination policy resolving conflicts (e.g., the manipulation policy's end-effector target induces a base-translation requirement that the locomotion policy must execute).
The coordination policy is trained end-to-end on whole-body tasks; ablations show it is necessary for humanoid-scale embodiments and unnecessary for fixed-base or simple-mobile embodiments.
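The loose-coupling interface can be illustrated with a hand-written coordinator; the actual coordination policy is learned end-to-end, so this geometric rule is only a stand-in showing how an out-of-reach end-effector target induces a base subgoal for the locomotion policy (all names and the reach radius are assumptions):

```python
# Illustration of the loose-coupling interface: an out-of-reach manipulation
# target induces a base-translation subgoal (hand-written stand-in for the
# learned coordination policy; names and reach radius are assumptions).

def coordinator(ee_target, base_pose, reach=0.8):
    """If the end-effector target is out of reach, emit a base subgoal for
    the locomotion policy; otherwise let the manipulation policy act in place."""
    dx = ee_target[0] - base_pose[0]
    dy = ee_target[1] - base_pose[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist > reach:
        scale = (dist - reach) / dist   # move just far enough to bring it in reach
        return {"move_base": True,
                "base_goal": (base_pose[0] + dx * scale,
                              base_pose[1] + dy * scale)}
    return {"move_base": False, "base_goal": base_pose}
```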