Locomotion
Locomotion is the embodiment regime in which the agent's body translates through the environment. It includes bipedal, quadrupedal, wheeled, and aerial bases. Locomotion is treated separately from manipulation because its dynamics, control frequencies, and failure modes are structurally distinct.
Control hierarchy
Locomotion is controlled at three frequencies, each implemented by a different module:
| Layer | Frequency | Output | Trained by |
|---|---|---|---|
| High-level | 5 Hz | Subgoal velocity / heading | Imitation + reasoning |
| Mid-level | 50 Hz | Joint target positions | Imitation + RL |
| Low-level | 500–1000 Hz | Joint torques | Optimal control or learned residual |
The high-level layer is the VLA architecture described in VLA Architecture. The mid-level layer is a smaller policy distilled from it and specialized for the embodiment. The low-level layer is a classical controller, typically a PD controller with feedforward gravity compensation, or a learned residual on top of an analytical model.
This separation reflects a hard engineering constraint: a single end-to-end network cannot meet the latency budget of a 1 kHz low-level loop while also performing the deliberate reasoning of a 5 Hz planning loop. Composition is the only viable solution.
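The three-rate composition can be sketched as nested loops clocked off a single low-level tick. All module bodies below are illustrative stubs, not the codex's actual interfaces; a 500 Hz low-level rate and a 12-DoF embodiment are assumptions for the example.

```python
# Sketch of the three-rate control hierarchy (illustrative stubs, not a real robot API).
# Low-level runs at 500 Hz; mid-level every 10 ticks (50 Hz); high-level every 100 ticks (5 Hz).

MID_DIV = 10    # 500 Hz / 10  = 50 Hz
HIGH_DIV = 100  # 500 Hz / 100 = 5 Hz

def high_level_plan(tick):
    """VLA layer: returns a subgoal velocity / heading command."""
    return {"vx": 0.5, "yaw_rate": 0.0}

def mid_level_policy(subgoal, tick):
    """Distilled policy: maps the subgoal to joint target positions."""
    return [0.0] * 12  # e.g. a 12-DoF quadruped

def low_level_pd(targets, tick):
    """PD controller with feedforward gravity compensation: joint torques."""
    kp, kd = 40.0, 1.0
    q, qd = [0.0] * 12, [0.0] * 12   # measured joint positions / velocities (stubbed)
    g_comp = [0.1] * 12              # feedforward gravity term (stubbed)
    return [kp * (t - qi) - kd * qdi + g
            for t, qi, qdi, g in zip(targets, q, qd, g_comp)]

subgoal, targets = None, None
torques_log = []
for tick in range(1000):                             # 2 s of control at 500 Hz
    if tick % HIGH_DIV == 0:
        subgoal = high_level_plan(tick)              # 5 Hz
    if tick % MID_DIV == 0:
        targets = mid_level_policy(subgoal, tick)    # 50 Hz
    torques_log.append(low_level_pd(targets, tick))  # 500 Hz
```

The key property the sketch shows is that the slow layers only refresh their outputs; the fast loop never blocks on them.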
Mid-level policy
The mid-level locomotion policy is a recurrent transformer trained with reinforcement learning in simulation, using proximal policy optimization with a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\right]$$

with importance ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and advantage estimates $\hat{A}_t$ from generalized advantage estimation.
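A minimal numpy sketch of the clipped surrogate loss and the GAE recursion, assuming arrays of per-step log-probabilities, rewards, and value predictions (all names illustrative):

```python
# Minimal PPO clipped-surrogate loss and GAE sketch (numpy; names illustrative).
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)               # importance ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))   # minimize the negative objective

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory."""
    values = np.append(values, last_value)
    adv, out = 0.0, []
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        adv = delta + gamma * lam * adv                         # discounted sum
        out.append(adv)
    return np.array(out[::-1])
```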
The reward composition for bipedal walking is illustrative:

$$r \;=\; w_{\text{vel}}\, r_{\text{vel}} \;+\; w_{\text{stab}}\, r_{\text{stab}} \;-\; w_{\tau}\, \lVert \tau \rVert^{2} \;-\; w_{\text{slip}}\, c_{\text{slip}} \;-\; w_{\text{post}}\, c_{\text{post}}$$

with terms for velocity tracking, stability, torque cost, slip penalty, and posture regularization. The coefficients $w_i$ are domain-specific and published per checkpoint.
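A toy version of this reward composition, with illustrative coefficients (not the checkpoint-published values) and exponential shaping for the tracking and stability terms:

```python
# Toy bipedal-walking reward (coefficients and shaping are illustrative assumptions).
import math

def walking_reward(v, v_cmd, roll, pitch, torques, slip_speed, q, q_nominal,
                   w=(1.0, 0.5, 1e-4, 0.3, 0.1)):
    w_vel, w_stab, w_tau, w_slip, w_post = w
    r_vel = math.exp(-abs(v - v_cmd))                  # velocity tracking
    r_stab = math.exp(-(roll**2 + pitch**2))           # base stability
    c_tau = sum(t * t for t in torques)                # torque cost
    c_slip = slip_speed                                # foot slip penalty
    c_post = sum((qi - qn) ** 2 for qi, qn in zip(q, q_nominal))  # posture reg.
    return (w_vel * r_vel + w_stab * r_stab
            - w_tau * c_tau - w_slip * c_slip - w_post * c_post)
```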
Teacher-student distillation
Locomotion policies are trained as teachers in simulation with privileged observations (ground-truth body state, terrain heightmap) and distilled into students with realistic observations (proprioception + IMU + downward camera). The distillation loss is

$$L_{\text{distill}} \;=\; \mathbb{E}\!\left[\,\lVert \pi_{\text{student}}(o) - \pi_{\text{teacher}}(s_{\text{priv}}) \rVert^{2}\,\right]$$

where $s_{\text{priv}}$ is the privileged simulator state and $o$ is the student's realistic observation of the same timestep.
This is the standard teacher-student pipeline and is responsible for the headline gains in legged locomotion since 2020.
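A minimal sketch of one distillation step, assuming a stub teacher on privileged state and a linear-tanh student on realistic observation features (all shapes, names, and the hand-derived gradient are illustrative assumptions):

```python
# Teacher-student distillation sketch: MSE between student and teacher actions
# (toy linear-tanh student; all shapes and names illustrative).
import numpy as np

rng = np.random.default_rng(0)

def teacher(priv_state):
    """Privileged teacher: ground-truth body state + heightmap (stubbed)."""
    return np.tanh(priv_state[:12])

def student(obs, W):
    """Realistic-observation student: proprioception + IMU + camera features."""
    return np.tanh(obs @ W)

def distill_step(W, obs, priv_state, lr=1e-2):
    """One gradient step on the MSE distillation loss; returns (W, loss)."""
    a_t = teacher(priv_state)
    a_s = student(obs, W)
    err = a_s - a_t
    # gradient of mean((tanh(obs @ W) - a_t)^2) with respect to W
    grad = np.outer(obs, err * (1 - a_s**2)) * (2 / err.size)
    return W - lr * grad, float(np.mean(err**2))

W = np.zeros((32, 12))
obs = rng.normal(size=32)    # one student observation (stub features)
priv = rng.normal(size=48)   # matching privileged state
W, loss0 = distill_step(W, obs, priv)
for _ in range(200):
    W, loss = distill_step(W, obs, priv)
```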
Stability constraints
Locomotion policies cannot be trained purely on outcome reward: the signal is too sparse and the failure mode (falling) is catastrophic, so the codex specifies safety as a hard constraint on the policy:

$$\max_{\theta}\; J(\theta) \quad \text{subject to} \quad \mathbb{E}_{\pi_\theta}\!\left[P(\text{fall})\right] \le \delta,$$
implemented as a Lagrangian-constrained policy gradient. The constraint is enforced both at training time (penalty proportional to predicted fall probability) and at deployment time (a separate fall-prediction model triggers a fallback controller before the policy can commit the catastrophic action).
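The two enforcement points can be sketched as a scalar dual-variable update (wrapping the policy-gradient step at training time) and a deployment-time guard; thresholds and learning rates here are illustrative:

```python
# Lagrangian dual update for the fall-probability constraint, plus the
# deployment-time fallback guard (scalar toy version; thresholds illustrative).

def lagrangian_update(lmbda, fall_prob, delta=0.01, lr_dual=0.1):
    """Raise the penalty multiplier when E[P(fall)] exceeds delta;
    decay it toward zero when the policy is safely inside the constraint."""
    lmbda = lmbda + lr_dual * (fall_prob - delta)
    return max(lmbda, 0.0)   # dual variable stays non-negative

def safe_action(policy_action, fallback_action, fall_predictor, obs, threshold=0.5):
    """Separate fall-prediction model gates the policy's action at deployment."""
    return fallback_action if fall_predictor(obs) > threshold else policy_action
```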
Terrain generalization
Real-world locomotion confronts terrain absent from any training distribution. The codex's response is curriculum randomization: simulated terrains are sampled from a parameterized family (slope, friction, deformability, obstacle density) with difficulty scaled to the agent's current success rate. This curriculum is what produces the empirically robust policies of recent legged locomotion work.
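A sketch of the success-rate-driven curriculum over the parameterized terrain family; the parameter ranges and promotion thresholds are illustrative assumptions:

```python
# Curriculum randomization sketch: terrain parameters scale with a difficulty
# level driven by the agent's success rate (ranges and thresholds illustrative).
import random

def sample_terrain(difficulty, rng=random):
    """difficulty in [0, 1] scales each terrain parameter's sampling range."""
    return {
        "slope":            rng.uniform(0.0, 0.4 * difficulty),        # radians
        "friction":         rng.uniform(1.0 - 0.7 * difficulty, 1.0),
        "deformability":    rng.uniform(0.0, difficulty),
        "obstacle_density": rng.uniform(0.0, 0.5 * difficulty),
    }

def update_difficulty(difficulty, success_rate, step=0.05,
                      promote_at=0.8, demote_at=0.4):
    """Promote when the agent is succeeding, demote when it is failing."""
    if success_rate > promote_at:
        difficulty += step
    elif success_rate < demote_at:
        difficulty -= step
    return min(max(difficulty, 0.0), 1.0)
```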
Coupling with manipulation
A humanoid agent must locomote and manipulate concurrently. The codex's solution is loose coupling: locomotion and manipulation are separate policies invoked over a shared body model, with a small coordination policy resolving conflicts (e.g., the manipulation policy's end-effector target induces a base-translation requirement that the locomotion policy must execute).
The coordination policy is trained end-to-end on whole-body tasks; ablations show it is necessary for humanoid-scale embodiments and unnecessary for fixed-base or simple-mobile embodiments.
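The loose-coupling interface can be illustrated with a hand-written coordinator; the actual coordination policy is learned end-to-end, so this geometric rule is only a stand-in showing how an out-of-reach end-effector target induces a base subgoal for the locomotion policy (all names and the reach radius are assumptions):

```python
# Illustration of the loose-coupling interface: an out-of-reach manipulation
# target induces a base-translation subgoal (hand-written stand-in for the
# learned coordination policy; names and reach radius are assumptions).

def coordinator(ee_target, base_pose, reach=0.8):
    """If the end-effector target is out of reach, emit a base subgoal for
    the locomotion policy; otherwise let the manipulation policy act in place."""
    dx = ee_target[0] - base_pose[0]
    dy = ee_target[1] - base_pose[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist > reach:
        scale = (dist - reach) / dist   # move just far enough to bring it in reach
        return {"move_base": True,
                "base_goal": (base_pose[0] + dx * scale,
                              base_pose[1] + dy * scale)}
    return {"move_base": False, "base_goal": base_pose}
```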