Manipulation
Manipulation is the embodiment regime in which the agent's end-effector contacts and reconfigures objects in the environment. It is the most studied robotic regime and the most demanding of the surfaces specified in this codex.
What manipulation requires
A manipulation policy must, simultaneously:
- Maintain force closure on grasped objects under uncertainty.
- Plan around occlusions and partial observability.
- Generalize across object geometries unseen at training time.
- Recover from failures (slipped grasps, missed targets) without external reset.
The codex's manipulation stack uses the VLA architecture described in VLA Architecture, specialized along three axes documented below.
Action representation
Manipulation actions are represented as end-effector deltas in $SE(3)$, paired with a continuous gripper command:

$$a_t = \big(\Delta p_t,\ \Delta r_t,\ g_t\big) \in \mathbb{R}^3 \times \mathfrak{so}(3) \times \mathbb{R}$$

Rotations are expressed in the tangent space $\mathfrak{so}(3)$ via the exponential map; this avoids the discontinuities of quaternion or Euler-angle representations under continuous prediction.
Action chunks of length $K = 16$ are emitted at 23 Hz, yielding a planning horizon of approximately 700 ms per chunk ($16 / 23\,\text{Hz} \approx 696$ ms).
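As a concrete sketch, the tangent-space rotation encoding can be implemented with a matrix log map. This is a minimal illustration assuming a 7-D action layout (translation delta, axis-angle rotation delta, gripper scalar); `so3_log` and `pack_action` are hypothetical names, not the codex's API.

```python
import numpy as np

def so3_log(R: np.ndarray) -> np.ndarray:
    """Map a rotation matrix to its axis-angle (tangent-space) vector."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Off-diagonal skew part; its normalized direction is the rotation axis.
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:                     # near identity: first-order approximation
        return w / 2.0
    return theta / (2.0 * np.sin(theta)) * w

def pack_action(delta_pos, R_delta, gripper):
    """7-D action: translation delta, so(3) rotation delta, gripper command."""
    return np.concatenate([delta_pos, so3_log(R_delta), [gripper]])
```

The log map is smooth near the identity (where per-step deltas live), which is precisely where quaternion double-cover and Euler-angle wrap-around cause trouble for regression targets.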
Contact-rich behaviors
For contact-rich tasks (insertion, screwing, wiping), the physics stream of the action transformer is active. Tactile and force/torque inputs are encoded at a higher rate than vision (200 Hz for force/torque, vs 30 Hz for vision) and injected into the P stream of VLA Architecture.
Empirically, the physics stream adds approximately 8 percentage points to task success on standard contact-rich benchmarks; the marginal improvement on contact-free tasks is below 1 point. The gate in the multi-stream attention learns to attenuate P-stream contributions when contact is absent.
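A minimal sketch of the gated multi-rate fusion described above, assuming mean-pooling of each force/torque window down to the vision rate and a scalar sigmoid gate; `fuse_streams` and the weight shapes are illustrative assumptions, not the actual P-stream implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_streams(vision_feat, ft_window, W_gate, b_gate, W_p):
    """Gated fusion of one 30 Hz vision feature with its ~200 Hz F/T window.

    ft_window: (T, 6) force/torque samples covering one vision frame (~6-7).
    A learned scalar gate attenuates the physics stream when contact is absent.
    """
    p = ft_window.mean(axis=0)            # temporal pooling to the vision rate
    gate = sigmoid(W_gate @ p + b_gate)   # scalar in (0, 1)
    return vision_feat + gate * (W_p @ p), gate
```

With a negative gate bias, free-space motion (near-zero F/T readings) yields a gate close to 0, matching the observed attenuation of P-stream contributions on contact-free tasks.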
Imitation + residual RL
Manipulation policies are trained in two stages.
Stage 1: Imitation. The flow-matching loss of VLA Architecture is minimized against a corpus of demonstration trajectories $\mathcal{D}$:

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{a_1 \sim \mathcal{D},\ a_0 \sim \mathcal{N}(0, I),\ t \sim \mathcal{U}[0,1]} \big\| v_\theta(a_t, t \mid o) - (a_1 - a_0) \big\|^2, \qquad a_t = (1 - t)\,a_0 + t\,a_1
$$
This stage produces a competent base policy that achieves moderate success on in-distribution tasks but generalizes poorly to novel geometries.
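Stage 1 can be sketched as a standard conditional flow-matching objective with a linear interpolation path; the function below is an illustrative Monte-Carlo estimate, not the codex's training code.

```python
import numpy as np

def flow_matching_loss(v_theta, actions, rng):
    """One Monte-Carlo estimate of the conditional flow-matching loss.

    actions: (B, D) demonstration action chunks (the data endpoint a1).
    v_theta: callable (a_t, t) -> predicted velocity (the policy's flow head).
    Linear path: a_t = (1 - t) a0 + t a1, whose constant velocity a1 - a0
    is the regression target.
    """
    B, D = actions.shape
    a0 = rng.normal(size=(B, D))          # noise endpoint
    t = rng.uniform(size=(B, 1))          # per-sample flow time
    a_t = (1.0 - t) * a0 + t * actions    # point on the straight-line path
    target = actions - a0                 # velocity of that path
    pred = v_theta(a_t, t)
    return np.mean((pred - target) ** 2)
```

At inference time, actions are sampled by integrating the learned velocity field from noise to data, one chunk at a time.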
Stage 2: Residual RL. A small residual policy $\pi_\phi$ is added on top of the imitation base $\pi_\theta$ and trained with a constrained policy gradient:

$$\max_\phi\ \mathbb{E}\big[\hat{A}(s, a)\,\log \pi_\phi(a \mid s)\big] \quad \text{s.t.} \quad \mathbb{E}_s\big[D_{\mathrm{KL}}\big(\pi_\phi(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\big] \le \epsilon,
$$

with advantage estimates $\hat{A}$ from a learned value function $V_\psi$ and a KL constraint $\epsilon$. This stage closes the long tail of failure modes; ablations report 12–18 percentage points of additional success on object generalization.
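A hedged sketch of the Stage-2 update, assuming a diagonal-Gaussian residual with fixed variance and the KL constraint relaxed to a penalty with multiplier `beta` (a common Lagrangian treatment); the names and shapes are illustrative, not the codex's implementation.

```python
import numpy as np

def residual_pg_loss(mu_res, sigma, actions, base_means, advantages, beta):
    """Penalized surrogate for the residual policy update.

    The residual shifts the base Gaussian's mean. For equal-variance
    diagonal Gaussians, KL(pi_phi || pi_base) has the closed form
    ||mu_res||^2 / (2 sigma^2); beta is the constraint's multiplier.
    """
    mean = base_means + mu_res                       # combined policy mean
    logp = -0.5 * np.sum(((actions - mean) / sigma) ** 2, axis=-1)
    kl = np.sum(mu_res ** 2) / (2.0 * sigma ** 2)    # drift from the base
    return -np.mean(advantages * logp) + beta * kl
```

The KL term is what keeps residual RL from destroying the imitation prior: early in training the penalty dominates, so the residual only moves where the advantage signal is consistently large.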
Long-horizon decomposition
Tasks beyond approximately one second of execution are decomposed into subgoals by the Reasoning surface. The manipulation policy is then invoked once per subgoal with a refreshed goal latent. This hierarchical decomposition is necessary because:
- Flow matching cannot reliably produce trajectories longer than one chunk's horizon without re-conditioning.
- World-model rollouts beyond 16 chunks accumulate error super-linearly (see World Models).
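The resulting control loop can be sketched as follows; `reasoner`, `policy`, and `env` are hypothetical interfaces standing in for the Reasoning surface, the VLA policy, and the robot runtime.

```python
def execute_task(task, reasoner, policy, env, max_subgoals=32):
    """Hierarchical execution: the reasoning layer proposes subgoals, and
    the manipulation policy runs one re-conditioned chunk loop per subgoal.
    """
    obs = env.observe()
    for subgoal in reasoner.decompose(task, obs)[:max_subgoals]:
        goal_latent = reasoner.encode(subgoal)       # refreshed per subgoal
        while not reasoner.achieved(subgoal, obs):
            chunk = policy.act(obs, goal_latent)     # one ~700 ms action chunk
            obs = env.step(chunk)
    return obs
```

Re-encoding the goal latent at each subgoal boundary is the re-conditioning step that keeps each flow-matching rollout within a single chunk's reliable horizon.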
Sim-to-real
Manipulation training mixes simulated and real demonstrations at a ratio of approximately 3:1 simulated to real. Sim-to-real transfer relies on:
- Domain randomization: physical parameters, lighting, camera placement.
- System identification residuals: an embodiment-specific calibration adapter trained on a few hundred real-world episodes per robot.
- Tactile augmentation: synthetic tactile signals from a learned tactile simulator, replacing the limited tactile sims available natively.
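The domain-randomization axis can be sketched as per-episode parameter sampling; the parameter names and ranges below are illustrative placeholders, not the codex's calibration values.

```python
import numpy as np

def sample_randomization(rng):
    """Draw one domain-randomization configuration for a sim episode.

    Each episode sees a fresh draw of physical, lighting, and camera
    parameters so the policy cannot overfit the simulator's nominal values.
    """
    return {
        "friction":        rng.uniform(0.4, 1.2),        # contact friction coeff.
        "object_mass_kg":  rng.lognormal(-1.5, 0.5),     # skewed toward light objects
        "light_azimuth":   rng.uniform(0.0, 2 * np.pi),  # lighting direction
        "camera_jitter_m": rng.normal(0.0, 0.01, size=3) # placement noise
    }
```

The system-identification residuals then narrow these broad priors to a specific robot, which is why only a few hundred real episodes per embodiment are needed.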
The simulation-to-real gap is measured per benchmark; at the current operating point, a policy's real-world success rate typically tracks its simulation success rate to within a benchmark-dependent margin, without per-robot fine-tuning.
Evaluation suites
Manipulation policies are evaluated on a rotating set of public benchmarks:
- LIBERO: long-horizon tabletop manipulation, 4 task suites.
- RoboCasa: kitchen and household manipulation.
- SimplerEnv: real-to-sim evaluation harness for cross-comparison.
Scores from these suites are published on-chain alongside each checkpoint.