Embodiment: Overview
Embodiment is the surface that converts latent state into physical action. It is treated separately from cognition because its constraints are sharply different: latency is hard-bounded, action spaces are continuous, and verification cannot rely on text matching.
The embodiment problem in one paragraph
Given a stream of multi-modal observations and a goal latent z, produce a sequence of motor commands such that the resulting trajectory satisfies the goal. The motor commands live in a high-dimensional, continuous, embodiment-specific space (joint positions, end-effector deltas, base velocities); the observations live in a much higher-dimensional perceptual space. The policy must respect physical constraints, react within the embodiment's control-loop period (typically 10–50 ms, i.e. a 20–100 Hz loop rate), and generalize across embodiments without per-robot retraining.
Formally, the policy is

\[ a_{t:t+H} = \pi_\theta(o_{\le t}, z), \]

where \(H\) is the action horizon (the policy emits chunks, not single actions; see action chunking in VLA Architecture) and \(\theta\) is shared across embodiments via the multi-embodiment scheme.
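The policy signature above can be sketched as a chunk-emitting interface. This is an illustrative sketch, not the book's implementation: the class name, dimensions, and the random stand-in for the network are all assumptions.

```python
# Sketch of a chunked policy pi_theta: (observations, goal latent z) -> a_{t:t+H}.
# All names and sizes here are illustrative, not taken from the book.
import random
from dataclasses import dataclass


@dataclass
class ChunkedPolicy:
    action_dim: int  # embodiment-specific motor dimension (e.g. 7 joint targets)
    horizon: int     # H, the action horizon: actions emitted per inference call

    def act(self, obs: list[float], z: list[float]) -> list[list[float]]:
        """Return one action chunk of shape (horizon, action_dim)."""
        # Stand-in for the real network; a flow-matching or diffusion
        # action head would produce this chunk instead.
        return [[random.gauss(0.0, 1.0) for _ in range(self.action_dim)]
                for _ in range(self.horizon)]


policy = ChunkedPolicy(action_dim=7, horizon=16)
chunk = policy.act(obs=[0.0] * 512, z=[0.0] * 64)
assert len(chunk) == 16 and len(chunk[0]) == 7  # one call covers H control steps
```

Emitting a chunk per inference call is what decouples the (slow) model forward pass from the (fast) control loop: the controller consumes the chunk at the loop rate while the next chunk is computed.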
Why a separate surface
A naive system might treat motor control as another decoder of the language model. This fails for three measurable reasons:
- Latency. Autoregressive token emission at 50 tok/s gives an effective control rate well below 5 Hz; most useful manipulation requires 20+ Hz. See "Latency budget" in VLA Architecture.
- Action distribution. Motor distributions are continuous and multimodal. Softmax-over-tokens is a poor parameterization; flow matching and diffusion are the empirically validated alternatives.
- Training signal. Embodied learning has access to outcome supervision (task success) that language does not. The training pipeline must exploit this.
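The latency point can be checked with back-of-envelope arithmetic. The token cost per action below is my assumption for illustration; the 50 tok/s figure is from the text.

```python
# Why autoregressive token emission is too slow for closed-loop control.
# tokens_per_action is an assumed encoding cost, not a figure from the book.
tok_per_s = 50.0
tokens_per_action = 10.0              # assumed: one motor action ~ 10 tokens
control_hz = tok_per_s / tokens_per_action
assert control_hz == 5.0              # 5 Hz at best with this encoding
assert control_hz < 20.0              # short of the 20+ Hz manipulation needs
```

Richer action encodings cost more tokens per action, pushing the effective rate even further below the 20 Hz floor, which is why the text calls for a non-autoregressive action head.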
Composition of Part II
The remaining chapters of Part II treat:
- VLA architecture, the unified architecture for embodied action: vision + language + action, with multi-stream attention and flow-matching action heads.
- Manipulation, grasping, dexterous manipulation, contact-rich tasks.
- Locomotion, bipedal, quadrupedal, wheeled and aerial base control.
- Multi-embodiment, how a single set of weights serves many robot morphologies.
Sim-to-real transfer, real-time chunking, and safety are treated within these chapters at the points they become relevant.
What this part assumes
The reader has internalized the hourglass (Shape of AGI). The embodiment surface lives below the hourglass on the action side; vision and proprioception encoders live above on the observation side. The shared latent is the interface.