OGI

Multi-Embodiment

A general embodied intelligence operates across many bodies. The codex's policy is to share as much of the model as possible across embodiments and to localize only the irreducibly embodiment-specific components.

What varies across embodiments

For two robots executing the same task:

  • The observation space may differ (number of cameras, presence of tactile sensors, depth modalities).
  • The action space differs in dimensionality and semantics (a 7-DoF arm and a 12-DoF hand have different command vectors; a wheeled base and a bipedal one have entirely different controls).
  • The physical dynamics differ: link lengths, joint limits, mass distributions, contact properties.

What does not vary is the cognitive layer: the goal, the visual perception backbone, the reasoning over latents, the world model. These remain shared.

Shared / specialized partition

The codex partitions the embodied stack as:

| Component | Shared across embodiments | Specialized per embodiment |
| --- | --- | --- |
| Vision encoder | ✓ | |
| Cognitive backbone | ✓ | |
| Cognition tokens | ✓ | |
| Action transformer | ✓ (shared weights) | |
| Modality projectors (state → latent) | | ✓ |
| Decoder MLP heads (latent → motor) | | ✓ |
| Low-level controller | | ✓ |

The action transformer is shared, but its inputs are shaped per embodiment via the modality projectors. This is the mechanism by which a single set of trained weights serves dozens of robot morphologies.
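The partition above can be sketched in a few lines. This is a minimal illustration, not the codex's implementation: the latent width, the random linear projectors, and the `tanh` stand-in for the action transformer are all assumptions chosen for brevity.

```python
import numpy as np

LATENT_DIM = 8  # hypothetical shared latent width

def make_projector(state_dim: int, rng: np.random.Generator) -> np.ndarray:
    # Per-embodiment linear map: robot state -> shared latent space.
    return rng.standard_normal((LATENT_DIM, state_dim))

def shared_action_transformer(latent: np.ndarray, w_shared: np.ndarray) -> np.ndarray:
    # One set of weights serves every embodiment; only the inputs differ.
    return np.tanh(w_shared @ latent)

rng = np.random.default_rng(0)
w_shared = rng.standard_normal((LATENT_DIM, LATENT_DIM))

# Two embodiments with different state dimensionalities share the same core.
arm_proj = make_projector(7, rng)    # 7-DoF arm
hand_proj = make_projector(12, rng)  # 12-DoF hand

arm_out = shared_action_transformer(arm_proj @ rng.standard_normal(7), w_shared)
hand_out = shared_action_transformer(hand_proj @ rng.standard_normal(12), w_shared)
```

Only the projectors (and, downstream, the decoder heads) differ per body; `w_shared` is identical for both calls.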

Embodiment tags

Each embodiment family receives a tag: a learned embedding $e_k \in \mathbb{R}^d$ injected into the cognition tokens at the backbone exit. The tag conditions the action transformer to produce outputs in the correct action space.

The tag mechanism is identical to that of language-model task tokens and inherits the same properties: tags can be combined, interpolated, and fine-tuned without retraining the backbone.
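A sketch of additive tag injection and interpolation, under assumptions not fixed by the text: the tag dimension, the hand-written embeddings, and the additive (rather than concatenative) injection are illustrative choices.

```python
import numpy as np

TAG_DIM = 4  # hypothetical tag width

# One learned embedding per embodiment family (hand-set here for clarity).
tags = {
    "arm7": np.array([1.0, 0.0, 0.0, 0.0]),
    "hand12": np.array([0.0, 1.0, 0.0, 0.0]),
}

def inject_tag(cognition_tokens: np.ndarray, tag: np.ndarray) -> np.ndarray:
    # Additive injection at the backbone exit; broadcasts over the token axis.
    return cognition_tokens + tag

def blend_tags(tag_a: np.ndarray, tag_b: np.ndarray, alpha: float) -> np.ndarray:
    # Tags can be interpolated without touching backbone weights.
    return (1.0 - alpha) * tag_a + alpha * tag_b

tokens = np.zeros((3, TAG_DIM))  # three cognition tokens
mixed = blend_tags(tags["arm7"], tags["hand12"], 0.5)
conditioned = inject_tag(tokens, mixed)
```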

Cross-embodiment transfer

Empirically, training the action transformer on a mixture of embodiments produces positive transfer: a policy trained on $\{A, B, C\}$ achieves higher success on $A$ than a policy trained on $A$ alone, with the gain scaling roughly logarithmically in the number of co-trained embodiments. This is the multi-embodiment scaling law observed in recent work.

Let $S(N)$ denote the average success rate over a benchmark when training on $N$ co-trained embodiments. The empirical fit is

$S(N) \approx S_\infty - (S_\infty - S_1)\, N^{-\alpha}, \quad \alpha \in [0.18, 0.30]$

where $S_\infty$ is the asymptotic success rate and $S_1$ is single-embodiment success. The fitted $\alpha$ varies by task family but is consistently positive: increasing $N$ never hurts.
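The fit is easy to evaluate directly. The numeric values below ($S_\infty = 0.95$, $S_1 = 0.60$, $\alpha = 0.25$) are illustrative, not measurements from the codex; only the functional form comes from the text.

```python
def predicted_success(n: int, s_inf: float, s_1: float, alpha: float) -> float:
    # S(N) = S_inf - (S_inf - S_1) * N^(-alpha)
    return s_inf - (s_inf - s_1) * n ** (-alpha)

# Illustrative parameters only.
s_inf, s_1, alpha = 0.95, 0.60, 0.25
curve = [predicted_success(n, s_inf, s_1, alpha) for n in (1, 2, 4, 8)]
```

At $N = 1$ the formula recovers $S_1$ exactly, and because $N^{-\alpha}$ is decreasing for positive $\alpha$, the curve rises monotonically toward $S_\infty$.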

Adding an embodiment

A new robot embodiment is added to the network by:

  1. Defining the modality projector and decoder MLP shapes from the URDF (joint count, sensor configuration).
  2. Collecting a teleoperated dataset of $\sim 1000$ episodes covering the embodiment's task envelope.
  3. Fine-tuning the modality projector + decoder against the frozen shared backbone.
  4. Submitting an embodiment registration proposal to governance with the eval results.
  5. Upon approval, the embodiment tag is added to the active registry.

Items 1–3 are amenable to community contribution; items 4–5 are on-chain.
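Step 1 above can be sketched as shape derivation from a URDF-derived spec. The field names, the 256-wide latent, and the assumption that the projector consumes proprioception while the decoder emits per-joint commands are all hypothetical details for illustration.

```python
from dataclasses import dataclass

@dataclass
class EmbodimentSpec:
    """Hypothetical fields a URDF-derived spec would carry."""
    joint_count: int
    camera_count: int
    proprio_dim: int

def derive_shapes(spec: EmbodimentSpec, latent_dim: int = 256):
    # Projector: proprioceptive state -> shared latent space.
    projector_shape = (latent_dim, spec.proprio_dim)
    # Decoder MLP head: shared latent -> per-joint motor commands.
    decoder_shape = (spec.joint_count, latent_dim)
    return projector_shape, decoder_shape

arm = EmbodimentSpec(joint_count=7, camera_count=2, proprio_dim=21)
proj_shape, dec_shape = derive_shapes(arm)
```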

Embodiment registry

The set of recognized embodiments is an on-chain registry: a mapping from embodiment ID to (tag embedding hash, projector shape, decoder shape, license). Validators serving inference for a given embodiment query the registry to load the correct projector and decoder weights.

The registry is append-only at the protocol level. Deprecation is supported (an embodiment can be marked inactive) but not deletion.
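An in-memory sketch of the registry's append-only semantics, assuming a plain dictionary store and string IDs; the actual on-chain encoding is not specified here.

```python
class EmbodimentRegistry:
    """Append-only mapping: records can be deactivated, never deleted."""

    def __init__(self) -> None:
        self._entries: dict = {}

    def register(self, embodiment_id: str, tag_hash: str,
                 projector_shape: tuple, decoder_shape: tuple,
                 license_id: str) -> None:
        if embodiment_id in self._entries:
            raise ValueError("registry IDs are write-once")
        self._entries[embodiment_id] = {
            "tag_hash": tag_hash,
            "projector_shape": projector_shape,
            "decoder_shape": decoder_shape,
            "license": license_id,
            "active": True,
        }

    def deprecate(self, embodiment_id: str) -> None:
        # Deprecation flips a flag; the record itself remains.
        self._entries[embodiment_id]["active"] = False

    def lookup(self, embodiment_id: str) -> dict:
        # Validators query this to load the right projector/decoder weights.
        return self._entries[embodiment_id]

reg = EmbodimentRegistry()
reg.register("arm7", "0xabc", (256, 21), (7, 256), "apache-2.0")
reg.deprecate("arm7")
```

Note there is no `delete` method at all: the only mutations are appending a new record and flipping the `active` flag.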

Constraints on what can be added

Not every body can be added. The codex enforces:

  • Action space dimensionality $\leq 64$. Higher-dimensional bodies are decomposable into chained sub-bodies.
  • Control frequency in $[5, 100]$ Hz for the high-level layer. Faster low-level loops live in the embodiment-specific controller.
  • Observation space must include at least one camera and one proprioceptive signal. Pure-tactile or pure-IMU embodiments require a wrapping layer and are subject to a separate eval suite.

These constraints are documented limits, not aspirations. An embodiment outside them is excluded until the constraints themselves are revised through governance.
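The three documented limits reduce to a simple admissibility predicate. The function name and argument set are hypothetical; the thresholds are the ones stated above.

```python
def admissible(action_dim: int, control_hz: float,
               camera_count: int, proprio_dim: int) -> bool:
    # Documented limits; an embodiment outside them is excluded until
    # the constraints themselves are revised through governance.
    return (
        action_dim <= 64              # action space dimensionality cap
        and 5 <= control_hz <= 100    # high-level control frequency band
        and camera_count >= 1         # at least one camera
        and proprio_dim >= 1          # at least one proprioceptive signal
    )
```

A pure-tactile body (`camera_count == 0`) fails this check directly; per the text it would need a wrapping layer and a separate eval suite rather than direct admission.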