
The VLA Architecture

The Vision-Language-Action (VLA) architecture is the specific instantiation of the embodiment surface used by OGI. It is documented here in enough detail that an independent reader could re-implement it.

High-level structure

The architecture has four blocks:

  1. Perceptual encoders. Multi-camera vision + proprioception + tactile + force/torque encoders, each mapping its input into the shared latent.
  2. Cognitive backbone. A transformer stack that fuses perceptual tokens, the language instruction, and a small set of cognition tokens: fixed-count learnable queries that absorb task-relevant information from the rest of the latent.
  3. Action transformer. A multi-stream transformer that consumes cognition tokens, current state, and a noised action chunk, and denoises the action chunk to a clean motor plan via flow matching.
  4. Embodiment-conditioned decoder. A small MLP head per embodiment family that maps the clean action chunk into the embodiment's native command space.

Formally, with cognition tokens $c \in \mathbb{R}^{K \times d}$, state $s \in \mathbb{R}^{d_s}$, and action chunk $a \in \mathbb{R}^{H \times d_a}$:

$$c = \text{Backbone}(o, g)$$
$$\hat{v} = \text{ActionTransformer}(c, s, a_t, t)$$
$$a_0 = \text{ODESolver}(\hat{v}, a_1, t = 1 \to 0)$$
$$u = \text{Decoder}_e(a_0)$$

where $o$ is the multi-modal observation, $g$ is the language instruction, $a_t$ is the noised action chunk at flow time $t \in [0,1]$, $\hat{v}$ is the predicted velocity field, $a_0$ is the clean action chunk, and $u$ is the embodiment-native command emitted to the robot.
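A minimal sketch of this four-stage inference path in PyTorch-style code; the module names, tensor shapes, and chunk dimensions below are illustrative placeholders, not the reference implementation:

```python
import torch

@torch.no_grad()
def vla_step(encoders, backbone, action_transformer, decoders,
             obs, instruction, state, embodiment, H=16, d_a=32, n_steps=4):
    # 1. Perceptual encoders: each modality is mapped into the shared latent.
    tokens = torch.cat([enc(obs[name]) for name, enc in encoders.items()], dim=1)

    # 2. Cognitive backbone: fuse perception + language into K cognition tokens.
    c = backbone(tokens, instruction)                        # (B, K, d)

    # 3. Action transformer + flow matching: denoise a_1 -> a_0 over the chunk.
    a = torch.randn(c.shape[0], H, d_a, device=c.device)     # a_1 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((c.shape[0],), 1.0 - i * dt, device=c.device)
        v = action_transformer(c, state, a, t)                # predicted velocity field
        a = a - dt * v                                        # Euler step toward t = 0

    # 4. Embodiment-conditioned decoder: clean chunk -> native command space.
    return decoders[embodiment](a)
```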

Cognition tokens

Cognition tokens are a fixed-count set of learnable queries (typically $K = 64$) that cross-attend to the full perceptual + language latent and absorb a compressed representation of the task. They are the only output of the backbone consumed by the action transformer.

This is a deliberate bottleneck. It forces the model to summarize, which empirically improves sample efficiency and reduces the action transformer's attention cost from $O(N^2)$ to $O(K \cdot N)$, where $N$ is the length of the full perceptual + language latent and $N \gg K$.
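As a sketch, the cognition-token bottleneck can be read as learnable-query cross-attention over the fused latent; the single-layer structure, dimensions, and normalization below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CognitionTokens(nn.Module):
    def __init__(self, K=64, d=1024, n_heads=16):
        super().__init__()
        # K learnable queries, shared across all inputs.
        self.queries = nn.Parameter(torch.randn(K, d) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, latent):                  # latent: (B, N, d), with N >> K
        q = self.queries.unsqueeze(0).expand(latent.shape[0], -1, -1)
        # Each query attends over the full perceptual + language latent,
        # compressing it into K task-relevant tokens: cost O(K * N), not O(N^2).
        c, _ = self.cross_attn(q, latent, latent)
        return self.norm(c)                     # (B, K, d)
```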

Multi-stream attention

The action transformer is not a flat transformer. It has three parallel token streams that interleave in attention:

  • VL stream. Cognition tokens from the backbone.
  • SA stream. State and noised-action tokens.
  • P stream. Physics tokens (tactile, force, torque), when available.

Within each stream, attention is dense. Across streams, attention is mediated through a learned gate. This multi-stream design separates the role of each input class while still permitting the cross-modal binding necessary for grounded action.

Each block of the action transformer is either a DoubleStream block (VL + SA, when no physics) or a TripleStream block (VL + SA + P). The two block types share parameters where shapes allow.
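An illustrative sketch of one such block, assuming dense self-attention within each stream and a scalar learned gate on the cross-stream path; the gate parameterization and parameter-sharing scheme are assumptions, not the documented design:

```python
import torch
import torch.nn as nn

class TripleStreamBlock(nn.Module):
    def __init__(self, d=512, n_heads=8, streams=("vl", "sa", "p")):
        super().__init__()
        self.streams = streams
        self.self_attn = nn.ModuleDict(
            {s: nn.MultiheadAttention(d, n_heads, batch_first=True) for s in streams})
        self.cross_attn = nn.ModuleDict(
            {s: nn.MultiheadAttention(d, n_heads, batch_first=True) for s in streams})
        # One learned gate per stream; initialized closed (tanh(0) = 0).
        self.gate = nn.ParameterDict({s: nn.Parameter(torch.zeros(1)) for s in streams})

    def forward(self, tokens):                    # tokens: {stream_name: (B, N_s, d)}
        out = {}
        for s in self.streams:
            x = tokens[s]
            x = x + self.self_attn[s](x, x, x)[0]   # dense attention within the stream
            others = torch.cat([tokens[o] for o in self.streams if o != s], dim=1)
            # Cross-modal binding, scaled by the learned gate.
            x = x + torch.tanh(self.gate[s]) * self.cross_attn[s](x, others, others)[0]
            out[s] = x
        return out
```

A DoubleStream block is the same structure restricted to the VL and SA streams.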

Flow matching as the action head

The action head is a continuous-time flow matching model. Given a noise prior $a_1 \sim \mathcal{N}(0, I)$ and a clean target $a_0$, the conditional probability path is the straight-line interpolation

$$a_t = (1 - t)\, a_0 + t\, a_1, \qquad t \in [0, 1]$$

with target velocity

$$v_t = \frac{da_t}{dt} = a_1 - a_0.$$

The model $\hat{v}_\theta(a_t, t, c, s)$ is trained to predict $v_t$ under the flow-matching loss

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, a_0 \sim \mathcal{D},\, a_1 \sim \mathcal{N}(0,I)} \left\| \hat{v}_\theta(a_t, t, c, s) - (a_1 - a_0) \right\|_2^2.$$
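A minimal sketch of this objective, with the model called as $\hat{v}_\theta(a_t, t, c, s)$ per the notation above; batching details are illustrative:

```python
import torch

def flow_matching_loss(v_theta, a0, c, s):
    """a0: clean action chunks (B, H, d_a); c: cognition tokens; s: state."""
    B = a0.shape[0]
    t = torch.rand(B, device=a0.device)               # t ~ U[0, 1]
    a1 = torch.randn_like(a0)                          # a_1 ~ N(0, I)
    t_ = t.view(B, 1, 1)
    a_t = (1 - t_) * a0 + t_ * a1                      # straight-line probability path
    v_target = a1 - a0                                 # constant velocity along the path
    v_pred = v_theta(a_t, t, c, s)
    return ((v_pred - v_target) ** 2).mean()           # L_FM
```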

At inference, the clean action is recovered by integrating the learned velocity backward in time:

$$a_0 = a_1 - \int_0^1 \hat{v}_\theta(a_t, t, c, s)\, dt.$$

The integral is approximated by Euler steps; empirically, four steps suffice. This gives the architecture its characteristic low-step inference.
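A sketch of the four-step Euler integration, assuming a uniform step schedule from $t = 1$ down to $t = 0$:

```python
import torch

@torch.no_grad()
def solve_flow(v_theta, c, s, chunk_shape, n_steps=4):
    a = torch.randn(chunk_shape)                 # start from a_1 ~ N(0, I) at t = 1
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((chunk_shape[0],), 1.0 - i * dt)
        v = v_theta(a, t, c, s)                  # predicted velocity at (a_t, t)
        a = a - dt * v                           # one Euler step of a_0 = a_1 - ∫ v dt
    return a                                     # approximately clean action chunk a_0
```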

Action chunking

The policy emits actions in chunks of $H$ steps (typically $H = 16$). Within a chunk, the action is denoised jointly: the model sees the entire chunk at once, not one step at a time. This is critical for two reasons:

  • Temporal consistency. Joint denoising produces smoother trajectories than autoregressive emission.
  • Throughput. A single forward pass emits 16 steps, amortizing inference cost across the control loop.

Chunks overlap at their boundaries; the real-time chunking (RTC) protocol resolves the overlap to maintain continuous motion. Two RTC modes are supported:

  • Guided. The next chunk is conditioned on the trailing actions of the current chunk (a minimal sketch follows this list).
  • Trained. The model is trained with chunk-boundary masking and emits boundary-aware chunks natively.
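A minimal sketch of the guided mode, assuming the overlap is handled by inpainting: the trailing actions of the executed chunk are re-noised to the current flow time and pinned while the next chunk is denoised. The overlap length and the inpainting mechanism are illustrative assumptions, not the documented protocol.

```python
import torch

@torch.no_grad()
def guided_next_chunk(v_theta, c, s, prev_chunk, overlap=4, n_steps=4):
    B, H, d_a = prev_chunk.shape
    trailing = prev_chunk[:, -overlap:]           # actions the new chunk must continue
    a = torch.randn(B, H, d_a)                    # fresh noised chunk at t = 1
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((B,), 1.0 - i * dt)
        t_ = t.view(B, 1, 1)
        # Pin the overlap region to the trailing actions, re-noised to flow time t,
        # so the free region is denoised consistently with the executed motion.
        a[:, :overlap] = (1 - t_) * trailing + t_ * torch.randn_like(trailing)
        a = a - dt * v_theta(a, t, c, s)
    a[:, :overlap] = trailing                     # exact continuity at the boundary
    return a
```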

Latency budget

The architecture targets:

Stage                            Latency
Perceptual encoding              ≤ 8 ms
Backbone forward                 ≤ 18 ms
Action transformer (4 steps)     ≤ 14 ms
Decoder + serialization          ≤ 4 ms
End-to-end                       ≤ 44 ms

At 44 ms, the effective control frequency is ≈ 23 Hz, sufficient for dexterous manipulation. Compiled-graph implementations on current consumer-grade accelerators meet this budget.

Verification

Action outputs are verified by three mechanisms:

  1. Loss-curve attestation. During training, providers commit to a hash of the per-step loss curve. Replays must produce loss within an $\varepsilon$-band.
  2. Eval-suite acceptance. Held-out tasks (manipulation suites, locomotion benchmarks) yield scalar scores that gate publication.
  3. Closed-loop divergence. At inference, the predicted next state (from the World Model) is compared to the observed next state. Persistent divergence flags the validator (a sketch follows this list).
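An illustrative sketch of the closed-loop divergence check; the error norm, window length, and thresholds are placeholder choices, and the World Model is assumed to expose a predicted next state:

```python
import numpy as np
from collections import deque

class DivergenceMonitor:
    def __init__(self, threshold=0.05, window=50, min_fraction=0.8):
        self.threshold = threshold            # per-step divergence tolerance
        self.errors = deque(maxlen=window)    # rolling window of recent errors
        self.min_fraction = min_fraction      # fraction of window that must exceed it

    def update(self, predicted_next_state, observed_next_state):
        # Compare World Model prediction against the observed next state.
        err = np.linalg.norm(predicted_next_state - observed_next_state)
        self.errors.append(err)
        # Persistent (not transient) divergence raises the flag for the validator.
        exceed = sum(e > self.threshold for e in self.errors)
        return (len(self.errors) == self.errors.maxlen
                and exceed >= self.min_fraction * self.errors.maxlen)
```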

See Validators for the economic side.