The VLA Architecture
The Vision-Language-Action (VLA) architecture is the specific instantiation of the embodiment surface used by OGI. It is documented here in enough detail that an independent reader could re-implement it.
High-level structure
The architecture has four blocks:
- Perceptual encoders. Multi-camera vision + proprioception + tactile + force/torque encoders, each mapping its input into the shared latent.
- Cognitive backbone. A transformer stack that fuses perceptual tokens, the language instruction, and a small set of cognition tokens: fixed-count learnable queries that absorb task-relevant information from the rest of the latent.
- Action transformer. A multi-stream transformer that consumes cognition tokens, current state, and a noised action chunk, and denoises the action chunk to a clean motor plan via flow matching.
- Embodiment-conditioned decoder. A small MLP head per embodiment family that maps the clean action chunk into the embodiment's native command space.
Formally, with cognition tokens $z$, state $s_t$, and action chunk $A_t = (a_t, \dots, a_{t+H-1})$:

$$\hat{A}_t = A_t^{1} - \int_0^1 v_\theta\!\left(A_t^{\tau}, \tau \mid z, s_t\right) d\tau, \qquad u_t = g_e\!\left(\hat{A}_t\right),$$

where $A_t^{\tau}$ is the noised action chunk at flow time $\tau$, $v_\theta$ is the predicted velocity field, $\hat{A}_t$ is the clean action, and $u_t$ is the embodiment-native command emitted to the robot, produced by the decoder head $g_e$ for embodiment family $e$.
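For concreteness, a minimal PyTorch sketch of this dataflow follows. Every name and dimension in it (`ToyVLA`, `percept`, `z_queries`, the sizes `D`, `K`, `H`) is an illustrative assumption rather than part of the specification, and the multi-stream structure of the action transformer is elided here:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; this section of the spec does not pin them down.
D, K, H, A_DIM, S_DIM = 256, 32, 16, 7, 14

def _encoder():
    layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

class ToyVLA(nn.Module):
    """Toy pass through the four blocks; the real model is far larger."""
    def __init__(self):
        super().__init__()
        self.percept = nn.Linear(64, D)                      # stand-in perceptual encoders
        self.backbone = _encoder()                           # cognitive backbone
        self.z_queries = nn.Parameter(torch.randn(1, K, D))  # cognition tokens (learnable queries)
        self.xattn = nn.MultiheadAttention(D, 8, batch_first=True)
        self.action_tf = _encoder()                          # action transformer (flat stand-in)
        self.state_proj = nn.Linear(S_DIM, D)
        self.act_proj = nn.Linear(A_DIM, D)
        self.time_proj = nn.Linear(1, D)                     # embeds flow time tau
        self.vel_head = nn.Linear(D, A_DIM)                  # predicted velocity field
        self.decoders = nn.ModuleDict({"arm7": nn.Linear(A_DIM, 7)})  # one head per embodiment family

    def forward(self, obs, lang, state, noised_chunk, tau):
        lat = self.backbone(torch.cat([self.percept(obs), lang], dim=1))  # fused latent
        z, _ = self.xattn(self.z_queries.expand(obs.shape[0], -1, -1), lat, lat)
        a_tok = self.act_proj(noised_chunk) + self.time_proj(tau[:, None])[:, None, :]
        h = self.action_tf(torch.cat([z, self.state_proj(state)[:, None], a_tok], dim=1))
        return self.vel_head(h[:, -H:])                      # v_theta(A^tau, tau | z, s_t)

    def decode(self, clean_chunk, embodiment="arm7"):
        return self.decoders[embodiment](clean_chunk)        # u_t = g_e(A_hat_t)

m = ToyVLA()
v = m(torch.randn(2, 40, 64), torch.randn(2, 8, D), torch.randn(2, S_DIM),
      torch.randn(2, H, A_DIM), torch.rand(2))               # -> (2, 16, 7)
```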
Cognition tokens
Cognition tokens are a fixed-count set of $K$ learnable queries that cross-attend to the full perceptual + language latent and absorb a compressed representation of the task. They are the only output of the backbone consumed by the action transformer.
This is a deliberate bottleneck. It forces the model to summarize, which empirically improves sample efficiency and reduces the action transformer's attention cost from $O(N^2)$ to $O(K^2)$, where $N$ is the number of perceptual + language tokens and $K \ll N$.
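A minimal sketch of the bottleneck in isolation, assuming standard multi-head cross-attention (the class name, sizes, and initialization are our own):

```python
import torch
import torch.nn as nn

class CognitionBottleneck(nn.Module):
    """K learnable queries cross-attend to the N-token latent; only the K
    summary tokens are passed on to the action transformer."""
    def __init__(self, d_model=256, k=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, k, d_model) * d_model ** -0.5)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, latent):                 # latent: (B, N, d_model), N large
        q = self.queries.expand(latent.shape[0], -1, -1)
        z, _ = self.attn(q, latent, latent)    # (B, K, d_model)
        return z

# Downstream self-attention over z costs O(K^2) per layer instead of O(N^2).
```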
Multi-stream attention
The action transformer is not a flat transformer. It has three parallel token streams that interleave in attention:
- VL stream, cognition tokens from the backbone.
- SA stream, state and noised-action tokens.
- P stream, physics tokens (tactile, force, torque) when available.
Within each stream, attention is dense. Across streams, attention is mediated through a learned gate. This multi-stream design separates the role of each input class while still permitting the cross-modal binding necessary for grounded action.
Each block of the action transformer is either a DoubleStream block (VL + SA, when no physics) or a TripleStream block (VL + SA + P). The two block types share parameters where shapes allow.
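The text above fixes the topology (dense within-stream attention, gated cross-stream attention) but not the gate's exact form. The sketch below assumes a tanh-scaled scalar gate per ordered stream pair, a common gating choice; omitting the P stream from the input degenerates it to a DoubleStream block:

```python
import torch
import torch.nn as nn

class TripleStreamBlock(nn.Module):
    """Dense self-attention within each stream; cross-stream attention scaled
    by a learned tanh gate per ordered pair (the gate's form is our assumption)."""
    STREAMS = ("vl", "sa", "p")

    def __init__(self, d=256, heads=8):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_attn = nn.ModuleDict({s: mha() for s in self.STREAMS})
        pairs = [f"{a}_{b}" for a in self.STREAMS for b in self.STREAMS if a != b]
        self.cross_attn = nn.ModuleDict({p: mha() for p in pairs})
        # Gates start at zero (closed), so early training matches pure within-stream behavior.
        self.gates = nn.ParameterDict({p: nn.Parameter(torch.zeros(1)) for p in pairs})

    def forward(self, x):                      # x: {"vl": (B,T,d), "sa": (B,T,d), "p": (B,T,d)}
        # Within each stream, attention is dense.
        h = {s: x[s] + self.self_attn[s](x[s], x[s], x[s])[0] for s in x}
        # Across streams, attention is mediated by the learned gates; passing only
        # "vl" and "sa" yields the DoubleStream variant.
        out = {}
        for dst in h:
            acc = h[dst]
            for src in h:
                if src != dst:
                    p = f"{src}_{dst}"
                    acc = acc + torch.tanh(self.gates[p]) * self.cross_attn[p](h[dst], h[src], h[src])[0]
            out[dst] = acc
        return out
```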
Flow matching as the action head
The action head is a continuous-time flow matching model. Given a noise prior $\epsilon \sim \mathcal{N}(0, I)$ and a clean target $A^0$, the conditional probability path is the straight-line interpolation

$$A^{\tau} = (1 - \tau)\, A^0 + \tau\, \epsilon, \qquad \tau \in [0, 1],$$

with target velocity

$$u^{\tau} = \frac{dA^{\tau}}{d\tau} = \epsilon - A^0.$$

The model is trained to predict $u^{\tau}$ under the flow-matching loss

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau,\, A^0,\, \epsilon} \left\| v_\theta\!\left(A^{\tau}, \tau \mid z, s\right) - \left(\epsilon - A^0\right) \right\|^2.$$

At inference, the clean action is recovered by integrating the learned velocity backward in time from $\tau = 1$ to $\tau = 0$:

$$\hat{A}^0 = \epsilon - \int_0^1 v_\theta\!\left(A^{\tau}, \tau \mid z, s\right) d\tau, \qquad \epsilon \sim \mathcal{N}(0, I).$$
The integral is approximated with Euler steps, $A^{\tau - \delta} = A^{\tau} - \delta\, v_\theta(A^{\tau}, \tau \mid z, s)$; empirically, four steps ($\delta = 1/4$) suffice. This gives the architecture its characteristic low-step inference.
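A minimal sketch of both sides, assuming a velocity model with signature `model(a_tau, tau, z, s)` returning $v_\theta$ (the signature and helper names are ours, not the spec's):

```python
import torch

def fm_loss(model, a0, z, s):
    """Flow-matching loss: regress v_theta onto (eps - A^0) along the straight-line path."""
    eps = torch.randn_like(a0)                       # noise endpoint, tau = 1
    tau = torch.rand(a0.shape[0], device=a0.device)  # flow time ~ U[0, 1]
    a_tau = (1 - tau)[:, None, None] * a0 + tau[:, None, None] * eps
    v = model(a_tau, tau, z, s)
    return ((v - (eps - a0)) ** 2).mean()

@torch.no_grad()
def sample_chunk(model, z, s, shape, steps=4):
    """Integrate the learned velocity backward from tau = 1 to tau = 0 with Euler steps."""
    a = torch.randn(shape, device=z.device)          # start at the noise prior
    delta = 1.0 / steps
    for i in range(steps):
        tau = torch.full((shape[0],), 1.0 - i * delta, device=z.device)
        a = a - delta * model(a, tau, z, s)          # A^{tau - delta} = A^tau - delta * v_theta
    return a                                         # approximately the clean chunk A^0
```

With the exact target velocity $\epsilon - A^0$, even a single Euler step recovers $A^0$ exactly, which is why so few steps suffice when the learned field is close to straight.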
Action chunking
The policy emits actions in chunks of $H$ steps (typically $H = 16$). Within a chunk, actions are denoised jointly: the model sees the entire chunk at once, not one step at a time. This is critical for two reasons:
- Temporal consistency. Joint denoising produces smoother trajectories than autoregressive emission.
- Throughput. A single forward pass emits 16 steps, amortizing inference cost across the control loop.
Chunks overlap at their boundaries; the real-time chunking (RTC) protocol resolves the overlap to maintain continuous motion. Two RTC modes are supported:
- Guided, the next chunk is conditioned on the trailing actions of the current chunk (a sketch follows this list).
- Trained, the model is trained with chunk-boundary masking and emits boundary-aware chunks natively.
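One way to realize the guided mode is inpainting-style conditioning during denoising: pin the overlap region to the trailing actions of the current chunk at each Euler step. The sketch below assumes this mechanism and reuses the conventions of the flow-matching sketch above; the spec states the conditioning, not the exact mechanism:

```python
import torch

@torch.no_grad()
def sample_chunk_guided(model, z, s, shape, tail, steps=4):
    """Guided RTC sketch: denoise the next chunk while pinning its first
    `overlap` steps to `tail`, the trailing actions of the current chunk."""
    overlap = tail.shape[1]
    a = torch.randn(shape, device=z.device)
    delta = 1.0 / steps
    for i in range(steps):
        tau = torch.full((shape[0],), 1.0 - i * delta, device=z.device)
        # Re-noise the known tail to the current flow time and pin the overlap region.
        a[:, :overlap] = (1 - tau)[:, None, None] * tail \
                         + tau[:, None, None] * torch.randn_like(tail)
        a = a - delta * model(a, tau, z, s)
    a[:, :overlap] = tail                            # exact continuity at the boundary
    return a
```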
Latency budget
The architecture targets:
| Stage | Latency |
|---|---|
| Perceptual encoding | |
| Backbone forward | |
| Action transformer (4 steps) | |
| Decoder + serialization | |
| End-to-end | 44 ms |
At 44 ms end-to-end, the effective control frequency is $1 / 0.044\,\mathrm{s} \approx 23\,\mathrm{Hz}$, sufficient for dexterous manipulation. Compiled-graph implementations on current consumer-grade accelerators meet this budget.
Verification
Action outputs are verified by three mechanisms:
- Loss-curve attestation. During training, providers commit to a hash of the per-step loss curve. Replays must reproduce the loss within an $\varepsilon$-band (sketched after this list).
- Eval-suite acceptance. Held-out tasks (manipulation suites, locomotion benchmarks) yield scalar scores that gate publication.
- Closed-loop divergence. At inference, predicted next-state (from the World Model) is compared to observed next-state. Persistent divergence flags the validator.
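A minimal sketch of the loss-curve attestation, assuming SHA-256 over a quantized curve and a per-step $\varepsilon$ tolerance; the spec fixes neither the hash, the quantization, nor the band's exact definition:

```python
import hashlib
import struct

def attest_loss_curve(losses, decimals=4):
    """Commitment: SHA-256 over the quantized per-step loss curve. Quantization
    keeps the hash stable under benign numeric noise (our assumption)."""
    blob = b"".join(struct.pack("<d", round(x, decimals)) for x in losses)
    return hashlib.sha256(blob).hexdigest()

def replay_within_band(committed, replayed, eps=1e-2):
    """Acceptance: every replayed step's loss lies within an eps-band of the
    committed curve."""
    return len(committed) == len(replayed) and all(
        abs(c - r) <= eps for c, r in zip(committed, replayed))

# The provider publishes attest_loss_curve(curve); a validator re-runs training,
# checks the revealed curve against the hash, and accepts iff
# replay_within_band(curve, replayed_curve) holds.
```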
See Validators for the economic side.