
The VLA Architecture

The Vision-Language-Action (VLA) architecture is the specific instantiation of the embodiment surface used by OGI. It is documented here in enough detail that an independent reader could re-implement it.

High-level structure

The architecture has four blocks:

  1. Perceptual encoders. Multi-camera vision + proprioception + tactile + force/torque encoders, each mapping its input into the shared latent.
  2. Cognitive backbone. A transformer stack that fuses perceptual tokens, the language instruction, and a small set of cognition tokens: fixed-count learnable queries that absorb task-relevant information from the rest of the latent.
  3. Action transformer. A multi-stream transformer that consumes cognition tokens, current state, and a noised action chunk, and denoises the action chunk to a clean motor plan via flow matching.
  4. Embodiment-conditioned decoder. A small MLP head per embodiment family that maps the clean action chunk into the embodiment's native command space.

Formally, with cognition tokens $c \in \mathbb{R}^{K \times d}$, state $s \in \mathbb{R}^{d_s}$, and action chunk $a \in \mathbb{R}^{H \times d_a}$:

$$c = \text{Backbone}(o, g)$$
$$\hat{v} = \text{ActionTransformer}(c, s, a_t, t)$$
$$a_0 = \text{ODESolver}(\hat{v}, a_1, t = 1 \to 0)$$
$$u = \text{Decoder}_e(a_0)$$

where $o$ is the multi-modal observation, $g$ is the language instruction, $a_t$ is the noised action chunk at flow time $t \in [0,1]$, $\hat{v}$ is the predicted velocity field, $a_0$ is the clean action chunk, and $u$ is the embodiment-native command emitted to the robot.
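A minimal sketch of this four-stage inference path in PyTorch-style code; the module names, tensor shapes, and chunk dimensions below are illustrative placeholders, not the reference implementation:

```python
import torch

@torch.no_grad()
def vla_step(encoders, backbone, action_transformer, decoders,
             obs, instruction, state, embodiment, H=16, d_a=32, n_steps=4):
    # 1. Perceptual encoders: each modality is mapped into the shared latent.
    tokens = torch.cat([enc(obs[name]) for name, enc in encoders.items()], dim=1)

    # 2. Cognitive backbone: fuse perception + language into K cognition tokens.
    c = backbone(tokens, instruction)                        # (B, K, d)

    # 3. Action transformer + flow matching: denoise a_1 -> a_0 over the chunk.
    a = torch.randn(c.shape[0], H, d_a, device=c.device)     # a_1 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((c.shape[0],), 1.0 - i * dt, device=c.device)
        v = action_transformer(c, state, a, t)                # predicted velocity field
        a = a - dt * v                                        # Euler step toward t = 0

    # 4. Embodiment-conditioned decoder: clean chunk -> native command space.
    return decoders[embodiment](a)
```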

Cognition tokens

Cognition tokens are a fixed-count set of learnable queries (typically $K = 64$) that cross-attend to the full perceptual + language latent and absorb a compressed representation of the task. They are the only output of the backbone consumed by the action transformer.

This is a deliberate bottleneck. It forces the model to summarize, which empirically improves sample efficiency and reduces the action transformer's attention cost from $O(N^2)$ to $O(K \cdot N)$, where $N$ is the length of the full perceptual + language latent and $N \gg K$.
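As a sketch, the cognition-token bottleneck can be read as learnable-query cross-attention over the fused latent; the single-layer structure, dimensions, and normalization below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CognitionTokens(nn.Module):
    def __init__(self, K=64, d=1024, n_heads=16):
        super().__init__()
        # K learnable queries, shared across all inputs.
        self.queries = nn.Parameter(torch.randn(K, d) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, latent):                  # latent: (B, N, d), with N >> K
        q = self.queries.unsqueeze(0).expand(latent.shape[0], -1, -1)
        # Each query attends over the full perceptual + language latent,
        # compressing it into K task-relevant tokens: cost O(K * N), not O(N^2).
        c, _ = self.cross_attn(q, latent, latent)
        return self.norm(c)                     # (B, K, d)
```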

Multi-stream attention

The action transformer is not a flat transformer. It has three parallel token streams that interleave in attention:

  • VL stream. Cognition tokens from the backbone.
  • SA stream. State and noised-action tokens.
  • P stream. Physics tokens (tactile, force, torque), when available.

Within each stream, attention is dense. Across streams, attention is mediated through a learned gate. This multi-stream design separates the role of each input class while still permitting the cross-modal binding necessary for grounded action.

Each block of the action transformer is either a DoubleStream block (VL + SA, when no physics) or a TripleStream block (VL + SA + P). The two block types share parameters where shapes allow.
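An illustrative sketch of one such block, assuming dense self-attention within each stream and a scalar learned gate on the cross-stream path; the gate parameterization and parameter-sharing scheme are assumptions, not the documented design:

```python
import torch
import torch.nn as nn

class TripleStreamBlock(nn.Module):
    def __init__(self, d=512, n_heads=8, streams=("vl", "sa", "p")):
        super().__init__()
        self.streams = streams
        self.self_attn = nn.ModuleDict(
            {s: nn.MultiheadAttention(d, n_heads, batch_first=True) for s in streams})
        self.cross_attn = nn.ModuleDict(
            {s: nn.MultiheadAttention(d, n_heads, batch_first=True) for s in streams})
        # One learned gate per stream; initialized closed (tanh(0) = 0).
        self.gate = nn.ParameterDict({s: nn.Parameter(torch.zeros(1)) for s in streams})

    def forward(self, tokens):                    # tokens: {stream_name: (B, N_s, d)}
        out = {}
        for s in self.streams:
            x = tokens[s]
            x = x + self.self_attn[s](x, x, x)[0]   # dense attention within the stream
            others = torch.cat([tokens[o] for o in self.streams if o != s], dim=1)
            # Cross-modal binding, scaled by the learned gate.
            x = x + torch.tanh(self.gate[s]) * self.cross_attn[s](x, others, others)[0]
            out[s] = x
        return out
```

A DoubleStream block is the same structure restricted to the VL and SA streams.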

Flow matching as the action head

The action head is a continuous-time flow matching model. Given a noise prior $a_1 \sim \mathcal{N}(0, I)$ and a clean target $a_0$, the conditional probability path is the straight-line interpolation

$$a_t = (1 - t)\, a_0 + t\, a_1, \qquad t \in [0, 1]$$

with target velocity

$$v_t = \frac{da_t}{dt} = a_1 - a_0.$$

The model $\hat{v}_\theta(a_t, t, c, s)$ is trained to predict $v_t$ under the flow-matching loss

$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, a_0 \sim \mathcal{D},\, a_1 \sim \mathcal{N}(0,I)} \left\| \hat{v}_\theta(a_t, t, c, s) - (a_1 - a_0) \right\|_2^2.$$
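A minimal sketch of this objective, with the model called as $\hat{v}_\theta(a_t, t, c, s)$ per the notation above; batching details are illustrative:

```python
import torch

def flow_matching_loss(v_theta, a0, c, s):
    """a0: clean action chunks (B, H, d_a); c: cognition tokens; s: state."""
    B = a0.shape[0]
    t = torch.rand(B, device=a0.device)               # t ~ U[0, 1]
    a1 = torch.randn_like(a0)                          # a_1 ~ N(0, I)
    t_ = t.view(B, 1, 1)
    a_t = (1 - t_) * a0 + t_ * a1                      # straight-line probability path
    v_target = a1 - a0                                 # constant velocity along the path
    v_pred = v_theta(a_t, t, c, s)
    return ((v_pred - v_target) ** 2).mean()           # L_FM
```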

At inference, the clean action is recovered by integrating the learned velocity backward in time:

$$a_0 = a_1 - \int_0^1 \hat{v}_\theta(a_t, t, c, s)\, dt.$$

The integral is approximated by Euler steps; empirically, four steps suffice. This gives the architecture its characteristic low-step inference.
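A sketch of the four-step Euler integration, assuming a uniform step schedule from $t = 1$ down to $t = 0$:

```python
import torch

@torch.no_grad()
def solve_flow(v_theta, c, s, chunk_shape, n_steps=4):
    a = torch.randn(chunk_shape)                 # start from a_1 ~ N(0, I) at t = 1
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((chunk_shape[0],), 1.0 - i * dt)
        v = v_theta(a, t, c, s)                  # predicted velocity at (a_t, t)
        a = a - dt * v                           # one Euler step of a_0 = a_1 - ∫ v dt
    return a                                     # approximately clean action chunk a_0
```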

Action chunking

The policy emits actions in chunks of $H$ steps (typically $H = 16$). Within a chunk, the action is denoised jointly: the model sees the entire chunk at once, not one step at a time. This is critical for two reasons:

  • Temporal consistency. Joint denoising produces smoother trajectories than autoregressive emission.
  • Throughput. A single forward pass emits 16 steps, amortizing inference cost across the control loop.

Chunks overlap at their boundaries; the real-time chunking (RTC) protocol resolves the overlap to maintain continuous motion. Two RTC modes are supported:

  • Guided. The next chunk is conditioned on the trailing actions of the current chunk (a minimal sketch follows this list).
  • Trained. The model is trained with chunk-boundary masking and emits boundary-aware chunks natively.
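A minimal sketch of the guided mode, assuming the overlap is handled by inpainting: the trailing actions of the executed chunk are re-noised to the current flow time and pinned while the next chunk is denoised. The overlap length and the inpainting mechanism are illustrative assumptions, not the documented protocol.

```python
import torch

@torch.no_grad()
def guided_next_chunk(v_theta, c, s, prev_chunk, overlap=4, n_steps=4):
    B, H, d_a = prev_chunk.shape
    trailing = prev_chunk[:, -overlap:]           # actions the new chunk must continue
    a = torch.randn(B, H, d_a)                    # fresh noised chunk at t = 1
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((B,), 1.0 - i * dt)
        t_ = t.view(B, 1, 1)
        # Pin the overlap region to the trailing actions, re-noised to flow time t,
        # so the free region is denoised consistently with the executed motion.
        a[:, :overlap] = (1 - t_) * trailing + t_ * torch.randn_like(trailing)
        a = a - dt * v_theta(a, t, c, s)
    a[:, :overlap] = trailing                     # exact continuity at the boundary
    return a
```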

Latency budget

The architecture targets:

Stage                            Latency
Perceptual encoding              ≤ 8 ms
Backbone forward                 ≤ 18 ms
Action transformer (4 steps)     ≤ 14 ms
Decoder + serialization          ≤ 4 ms
End-to-end                       ≤ 44 ms

At 44 ms, the effective control frequency is ≈ 23 Hz, sufficient for dexterous manipulation. Compiled-graph implementations on current consumer-grade accelerators meet this budget.

Verification

Action outputs are verified by three mechanisms:

  1. Loss-curve attestation. During training, providers commit to a hash of the per-step loss curve. Replays must produce loss within an $\varepsilon$-band.
  2. Eval-suite acceptance. Held-out tasks (manipulation suites, locomotion benchmarks) yield scalar scores that gate publication.
  3. Closed-loop divergence. At inference, the predicted next state (from the World Model) is compared to the observed next state. Persistent divergence flags the validator (a sketch follows this list).
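An illustrative sketch of the closed-loop divergence check; the error norm, window length, and thresholds are placeholder choices, and the World Model is assumed to expose a predicted next state:

```python
import numpy as np
from collections import deque

class DivergenceMonitor:
    def __init__(self, threshold=0.05, window=50, min_fraction=0.8):
        self.threshold = threshold            # per-step divergence tolerance
        self.errors = deque(maxlen=window)    # rolling window of recent errors
        self.min_fraction = min_fraction      # fraction of window that must exceed it

    def update(self, predicted_next_state, observed_next_state):
        # Compare World Model prediction against the observed next state.
        err = np.linalg.norm(predicted_next_state - observed_next_state)
        self.errors.append(err)
        # Persistent (not transient) divergence raises the flag for the validator.
        exceed = sum(e > self.threshold for e in self.errors)
        return (len(self.errors) == self.errors.maxlen
                and exceed >= self.min_fraction * self.errors.maxlen)
```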

See Validators for the economic side.