OGI

Multimodal Fusion

Multimodal fusion is the surface that decides how inputs from different encoders combine in the shared latent. It is not a separate model; it is a set of conventions and a small routing network that govern how heterogeneous tokens interleave.

The fusion problem

A system that ingests language, vision, audio, and proprioception simultaneously faces a routing problem: at any given step, which modality should the next attention operation prioritize? Naive interleaving (concatenating all modalities into one sequence) is computationally expensive and tends to let the highest-bandwidth modality dominate.

OGI's solution is typed slots.

Typed slots

The shared latent is partitioned into typed regions. Each region accepts tokens from a single encoder family. Cross-region attention is allowed but is mediated by a small (60M-parameter) routing network that learns when cross-modal attention is productive.

[<lang>  L L L L  <vision>  V V V V V V V V  <prop>  P P  <plan>  X X X ]

Within each typed region, attention is local-dense. Across regions, attention is global-sparse and routed.
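This layout can be sketched concretely. The code below is an illustrative sketch, not the implementation: the region names and lengths are taken from the diagram above, and the `cross_gate` matrix is a stand-in for the router's per-step output. It builds a boolean attention mask that is dense within each typed region and enabled across regions only where gated:

```python
import numpy as np

# Hypothetical region layout mirroring the diagram above: (name, length).
REGIONS = [("lang", 4), ("vision", 8), ("prop", 2), ("plan", 3)]

def fusion_mask(regions, cross_gate):
    """Boolean attention mask: dense within each typed region,
    cross-region edges enabled only where cross_gate[i, j] is True."""
    n = sum(length for _, length in regions)
    mask = np.zeros((n, n), dtype=bool)
    starts = np.cumsum([0] + [length for _, length in regions])
    for i, (_, li) in enumerate(regions):
        for j, (_, lj) in enumerate(regions):
            if i == j or cross_gate[i, j]:
                mask[starts[i]:starts[i] + li, starts[j]:starts[j] + lj] = True
    return mask

# Example step: only vision -> lang cross-attention is enabled.
gate = np.zeros((4, 4), dtype=bool)
gate[1, 0] = True  # vision region attends to lang region
mask = fusion_mask(REGIONS, gate)
```

With this gate, a vision token can attend to a lang token, but not the reverse; within-region attention is always dense.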

The router

The fusion router is a transformer with a single attention layer and a learned gate. At each forward pass it computes, for every pair (region_i, region_j), a binary mask indicating whether attention from i to j is enabled at that step.

Empirically, the router enables approximately 38% of possible cross-modal attention edges in the steady state. The remainder are pruned, saving compute proportionally.

The router's gating decisions are observable, logged, and contribute to interpretability tooling. They are not the only path between modalities (direct token references via the latent are always available), but they shape the dominant flow.
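A minimal sketch of such a router, assuming pooled per-region summary vectors as input. `FusionRouter`, its dimensions, and the sigmoid threshold are illustrative stand-ins for the learned 60M-parameter network; the random weights stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

class FusionRouter:
    """Sketch: pooled region summaries go through one attention-style
    scoring step, and a sigmoid gate thresholds each (i, j) region pair
    into a binary enable/disable decision for this forward pass."""

    def __init__(self, n_regions, d=16, threshold=0.5):
        self.Wq = rng.normal(size=(d, d)) / np.sqrt(d)
        self.Wk = rng.normal(size=(d, d)) / np.sqrt(d)
        self.bias = rng.normal(size=(n_regions, n_regions))  # learned gate bias
        self.threshold = threshold
        self.log = []  # gating decisions are observable and logged

    def __call__(self, region_summaries):
        q = region_summaries @ self.Wq
        k = region_summaries @ self.Wk
        scores = q @ k.T / np.sqrt(q.shape[-1]) + self.bias
        gate = 1.0 / (1.0 + np.exp(-scores))  # sigmoid gate in [0, 1]
        mask = gate > self.threshold          # binary mask for this step
        np.fill_diagonal(mask, True)          # within-region always enabled
        self.log.append(mask.copy())
        return mask
```

The logged masks are what the interpretability tooling described above would consume.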

Modality embeddings

Each token in the latent carries a modality embedding indicating its origin. The modality embedding is a single learned vector per origin (lang, vision, audio, proprioception, plan, memory, motor) added to the token at injection time.

This is what makes the latent uniformly indexable: any consumer surface can attend to any token regardless of origin, with the modality embedding providing the necessary type information.
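Injection can be sketched as follows. The latent width and the random embedding values are placeholders for the learned vectors; the origin list matches the one given above:

```python
import numpy as np

D = 8  # latent width (illustrative)
ORIGINS = ["lang", "vision", "audio", "proprioception", "plan", "memory", "motor"]

# One learned vector per origin; random stand-ins for the learned values.
rng = np.random.default_rng(0)
modality_embeddings = {name: rng.normal(size=D) for name in ORIGINS}

def inject(tokens, origin):
    """Add the origin's modality embedding to every token at injection time."""
    return tokens + modality_embeddings[origin]

vision_tokens = rng.normal(size=(5, D))
latent_tokens = inject(vision_tokens, "vision")
```

After injection, every token carries its type information additively, so a consumer surface can attend across origins without any per-origin code path.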

When fusion fails

Multimodal models fail in characteristic ways. OGI's fusion surface is engineered against three known failure modes:

Modality collapse. One modality (typically language) dominates and the others become functionally inert. Mitigation: explicit modality dropout during training; the model must produce competent outputs when any single modality is masked.
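Modality dropout can be sketched as below. The batch representation (a mapping from modality name to token array), the drop probability, and zeroing as the masking scheme are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_dropout(batch, p=0.15):
    """During training, mask each modality's token block with probability p,
    forcing the model to stay competent when any single modality is absent.
    `batch` maps modality name -> token array."""
    out = {}
    for name, tokens in batch.items():
        if rng.random() < p:
            out[name] = np.zeros_like(tokens)  # modality masked this step
        else:
            out[name] = tokens
    return out
```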

Spurious correlation. The model learns to predict a target from one modality when the signal actually lives in another. Mitigation: causal interventions during evaluation, in which the suspected confounding modality is held constant.

Catastrophic interference. Adding a new modality degrades existing modalities. Mitigation: new modalities are introduced as adapter modules first, then merged into the base only after passing regression suites on all prior modalities.
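The adapter-first step can be sketched as a bottleneck projection into the shared latent, trainable while the backbone stays frozen. `ModalityAdapter`, its dimensions, and the ReLU bottleneck are illustrative assumptions, not the system's actual adapter design:

```python
import numpy as np

rng = np.random.default_rng(0)

class ModalityAdapter:
    """Sketch: a new modality's tokens enter through a small bottleneck
    projection into the shared latent. Only these weights train at first;
    the base merges them in only after regression suites pass."""

    def __init__(self, d_in, d_latent, d_bottleneck=4):
        self.down = rng.normal(size=(d_in, d_bottleneck)) * 0.02
        self.up = rng.normal(size=(d_bottleneck, d_latent)) * 0.02

    def __call__(self, x):
        return np.maximum(x @ self.down, 0.0) @ self.up  # ReLU bottleneck
```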

Adding a modality

The fusion surface is designed to accept new modalities without retraining the cognitive backbone. The procedure:

  1. Train a new encoder against the shared latent using a contrastive objective with at least one anchor modality.
  2. Allocate a typed slot and modality embedding.
  3. Verify on the cross-modal eval suite.
  4. Submit a governance proposal to add the modality to the active routing table.
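Step 1's objective can be sketched as an InfoNCE-style loss between the new encoder's embeddings and an anchor modality's embeddings for the same underlying events. The specific loss form and temperature are assumptions; the source specifies only "a contrastive objective":

```python
import numpy as np

def info_nce(new_emb, anchor_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: the i-th new-modality embedding
    should match the i-th anchor embedding and repel the rest.
    Inputs are (batch, d) arrays; both are L2-normalized here."""
    a = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    b = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Correctly paired embeddings drive the loss toward zero; mismatched pairings drive it up, which is what pulls the new encoder into alignment with the anchor.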

Adding a modality is therefore an additive operation. Removing one is not; weights, eval data, and routing rules trained on a deprecated modality remain in the system.

Interaction with embodiment

For embodied control, the fusion surface is the integration point between vision, proprioception, and tactile signals. The Multi-Stream Action Transformer described in The VLA Architecture is a specialization of the fusion surface for the low-latency motor regime.