OGI

Vision

The vision surface is responsible for converting pixel arrays and depth maps into shared-latent tokens, and conversely for emitting structured visual outputs (regions, captions, masks) from latent input. It is the largest surface by inbound bandwidth and one of the most latency-sensitive.

Architecture

The vision surface is a hierarchical encoder-decoder transformer with two stages:

Stage 1, Patch encoder. Variable-resolution input is normalized to a sequence of 14×14 patches with rotary spatial position encoding. Each image yields up to 4,096 patches; resolution is allocated dynamically based on detected complexity (aspect-area resize, with a maximum area of 65,536 patches at full pyramid).
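
A minimal sketch of the patch-budget arithmetic follows. The 14-pixel patch size and the 4,096-patch cap come from the description above; the area-based downscaling rule and the helper name are illustrative assumptions, not the actual allocator.

```python
import math

PATCH = 14          # patch side length in pixels (from the text)
MAX_PATCHES = 4096  # per-image patch budget (from the text)

def patch_grid(width: int, height: int, max_patches: int = MAX_PATCHES):
    """Return (cols, rows) after scaling both axes to fit the patch budget.

    Illustrative only: the real allocator also weights by detected complexity.
    """
    cols, rows = math.ceil(width / PATCH), math.ceil(height / PATCH)
    if cols * rows <= max_patches:
        return cols, rows
    # Scale both axes by the same factor so the patch count fits the budget.
    scale = math.sqrt(max_patches / (cols * rows))
    return max(1, int(cols * scale)), max(1, int(rows * scale))

# Example: a 4K still is downscaled to fit the 4,096-patch budget.
print(patch_grid(3840, 2160))  # -> (85, 48), i.e. 4,080 patches
```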

Stage 2, Temporal compressor. For video and image sequences, a 4-layer temporal transformer fuses adjacent frames into a compressed latent stream. The compressor reduces per-frame token count by approximately 6× without observable loss on downstream tasks.
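
The token bookkeeping of the compressor can be sketched as below, assuming a plain mean-pool over groups of frame tokens. The 4 layers and the ~6× reduction are from the description above; the pooling strategy, head count, and module name are assumptions.

```python
import torch
import torch.nn as nn

class TemporalCompressor(nn.Module):
    """Illustrative stand-in: 4 transformer layers, then temporal pooling
    that merges every 6 frame tokens into one (the ~6x reduction)."""

    def __init__(self, d_model: int = 4096, n_layers: int = 4, ratio: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ratio = ratio

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames * patches_per_frame, d_model)
        x = self.encoder(frame_tokens)
        b, n, d = x.shape
        n = (n // self.ratio) * self.ratio                  # drop the ragged tail
        return x[:, :n].reshape(b, n // self.ratio, self.ratio, d).mean(dim=2)

tokens = torch.randn(1, 6 * 256, 512)  # toy sizes; the real d_model is 4096
print(TemporalCompressor(d_model=512)(tokens).shape)  # -> (1, 256, 512)
```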

The final output is a sequence of latent tokens that occupy the same hidden dimension as the language surface (d = 4096).

Input modalities

The vision surface accepts:

Modality | Resolution | Frame rate | Use
RGB still | up to 4K | — | Document, scene, object inputs
RGB video | up to 1080p | up to 60 Hz | Embodiment perception, event detection
RGBD | 720p | up to 30 Hz | 3D scene understanding, grasping
Stereo pairs | 720p × 2 | up to 30 Hz | Depth estimation, locomotion
Point cloud | up to 4M points | — | Spatial reasoning, mapping

All modalities project into the same latent space. Modality is encoded via a learned modality embedding prepended to the input sequence.
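
One plausible realization of the modality embedding is sketched below. The shared latent width (d = 4096) and the prepend convention are stated above; the set of modality identifiers and the module shape are assumptions.

```python
import torch
import torch.nn as nn

MODALITIES = ["rgb_still", "rgb_video", "rgbd", "stereo", "point_cloud"]  # assumed ids

class ModalityTag(nn.Module):
    """Prepend a learned per-modality embedding to an encoded token sequence."""

    def __init__(self, d_model: int = 4096):
        super().__init__()
        self.table = nn.Embedding(len(MODALITIES), d_model)

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq, d_model)
        idx = torch.tensor([MODALITIES.index(modality)], device=tokens.device)
        tag = self.table(idx).expand(tokens.size(0), 1, -1)   # (batch, 1, d_model)
        return torch.cat([tag, tokens], dim=1)                # one extra leading token

tagged = ModalityTag(d_model=64)(torch.randn(2, 10, 64), "rgbd")
print(tagged.shape)  # -> (2, 11, 64)
```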

Outputs

The vision surface emits four kinds of structured output:

  1. Latent tokens. The primary output; consumed by other surfaces.
  2. Region masks. Pixel-level segmentation, queryable by latent reference.
  3. Bounding boxes. Object localization with confidence.
  4. Captions. Routed to the language surface for tokenization.

A single forward pass can produce any subset of these.
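
The subset semantics can be made concrete with a sketch of the output container, assuming optional fields for anything not requested. The four output kinds are from the list above; the field names, shapes, and request flags are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Sequence
import torch

@dataclass
class VisionOutputs:
    """Any field may be None if it was not requested for this pass."""
    latent_tokens: torch.Tensor                     # always produced
    region_masks: Optional[torch.Tensor] = None     # (regions, H, W) pixel masks
    boxes: Optional[torch.Tensor] = None            # (objects, 5): x1, y1, x2, y2, confidence
    captions: Optional[Sequence[str]] = None        # routed to the language surface

def vision_forward(image: torch.Tensor, want: set) -> VisionOutputs:
    # Hypothetical entry point: the decoding heads are omitted; only shapes matter here.
    latents = torch.zeros(1, 4096)
    return VisionOutputs(
        latent_tokens=latents,
        region_masks=torch.zeros(0, *image.shape[-2:]) if "masks" in want else None,
        boxes=torch.zeros(0, 5) if "boxes" in want else None,
        captions=[] if "captions" in want else None,
    )

out = vision_forward(torch.zeros(3, 224, 224), want={"boxes"})
print(out.boxes.shape, out.region_masks)  # -> torch.Size([0, 5]) None
```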

Training signals

The vision surface is trained against five objectives:

  • Masked patch reconstruction on web-scale image-text pairs.
  • Contrastive image-text alignment against the language surface (sketched after this list).
  • Dense prediction (segmentation, depth) on curated annotation datasets.
  • Video prediction for temporal coherence.
  • Validator-contributed embodied video under the data-layer pipeline; this is the unique signal not available to centralized labs.
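
Of these objectives, the contrastive alignment term is the easiest to write down. A minimal symmetric InfoNCE sketch follows, assuming pooled per-sample latents of the shared width from both surfaces; the temperature and pooling choices are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img: torch.Tensor, txt: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text latents.

    img, txt: (batch, d) pooled latents from the vision and language surfaces.
    """
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) +           # image -> matching text
            F.cross_entropy(logits.t(), targets)) / 2    # text -> matching image

loss = contrastive_alignment_loss(torch.randn(8, 4096), torch.randn(8, 4096))
```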

Latency budget

Vision is the surface most likely to bottleneck embodiment. Its latency tiers:

Tier | Per-frame latency | Used for
Realtime | < 16 ms | Manipulation, locomotion
Interactive | < 100 ms | User queries, document parsing
Batch | best effort | Indexing, retrieval

Realtime-tier validators run a distilled, lower-parameter variant of the surface. The distilled variant is published alongside the full checkpoint.
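
The relationship between the distilled variant and the full checkpoint can be sketched as a latent-matching objective, assuming the student is trained to reproduce the teacher's latent tokens; the cosine loss below is an assumption, not the published recipe.

```python
import torch
import torch.nn.functional as F

def latent_distillation_loss(student_latents: torch.Tensor,
                             teacher_latents: torch.Tensor) -> torch.Tensor:
    """Match the realtime student's latent tokens to the full model's.

    Both tensors: (batch, tokens, d). Cosine matching is one common choice;
    the actual published recipe may differ.
    """
    s = F.normalize(student_latents, dim=-1)
    t = F.normalize(teacher_latents.detach(), dim=-1)   # teacher is frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()

loss = latent_distillation_loss(torch.randn(2, 16, 4096), torch.randn(2, 16, 4096))
```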

Verification

Vision outputs are challenging to verify because the output space is high-dimensional and frequently subjective. The network handles this by verifying invariants rather than exact outputs:

  • Geometric invariance. Outputs on the same scene from different viewpoints must satisfy known geometric constraints.
  • Temporal coherence. Adjacent frame outputs must satisfy continuity bounds.
  • Cross-modal grounding. Visual outputs paired with text must satisfy contrastive alignment scores from a separate, audited model.

Outputs that violate invariants trigger the standard challenge mechanism.
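
Temporal coherence is the simplest of these invariants to sketch: adjacent per-frame latents must not jump by more than a bound. The continuity-bound idea is from the list above; the cosine-distance metric and the threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def violates_temporal_coherence(frame_latents: torch.Tensor,
                                max_step: float = 0.35) -> bool:
    """True if any adjacent pair of per-frame latents is too far apart.

    frame_latents: (frames, d) pooled latent per frame.
    max_step: illustrative cosine-distance bound; the real bound is calibrated.
    """
    x = F.normalize(frame_latents, dim=-1)
    cos_sim = (x[1:] * x[:-1]).sum(dim=-1)      # similarity of each adjacent frame pair
    return bool((1.0 - cos_sim > max_step).any())

# A smooth trajectory passes; a frame swapped for noise would trip the check.
print(violates_temporal_coherence(torch.randn(1, 64).repeat(8, 1)))  # -> False
```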

What this surface does not do

It does not reason about what it sees. A visual question that requires multi-step inference is forwarded to the reasoning surface. It does not remember scenes. Persistent visual memory lives in the memory surface as compressed latent traces.

Coupling with embodiment

Vision is the largest input to the embodiment surface. The interface between the two is specified in The VLA Architecture. Readers concerned with manipulation should treat that chapter as a continuation of this one.