Vision
The vision surface is responsible for converting pixel arrays and depth maps into shared-latent tokens, and conversely for emitting structured visual outputs (regions, captions, masks) from latent input. It is the largest surface by inbound bandwidth and one of the most latency-sensitive.
Architecture
The vision surface is a hierarchical encoder-decoder transformer with two stages:
Stage 1, Patch encoder. Variable-resolution input is normalized to a sequence of 14×14 patches with rotary spatial position encoding. Each image contributes up to 4,096 patches at the base level; resolution is allocated dynamically based on detected complexity (aspect-preserving area resize, with a maximum of 65,536 patches across the full pyramid).
Stage 2, Temporal compressor. For video and image sequences, a 4-layer temporal transformer fuses adjacent frames into a compressed latent stream. The compressor reduces per-frame token count by approximately 6× without observable loss on downstream tasks.
The final output is a sequence of latent tokens that occupy the same hidden dimension as the language surface (d = 4096).
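The two stages can be sketched in miniature. The following is an illustrative toy, not the actual model: `patchify` applies the 14×14 split and per-image cap from Stage 1, and `temporal_compress` stands in for the 4-layer temporal transformer with a simple mean-pool at the stated ~6× ratio.

```python
import numpy as np

PATCH = 14          # patch side length (from the spec)
MAX_PATCHES = 4096  # per-image patch cap at the base level

def patchify(image: np.ndarray) -> np.ndarray:
    """Split an HxWxC image into a sequence of flattened 14x14 patches."""
    h, w, c = image.shape
    h, w = h - h % PATCH, w - w % PATCH          # crop to a patch multiple
    patches = (image[:h, :w]
               .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, PATCH * PATCH * c))
    return patches[:MAX_PATCHES]                 # enforce the per-image cap

def temporal_compress(frame_tokens: list[np.ndarray], ratio: int = 6) -> np.ndarray:
    """Toy stand-in for the temporal compressor: mean-pool groups of
    `ratio` tokens along the flattened frame sequence."""
    seq = np.concatenate(frame_tokens)           # (T * N, d)
    n = len(seq) // ratio * ratio
    return seq[:n].reshape(-1, ratio, seq.shape[-1]).mean(axis=1)

frames = [patchify(np.random.rand(224, 224, 3)) for _ in range(8)]
latents = temporal_compress(frames)
print(latents.shape)  # 8 frames x 256 patches, compressed 6x -> (341, 588)
```

A 224×224 frame yields 16×16 = 256 patches; eight frames give 2,048 tokens, which the 6× compressor reduces to 341.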
Input modalities
The vision surface accepts:
| Modality | Resolution | Frame rate | Use |
|---|---|---|---|
| RGB still | up to 4K | — | Document, scene, object inputs |
| RGB video | up to 1080p | up to 60 Hz | Embodiment perception, event detection |
| RGBD | 720p | up to 30 Hz | 3D scene understanding, grasping |
| Stereo pairs | 720p × 2 | up to 30 Hz | Depth estimation, locomotion |
| Point cloud | up to 4M points | — | Spatial reasoning, mapping |
All modalities project into the same latent space. Modality is encoded via a learned modality embedding prepended to the input sequence.
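The modality embedding mechanism amounts to one extra token at the front of the sequence. A minimal sketch, assuming a lookup table keyed by the modalities above (the real embedding is learned during training; it is random here purely for illustration):

```python
import numpy as np

D = 4096  # shared hidden dimension with the language surface
MODALITIES = ["rgb_still", "rgb_video", "rgbd", "stereo", "point_cloud"]

# Learned table in the real model; random values here for illustration.
rng = np.random.default_rng(0)
modality_table = {m: rng.standard_normal(D) for m in MODALITIES}

def with_modality(tokens: np.ndarray, modality: str) -> np.ndarray:
    """Prepend the modality embedding to an (N, d) token sequence."""
    emb = modality_table[modality][None, :]   # (1, d)
    return np.concatenate([emb, tokens])

seq = with_modality(np.zeros((10, D)), "rgbd")
print(seq.shape)  # (11, 4096)
```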
Outputs
The vision surface emits four kinds of structured output:
- Latent tokens. The primary output; consumed by other surfaces.
- Region masks. Pixel-level segmentation, queryable by latent reference.
- Bounding boxes. Object localization with confidence.
- Captions. Routed to the language surface for tokenization.
A single forward pass can produce any subset of these.
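One way to picture the subset behavior is a request that names the output heads to decode. The request type, head names, and `run_forward` function below are hypothetical; the document does not specify a concrete interface:

```python
from dataclasses import dataclass, field

VALID_HEADS = {"latents", "masks", "boxes", "captions"}

@dataclass
class VisionRequest:
    # Latent tokens are the primary output, so they are the default head.
    heads: set[str] = field(default_factory=lambda: {"latents"})

def run_forward(req: VisionRequest) -> dict:
    """Decode only the requested heads in a single forward pass."""
    assert req.heads <= VALID_HEADS, "unknown output head requested"
    return {h: f"<{h} payload>" for h in req.heads}

out = run_forward(VisionRequest(heads={"latents", "boxes"}))
print(sorted(out))  # ['boxes', 'latents']
```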
Training signals
The vision surface is trained against five objectives:
- Masked patch reconstruction on web-scale image-text pairs.
- Contrastive image-text alignment against the language surface.
- Dense prediction (segmentation, depth) on curated annotation datasets.
- Video prediction for temporal coherence.
- Validator-contributed embodied video under the data-layer pipeline; this is the unique signal not available to centralized labs.
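A common way to combine objectives like these is a fixed weighted sum. The weights below are invented for illustration (the document states neither the weights nor the loss forms); the upweighting of the embodied signal is an assumption:

```python
# Hypothetical objective weights; not specified by this document.
LOSS_WEIGHTS = {
    "masked_patch": 1.0,  # masked patch reconstruction
    "contrastive":  1.0,  # image-text alignment
    "dense":        0.5,  # segmentation, depth
    "video_pred":   0.5,  # temporal coherence
    "embodied":     2.0,  # validator-contributed signal (assumed upweighted)
}

def total_loss(per_objective: dict[str, float]) -> float:
    """Combine per-objective loss values with fixed weights."""
    return sum(LOSS_WEIGHTS[k] * v for k, v in per_objective.items())

print(total_loss({k: 1.0 for k in LOSS_WEIGHTS}))  # 5.0
```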
Latency budget
Vision is the surface most likely to bottleneck embodiment. Its latency tiers:
| Tier | Per-frame latency | Used for |
|---|---|---|
| Realtime | < 16 ms | Manipulation, locomotion |
| Interactive | < 100 ms | User queries, document parsing |
| Batch | best-effort | Indexing, retrieval |
Realtime-tier validators run a distilled, lower-parameter variant of the surface; the distilled checkpoint is published alongside the full one.
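The tier table above implies a simple dispatch rule: pick the tier whose budget fits the caller's deadline, and serve realtime traffic from the distilled variant. A sketch, with the budgets taken from the table (the dispatch logic itself is an assumption):

```python
# Tier budgets in seconds, from the latency table above.
TIERS = [
    ("realtime", 0.016),     # manipulation, locomotion
    ("interactive", 0.100),  # user queries, document parsing
    ("batch", float("inf")), # indexing, retrieval (best-effort)
]

def select_tier(deadline_s: float) -> str:
    """Pick the first tier whose per-frame budget meets the deadline."""
    for name, budget in TIERS:
        if deadline_s <= budget:
            return name
    return "batch"

def model_variant(tier: str) -> str:
    # Realtime-tier validators run the published distilled checkpoint.
    return "distilled" if tier == "realtime" else "full"

tier = select_tier(0.010)
print(tier, model_variant(tier))  # realtime distilled
```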
Verification
Vision outputs are challenging to verify because the output space is high-dimensional and frequently subjective. The network handles this by verifying invariants rather than exact outputs:
- Geometric invariance. Outputs on the same scene from different viewpoints must satisfy known geometric constraints.
- Temporal coherence. Adjacent frame outputs must satisfy continuity bounds.
- Cross-modal grounding. Visual outputs paired with text must satisfy contrastive alignment scores from a separate, audited model.
Outputs that violate invariants trigger the standard challenge mechanism.
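The temporal-coherence invariant, for instance, can be checked without knowing the "correct" output: only the frame-to-frame delta is bounded. A minimal sketch, with an illustrative L2 bound (the actual continuity bounds are not specified here):

```python
import numpy as np

def temporally_coherent(outputs: list[np.ndarray], bound: float = 0.1) -> bool:
    """Temporal-coherence invariant: adjacent frame outputs may not
    move more than `bound` in L2 distance (illustrative metric/bound)."""
    return all(np.linalg.norm(b - a) <= bound
               for a, b in zip(outputs, outputs[1:]))

smooth = [np.full(4, t * 0.01) for t in range(5)]  # small per-frame drift
jumpy = smooth + [np.full(4, 10.0)]                # a discontinuity
print(temporally_coherent(smooth), temporally_coherent(jumpy))  # True False
```

A `False` result would not prove the output wrong; it would trigger the standard challenge mechanism, exactly as the invariant framing intends.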
What this surface does not do
It does not reason about what it sees. A visual question that requires multi-step inference is forwarded to the reasoning surface. It does not remember scenes. Persistent visual memory lives in the memory surface as compressed latent traces.
Coupling with embodiment
Vision is the largest input to the embodiment surface. The interface between the two is specified in The VLA Architecture. Readers concerned with manipulation should treat that chapter as a continuation of this one.