Vision
The vision surface is responsible for converting pixel arrays and depth maps into shared-latent tokens, and conversely for emitting structured visual outputs (regions, captions, masks) from latent input. It is the largest surface by inbound bandwidth and one of the most latency-sensitive.
Architecture
The vision surface is a hierarchical encoder-decoder transformer with two stages:
Stage 1, Patch encoder. Variable-resolution input is normalized to a sequence of 14×14 patches with rotary spatial position encoding. Each image contributes up to 4,096 patches at the base level; resolution is allocated dynamically based on detected complexity (aspect-preserving area resize, with a maximum of 65,536 patches across the full pyramid).
Stage 2, Temporal compressor. For video and image sequences, a 4-layer temporal transformer fuses adjacent frames into a compressed latent stream. The compressor reduces per-frame token count by approximately 6× without observable loss on downstream tasks.
The final output is a sequence of latent tokens that occupy the same hidden dimension as the language surface (d = 4096).
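The two stages can be sketched in miniature. The following is an illustrative toy, not the actual model: `patchify` applies the 14×14 split and per-image cap from Stage 1, and `temporal_compress` stands in for the 4-layer temporal transformer with a simple mean-pool at the stated ~6× ratio.

```python
import numpy as np

PATCH = 14          # patch side length (from the spec)
MAX_PATCHES = 4096  # per-image patch cap at the base level

def patchify(image: np.ndarray) -> np.ndarray:
    """Split an HxWxC image into a sequence of flattened 14x14 patches."""
    h, w, c = image.shape
    h, w = h - h % PATCH, w - w % PATCH          # crop to a patch multiple
    patches = (image[:h, :w]
               .reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, PATCH * PATCH * c))
    return patches[:MAX_PATCHES]                 # enforce the per-image cap

def temporal_compress(frame_tokens: list[np.ndarray], ratio: int = 6) -> np.ndarray:
    """Toy stand-in for the temporal compressor: mean-pool groups of
    `ratio` tokens along the flattened frame sequence."""
    seq = np.concatenate(frame_tokens)           # (T * N, d)
    n = len(seq) // ratio * ratio
    return seq[:n].reshape(-1, ratio, seq.shape[-1]).mean(axis=1)

frames = [patchify(np.random.rand(224, 224, 3)) for _ in range(8)]
latents = temporal_compress(frames)
print(latents.shape)  # 8 frames x 256 patches, compressed 6x -> (341, 588)
```

A 224×224 frame yields 16×16 = 256 patches; eight frames give 2,048 tokens, which the 6× compressor reduces to 341.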
Input modalities
The vision surface accepts:
| Modality | Resolution | Frame rate | Use |
|---|---|---|---|
| RGB still | up to 4K | — | Document, scene, object inputs |
| RGB video | up to 1080p | up to 60 Hz | Embodiment perception, event detection |
| RGBD | 720p | up to 30 Hz | 3D scene understanding, grasping |
| Stereo pairs | 720p × 2 | up to 30 Hz | Depth estimation, locomotion |
| Point cloud | up to 4M points | — | Spatial reasoning, mapping |
All modalities project into the same latent space. Modality is encoded via a learned modality embedding prepended to the input sequence.
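The modality embedding mechanism amounts to one extra token at the front of the sequence. A minimal sketch, assuming a lookup table keyed by the modalities above (the real embedding is learned during training; it is random here purely for illustration):

```python
import numpy as np

D = 4096  # shared hidden dimension with the language surface
MODALITIES = ["rgb_still", "rgb_video", "rgbd", "stereo", "point_cloud"]

# Learned table in the real model; random values here for illustration.
rng = np.random.default_rng(0)
modality_table = {m: rng.standard_normal(D) for m in MODALITIES}

def with_modality(tokens: np.ndarray, modality: str) -> np.ndarray:
    """Prepend the modality embedding to an (N, d) token sequence."""
    emb = modality_table[modality][None, :]   # (1, d)
    return np.concatenate([emb, tokens])

seq = with_modality(np.zeros((10, D)), "rgbd")
print(seq.shape)  # (11, 4096)
```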
Outputs
The vision surface emits four kinds of structured output:
- Latent tokens. The primary output; consumed by other surfaces.
- Region masks. Pixel-level segmentation, queryable by latent reference.
- Bounding boxes. Object localization with confidence.
- Captions. Routed to the language surface for tokenization.
A single forward pass can produce any subset of these.
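One way to picture the subset behavior is a request that names the output heads to decode. The request type, head names, and `run_forward` function below are hypothetical; the document does not specify a concrete interface:

```python
from dataclasses import dataclass, field

VALID_HEADS = {"latents", "masks", "boxes", "captions"}

@dataclass
class VisionRequest:
    # Latent tokens are the primary output, so they are the default head.
    heads: set[str] = field(default_factory=lambda: {"latents"})

def run_forward(req: VisionRequest) -> dict:
    """Decode only the requested heads in a single forward pass."""
    assert req.heads <= VALID_HEADS, "unknown output head requested"
    return {h: f"<{h} payload>" for h in req.heads}

out = run_forward(VisionRequest(heads={"latents", "boxes"}))
print(sorted(out))  # ['boxes', 'latents']
```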
Training signals
The vision surface is trained against five objectives:
- Masked patch reconstruction on web-scale image-text pairs.
- Contrastive image-text alignment against the language surface.
- Dense prediction (segmentation, depth) on curated annotation datasets.
- Video prediction for temporal coherence.
- Validator-contributed embodied video under the data-layer pipeline; this is the unique signal not available to centralized labs.
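A common way to combine objectives like these is a fixed weighted sum. The weights below are invented for illustration (the document states neither the weights nor the loss forms); the upweighting of the embodied signal is an assumption:

```python
# Hypothetical objective weights; not specified by this document.
LOSS_WEIGHTS = {
    "masked_patch": 1.0,  # masked patch reconstruction
    "contrastive":  1.0,  # image-text alignment
    "dense":        0.5,  # segmentation, depth
    "video_pred":   0.5,  # temporal coherence
    "embodied":     2.0,  # validator-contributed signal (assumed upweighted)
}

def total_loss(per_objective: dict[str, float]) -> float:
    """Combine per-objective loss values with fixed weights."""
    return sum(LOSS_WEIGHTS[k] * v for k, v in per_objective.items())

print(total_loss({k: 1.0 for k in LOSS_WEIGHTS}))  # 5.0
```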
Latency budget
Vision is the surface most likely to bottleneck embodiment. Its latency tiers:
| Tier | Per-frame latency | Used for |
|---|---|---|
| Realtime | < 16 ms | Manipulation, locomotion |
| Interactive | < 100 ms | User queries, document parsing |
| Batch | best-effort | Indexing, retrieval |
Realtime-tier validators run a distilled, lower-parameter variant of the surface; the distilled checkpoint is published alongside the full one.
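The tier table above implies a simple dispatch rule: pick the tier whose budget fits the caller's deadline, and serve realtime traffic from the distilled variant. A sketch, with the budgets taken from the table (the dispatch logic itself is an assumption):

```python
# Tier budgets in seconds, from the latency table above.
TIERS = [
    ("realtime", 0.016),     # manipulation, locomotion
    ("interactive", 0.100),  # user queries, document parsing
    ("batch", float("inf")), # indexing, retrieval (best-effort)
]

def select_tier(deadline_s: float) -> str:
    """Pick the first tier whose per-frame budget meets the deadline."""
    for name, budget in TIERS:
        if deadline_s <= budget:
            return name
    return "batch"

def model_variant(tier: str) -> str:
    # Realtime-tier validators run the published distilled checkpoint.
    return "distilled" if tier == "realtime" else "full"

tier = select_tier(0.010)
print(tier, model_variant(tier))  # realtime distilled
```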
Verification
Vision outputs are challenging to verify because the output space is high-dimensional and frequently subjective. The network handles this by verifying invariants rather than exact outputs:
- Geometric invariance. Outputs on the same scene from different viewpoints must satisfy known geometric constraints.
- Temporal coherence. Adjacent frame outputs must satisfy continuity bounds.
- Cross-modal grounding. Visual outputs paired with text must satisfy contrastive alignment scores from a separate, audited model.
Outputs that violate invariants trigger the standard challenge mechanism.
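The temporal-coherence invariant, for instance, can be checked without knowing the "correct" output: only the frame-to-frame delta is bounded. A minimal sketch, with an illustrative L2 bound (the actual continuity bounds are not specified here):

```python
import numpy as np

def temporally_coherent(outputs: list[np.ndarray], bound: float = 0.1) -> bool:
    """Temporal-coherence invariant: adjacent frame outputs may not
    move more than `bound` in L2 distance (illustrative metric/bound)."""
    return all(np.linalg.norm(b - a) <= bound
               for a, b in zip(outputs, outputs[1:]))

smooth = [np.full(4, t * 0.01) for t in range(5)]  # small per-frame drift
jumpy = smooth + [np.full(4, 10.0)]                # a discontinuity
print(temporally_coherent(smooth), temporally_coherent(jumpy))  # True False
```

A `False` result would not prove the output wrong; it would trigger the standard challenge mechanism, exactly as the invariant framing intends.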
What this surface does not do
It does not reason about what it sees. A visual question that requires multi-step inference is forwarded to the reasoning surface. It does not remember scenes. Persistent visual memory lives in the memory surface as compressed latent traces.
Coupling with embodiment
Vision is the largest input to the embodiment surface. The interface between the two is specified in The VLA Architecture. Readers concerned with manipulation should treat that chapter as a continuation of this one.