Reasoning & Planning
The reasoning surface produces structured, multi-step outputs from latent input. Its outputs are not raw token streams (those are produced by the language surface) but plans, proofs, and decisions in a constrained grammar.
Why reasoning is its own surface
A common architectural mistake is to fold reasoning into language by training a single transformer to produce chain-of-thought as plain text. This works, but:
- It conflates what to say with how to think. The language surface ends up specializing for both and excels at neither at the frontier.
- It makes verification harder. Plain text reasoning is hard to check; structured reasoning is checkable by construction.
- It wastes compute. Reasoning has a different latency profile than language and benefits from a smaller, deeper architecture.
OGI separates the two. The language surface emits natural-language outputs. The reasoning surface emits structured plans.
Architecture
The reasoning surface is a transformer with:
| Property | Value |
|---|---|
| Parameters | 2.1B dense |
| Layers | 64 (deeper than the language surface) |
| Hidden dimension | 4,096 |
| Output grammar | Typed plan trees |
The output is not a token stream. It is a tree, emitted node-by-node, where each node has a type (assertion, subgoal, action, branch, terminal) and a payload (latent tokens consumed by other surfaces).
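As a sketch of that emission format (the five node types are from the grammar above; the field names and the integer-id payload representation are illustrative assumptions, not the OGI spec):

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class NodeType(Enum):
    ASSERTION = auto()
    SUBGOAL = auto()
    ACTION = auto()
    BRANCH = auto()
    TERMINAL = auto()

@dataclass
class PlanNode:
    node_type: NodeType
    payload: list[int]  # latent token references consumed by other surfaces
    children: list["PlanNode"] = field(default_factory=list)
```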
The plan grammar
Plans are typed trees. A minimal example:
```
goal: "open the drawer to retrieve the red object"
├─ subgoal: "locate the drawer"
│  └─ action: <embodiment: visual-scan>
├─ subgoal: "open the drawer"
│  ├─ action: <embodiment: approach>
│  └─ action: <embodiment: grasp-handle, pull>
└─ subgoal: "extract the red object"
   ├─ branch: <visible? object>
   │  ├─ true: <embodiment: grasp, lift>
   │  └─ false: <subgoal: "search the drawer">
   └─ terminal: <return>
```
Each <action> and <subgoal> is a latent token reference. Other surfaces consume them in the order dictated by tree traversal.
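The text above does not pin the traversal order down; assuming a depth-first, left-to-right walk over the PlanNode sketch from the previous section, consumption order would look like:

```python
from collections.abc import Iterator

def consumption_order(root: PlanNode) -> Iterator[PlanNode]:
    """Depth-first, left-to-right walk: an assumed order, not the spec's.

    Branch nodes yield both arms here; at execution time only the arm
    selected by the branch condition is actually consumed.
    """
    yield root
    for child in root.children:
        yield from consumption_order(child)
```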
Planning algorithms
The reasoning surface is trained to produce trees, but at inference time the trees can be expanded by search. Three search modes are supported:
- Greedy. Single-pass tree generation. Lowest latency, weakest guarantees.
- Best-first. Beam search over partial trees, scored by a learned heuristic. Standard mode.
- Monte Carlo Tree Search. Used when the action space is large and a value model is available (typically: embodied planning, code synthesis).
Mode selection is dictated by the latency budget of the calling context. An interactive query uses greedy or best-first. An embodied task with safety implications uses MCTS.
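A minimal sketch of the best-first mode, assuming a learned heuristic `score`, an emitter `expand` that grows a partial tree by one node, and an `is_complete` predicate (all three are stand-in callables, not OGI interfaces; the beam and expansion budgets are invented defaults):

```python
import heapq
import itertools

def best_first(root, expand, score, is_complete,
               beam_width=8, max_expansions=256):
    """Best-first search over partial plan trees; highest-scoring first."""
    tie = itertools.count()  # tie-breaker so trees themselves never compare
    frontier = [(-score(root), next(tie), root)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, tree = heapq.heappop(frontier)
        if is_complete(tree):
            return tree  # first complete tree popped is the best in the frontier
        for cand in sorted(expand(tree), key=score, reverse=True)[:beam_width]:
            heapq.heappush(frontier, (-score(cand), next(tie), cand))
    return None  # budget exhausted; the caller falls back or abstains
```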
Self-verification
Reasoning outputs are validated against the latent state they were produced from. The grammar makes this tractable: each plan node has a typed signature, and the verifier checks that the inputs and outputs of each node match the latent tokens declared in its payload.
Plans that fail self-verification do not leave the surface. They are either repaired (the surface re-emits the failing subtree) or abandoned (the surface returns an abstain token to the caller).
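A sketch of the verify-repair-abstain loop, assuming a per-node signature checker and a subtree re-emitter (both hypothetical helpers; the repair budget is an invented knob):

```python
ABSTAIN = object()  # sentinel returned to the caller instead of an invalid plan

def verify_or_repair(node, check_signature, reemit_subtree, max_repairs=2):
    """Check each node's typed signature against its declared payload;
    re-emit a failing subtree up to max_repairs times, then abstain."""
    attempts = 0
    while True:
        if check_signature(node):
            repaired = [verify_or_repair(child, check_signature,
                                         reemit_subtree, max_repairs)
                        for child in node.children]
            if all(r is not ABSTAIN for r in repaired):
                node.children = repaired
                return node
        if attempts == max_repairs:
            return ABSTAIN  # no valid subtree within the repair budget
        node = reemit_subtree(node)  # repair: re-emit this subtree
        attempts += 1
```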
Training
Three training signals:
- Distillation from search. Plans produced by MCTS are distilled back into greedy-emission targets.
- Outcome supervision. Plans whose terminal results pass downstream evaluation are upweighted; those that fail are downweighted.
- Process supervision. Per-node correctness annotations on a curated set of human-verified plans.
The mixture and ratios are detailed in Pretraining.
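The three signals can be pictured as a weighted mixture over a shared batch; the weights below are placeholders, since the real ratios are specified in Pretraining, and the three loss callables are stand-ins:

```python
def reasoning_loss(batch, distill, outcome, process,
                   w_distill=0.5, w_outcome=0.3, w_process=0.2):
    """Illustrative mixture of the three training signals.

    distill(batch): loss against greedy-emission targets distilled from MCTS.
    outcome(batch): per-plan loss scaled up or down by whether the terminal
                    result passed downstream evaluation.
    process(batch): per-node loss against human-verified annotations.
    """
    return (w_distill * distill(batch)
            + w_outcome * outcome(batch)
            + w_process * process(batch))
```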
Failure modes
- Loop emission. Trees that recurse without progress. Mitigation: a structural penalty during training, plus runtime depth and node-count limits (sketched after this list).
- Premature commitment. Trees that emit a terminal before their subgoals are sufficiently expanded. Mitigation: outcome supervision; the surface learns that early termination is rarely rewarded.
- Plan-action mismatch. Plans that reference actions the embodiment surface cannot execute. Mitigation: the embodiment surface emits an action-feasibility vector that the reasoning surface conditions on.
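The runtime limits from the first mitigation are straightforward to sketch (the specific bounds are invented defaults):

```python
def within_limits(root, max_depth=32, max_nodes=512):
    """Runtime guard against loop emission: reject a tree that exceeds
    the depth or node-count budget before it reaches a consumer."""
    stack, count = [(root, 1)], 0
    while stack:
        node, depth = stack.pop()
        count += 1
        if depth > max_depth or count > max_nodes:
            return False
        stack.extend((child, depth + 1) for child in node.children)
    return True
```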
What reasoning does not do
It does not write English. When a plan must be rendered as text for human inspection, the language surface converts the structured tree. It does not act. Plan nodes typed as <action> are claims that an action should occur; the embodiment surface decides whether and how to execute them.
The codex's separation here is strict. A system in which reasoning and action are entangled is harder to verify and harder to interrupt. Both properties are critical.