
Continual Learning

Continual learning is the system's capacity to acquire new competences without forgetting prior ones. The codex's general intelligence cannot be a single training run; it must be a continuous process over the deployed network.

The problem

A network trained on task $A$ then fine-tuned on task $B$ typically exhibits catastrophic forgetting: performance on $A$ degrades sharply. Formally, if $\theta^* = \arg\min_\theta \mathcal{L}_B(\theta)$ is the unconstrained optimum on $B$, then

$$\mathcal{L}_A(\theta^*) \gg \mathcal{L}_A(\theta_{\mathrm{pre}})$$

in general. Continual learning is the family of techniques that prevent this.
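As a toy illustration, the following PyTorch sketch reproduces the effect (the tasks, dimensions, and step counts are illustrative, not the codex's setup): a network fit on task $A$, then fine-tuned on task $B$ with no constraint, loses most of its task-$A$ performance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two toy regression tasks with different input-output mappings.
X_a, X_b = torch.randn(512, 16), torch.randn(512, 16)
y_a, y_b = X_a @ torch.randn(16, 1), X_b @ torch.randn(16, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(X, y, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

train(X_a, y_a)                                # fit task A
loss_a_pre = loss_fn(model(X_a), y_a).item()   # L_A(theta_pre): low
train(X_b, y_b)                                # unconstrained fine-tune on task B
loss_a_post = loss_fn(model(X_a), y_a).item()  # L_A(theta*): typically much higher
print(f"L_A before B: {loss_a_pre:.4f}, after B: {loss_a_post:.4f}")
```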

Three regimes

The codex distinguishes three continual-learning regimes:

| Regime | What changes | Cadence | Example |
| --- | --- | --- | --- |
| Task addition | New task, existing modalities | Weekly | New benchmark passes added |
| Modality addition | New modality | Monthly | New sensor or actuator type |
| Distribution shift | Underlying data distribution | Continuous | Web crawl drift, embodied data accumulation |

Each regime is handled by different machinery.

Elastic weight consolidation

For task addition, the codex applies elastic weight consolidation with the loss

$$\mathcal{L}_{B|A}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2}\sum_i F_i\,(\theta_i - \theta_i^A)^2$$

where $F_i$ is the $i$-th diagonal entry of the Fisher information matrix evaluated at $\theta^A$, and $\theta_i^A$ is the parameter value after training on task $A$. Parameters important for $A$ (high $F_i$) are penalized for moving; unimportant parameters are free to change.

The Fisher information is estimated empirically over a small replay buffer of $A$'s training data. The penalty coefficient $\lambda$ is tuned per task pair.
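A minimal PyTorch sketch of this machinery, assuming a `model`, a `loss_fn`, and a small `replay_loader` over task $A$'s data (all names are illustrative, not the codex's API):

```python
import torch

def estimate_diag_fisher(model, replay_loader, loss_fn):
    """Empirical diagonal Fisher at the current parameters, averaged
    over a small replay buffer of task A's training data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in replay_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # Squared gradients approximate the diagonal of the Fisher.
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, theta_a, lam):
    """(lambda / 2) * sum_i F_i (theta_i - theta_i^A)^2, where theta_a is
    a snapshot {n: p.detach().clone()} taken right after task-A training."""
    penalty = sum((fisher[n] * (p - theta_a[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty

# During task-B training the total objective becomes:
#   loss = loss_fn(model(x_b), y_b) + ewc_penalty(model, fisher_a, theta_a, lam)
```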

Experience replay

For distribution shift, the dominant mechanism is replay. The training mixture at any update includes a fraction of historical data drawn from a reservoir-sampled buffer. The mixture ratio is typically 90% new / 10% historical, though it varies by surface.
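A minimal sketch of the buffer and the mixture, treating items as opaque training examples (class and function names are illustrative):

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer; every item seen so far has equal
    probability of being retained (classic reservoir sampling)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item  # replace with decreasing probability

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

def mixed_batch(new_items, buffer, batch_size, historical_frac=0.10):
    """Draw a batch that is ~90% new data and ~10% historical replay."""
    n_hist = int(batch_size * historical_frac)
    batch = random.sample(new_items, batch_size - n_hist) + buffer.sample(n_hist)
    random.shuffle(batch)
    return batch
```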

The replay buffer is sharded across validators: each validator holds a slice, and updates draw from a federated sample of the global buffer.

Adapter-based addition

For modality addition, new modalities are introduced as adapters rather than as full fine-tunes. An adapter is a low-rank update

$$W' = W + BA,\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times d},\quad r \ll d$$

trained while the base $W$ is frozen. The new modality's encoder and adapter are fully trainable; the rest of the network is read-only.
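A minimal PyTorch sketch of such an adapter wrapped around a frozen linear layer (the rank, initialization scale, and class name are illustrative):

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """W' = W + B A, with the base W frozen and only B, A trainable."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights are read-only
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B in R^{d x r}; zero-init so W' = W at start

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```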

After eval-suite verification, adapters can be merged into the base or retained as separate modules. Most adapters remain separate; merging is reserved for adapters that benefit a sufficiently broad set of downstream tasks.
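Continuing the sketch above, a merge folds the low-rank update into the frozen base weights in place:

```python
@torch.no_grad()
def merge_adapter(adapter: LowRankAdapter) -> nn.Linear:
    """Fold the low-rank update into the base: W <- W + B A."""
    adapter.base.weight += adapter.B @ adapter.A
    return adapter.base
```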

The forgetting eval

Continual learning is verified through a forgetting eval: a regression suite of historical tasks run before and after every continual-learning update. An update that regresses any historical task by more than a configured threshold is rejected by the protocol's eval gate (see Validators). The threshold is per-task and is set at training time.
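A minimal sketch of the gate's decision logic, assuming per-task scores where higher is better (the function name and score layout are assumptions, not the protocol's interface):

```python
def passes_forgetting_gate(scores_before, scores_after, thresholds):
    """Reject an update if any historical task regresses by more than
    its per-task threshold. All arguments map task name -> value."""
    for task, before in scores_before.items():
        regression = before - scores_after[task]
        if regression > thresholds[task]:
            return False  # the eval gate rejects the update
    return True
```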

Versioning

Continual learning produces a long chain of checkpoints. The codex uses a directed-acyclic-graph version model: each checkpoint references its parent(s) and the training data that produced it. Forks are permitted (a checkpoint can branch for an experimental update); merges are constrained (two diverged checkpoints can only be merged by training a new checkpoint on the union of their data).
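A minimal sketch of this version model, with merging constrained to produce a new node trained on the union of the parents' data (types and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    """A node in the checkpoint DAG: parent references plus the
    training data that produced this checkpoint."""
    id: str
    parents: tuple[str, ...]   # () for a root, one entry for an update or fork, two for a merge
    data_refs: frozenset[str]  # references to this checkpoint's training data

def merge_checkpoints(a: Checkpoint, b: Checkpoint, new_id: str) -> Checkpoint:
    """Diverged checkpoints merge only by training a new checkpoint on
    the union of their data; there is no direct weight-level merge."""
    return Checkpoint(id=new_id,
                      parents=(a.id, b.id),
                      data_refs=a.data_refs | b.data_refs)
```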

Version state is on-chain. The active routing table points to a single checkpoint per surface, updated by governance.

Why this is the codex's hardest problem

Continual learning is the surface most likely to fail silently. Forgetting can be subtle; an eval suite cannot cover every regression. The codex's response is to (1) maintain a deliberately broad eval suite, (2) require validator-attested replays of historical tasks, and (3) keep the version chain long enough that any regression can be traced and reverted.