
Continual Learning

Continual learning is the system's capacity to acquire new competences without forgetting prior ones. The codex's general intelligence cannot be a single training run; it must be a continuous process over the deployed network.

The problem

A network trained on task $A$ then fine-tuned on task $B$ typically exhibits catastrophic forgetting: performance on $A$ degrades sharply. Formally, if $\theta^* = \arg\min_\theta \mathcal{L}_B(\theta)$ is the unconstrained optimum on $B$, then

$$\mathcal{L}_A(\theta^*) \gg \mathcal{L}_A(\theta_{\mathrm{pre}})$$

in general. Continual learning is the family of techniques that prevent this.
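As a toy illustration, the following PyTorch sketch reproduces the effect (the tasks, dimensions, and step counts are illustrative, not the codex's setup): a network fit on task $A$, then fine-tuned on task $B$ with no constraint, loses most of its task-$A$ performance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two toy regression tasks with different input-output mappings.
X_a, X_b = torch.randn(512, 16), torch.randn(512, 16)
y_a, y_b = X_a @ torch.randn(16, 1), X_b @ torch.randn(16, 1)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def train(X, y, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

train(X_a, y_a)                                # fit task A
loss_a_pre = loss_fn(model(X_a), y_a).item()   # L_A(theta_pre): low
train(X_b, y_b)                                # unconstrained fine-tune on task B
loss_a_post = loss_fn(model(X_a), y_a).item()  # L_A(theta*): typically much higher
print(f"L_A before B: {loss_a_pre:.4f}, after B: {loss_a_post:.4f}")
```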

Three regimes

The codex distinguishes three continual-learning regimes:

| Regime | What changes | Cadence | Example |
| --- | --- | --- | --- |
| Task addition | New task, existing modalities | Weekly | New benchmark passes added |
| Modality addition | New modality | Monthly | New sensor or actuator type |
| Distribution shift | Underlying data distribution | Continuous | Web crawl drift, embodied data accumulation |

Each regime is handled by different machinery.

Elastic weight consolidation

For task addition, the codex applies elastic weight consolidation with the loss

$$\mathcal{L}_{B|A}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2}\sum_i F_i\,(\theta_i - \theta_i^A)^2$$

where $F_i$ is the $i$-th diagonal entry of the Fisher information matrix evaluated at $\theta^A$, and $\theta_i^A$ is the parameter value after training on task $A$. Parameters important for $A$ (high $F_i$) are penalized for moving; unimportant parameters are free to change.

The Fisher information is estimated empirically over a small replay buffer of $A$'s training data. The penalty coefficient $\lambda$ is tuned per task pair.
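A minimal PyTorch sketch of this machinery, assuming a `model`, a `loss_fn`, and a small `replay_loader` over task $A$'s data (all names are illustrative, not the codex's API):

```python
import torch

def estimate_diag_fisher(model, replay_loader, loss_fn):
    """Empirical diagonal Fisher at the current parameters, averaged
    over a small replay buffer of task A's training data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in replay_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # Squared gradients approximate the diagonal of the Fisher.
                fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, theta_a, lam):
    """(lambda / 2) * sum_i F_i (theta_i - theta_i^A)^2, where theta_a is
    a snapshot {n: p.detach().clone()} taken right after task-A training."""
    penalty = sum((fisher[n] * (p - theta_a[n]) ** 2).sum()
                  for n, p in model.named_parameters())
    return 0.5 * lam * penalty

# During task-B training the total objective becomes:
#   loss = loss_fn(model(x_b), y_b) + ewc_penalty(model, fisher_a, theta_a, lam)
```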

Experience replay

For distribution shift, the dominant mechanism is replay. The training mixture at any update includes a fraction of historical data drawn from a reservoir-sampled buffer. The mixture ratio is typically 90% new / 10% historical, though it varies by surface.
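A minimal sketch of the buffer and the mixture, treating items as opaque training examples (class and function names are illustrative):

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer; every item seen so far has equal
    probability of being retained (classic reservoir sampling)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item  # replace with decreasing probability

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

def mixed_batch(new_items, buffer, batch_size, historical_frac=0.10):
    """Draw a batch that is ~90% new data and ~10% historical replay."""
    n_hist = int(batch_size * historical_frac)
    batch = random.sample(new_items, batch_size - n_hist) + buffer.sample(n_hist)
    random.shuffle(batch)
    return batch
```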

The replay buffer is sharded across validators: each validator holds a slice, and updates draw from a federated sample of the global buffer.

Adapter-based addition

For modality addition, new modalities are introduced as adapters rather than as full fine-tunes. An adapter is a low-rank update

$$W' = W + BA,\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times d},\quad r \ll d$$

trained while the base $W$ is frozen. The new modality's encoder and adapter are fully trainable; the rest of the network is read-only.
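A minimal PyTorch sketch of such an adapter wrapped around a frozen linear layer (the rank, initialization scale, and class name are illustrative):

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """W' = W + B A, with the base W frozen and only B, A trainable."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights are read-only
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B in R^{d x r}; zero-init so W' = W at start

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```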

After eval-suite verification, adapters can be merged into the base or retained as separate modules. Most adapters remain separate; merging is reserved for adapters that benefit a sufficiently broad set of downstream tasks.
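Continuing the sketch above, a merge folds the low-rank update into the frozen base weights in place:

```python
@torch.no_grad()
def merge_adapter(adapter: LowRankAdapter) -> nn.Linear:
    """Fold the low-rank update into the base: W <- W + B A."""
    adapter.base.weight += adapter.B @ adapter.A
    return adapter.base
```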

The forgetting eval

Continual learning is verified through a forgetting eval: a regression suite of historical tasks run before and after every continual-learning update. An update that regresses any historical task by more than a configured threshold is rejected by the protocol's eval gate (see Validators). The threshold is per-task and is set at training time.
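A minimal sketch of the gate's decision logic, assuming per-task scores where higher is better (the function name and score layout are assumptions, not the protocol's interface):

```python
def passes_forgetting_gate(scores_before, scores_after, thresholds):
    """Reject an update if any historical task regresses by more than
    its per-task threshold. All arguments map task name -> value."""
    for task, before in scores_before.items():
        regression = before - scores_after[task]
        if regression > thresholds[task]:
            return False  # the eval gate rejects the update
    return True
```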

Versioning

Continual learning produces a long chain of checkpoints. The codex uses a directed-acyclic-graph version model: each checkpoint references its parent(s) and the training data that produced it. Forks are permitted (a checkpoint can branch for an experimental update); merges are constrained (two diverged checkpoints can only be merged by training a new checkpoint on the union of their data).
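A minimal sketch of this version model, with merging constrained to produce a new node trained on the union of the parents' data (types and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    """A node in the checkpoint DAG: parent references plus the
    training data that produced this checkpoint."""
    id: str
    parents: tuple[str, ...]   # () for a root, one entry for an update or fork, two for a merge
    data_refs: frozenset[str]  # references to this checkpoint's training data

def merge_checkpoints(a: Checkpoint, b: Checkpoint, new_id: str) -> Checkpoint:
    """Diverged checkpoints merge only by training a new checkpoint on
    the union of their data; there is no direct weight-level merge."""
    return Checkpoint(id=new_id,
                      parents=(a.id, b.id),
                      data_refs=a.data_refs | b.data_refs)
```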

Version state is on-chain. The active routing table points to a single checkpoint per surface, updated by governance.

Why this is the codex's hardest problem

Continual learning is the surface most likely to fail silently. Forgetting can be subtle; an eval suite cannot cover every regression. The codex's response is to (1) maintain a deliberately broad eval suite, (2) require validator-attested replays of historical tasks, and (3) keep the version chain long enough that any regression can be traced and reverted.