Data Layer
The data layer holds everything the model has been or will be trained on, everything it has produced, and everything users have submitted to it. Its size is multiple orders of magnitude greater than the chain's. Most of it never touches the settlement layer; it is referenced by hash and held by validators.
What the data layer holds
Four classes of data:
- Training manifests. Per-checkpoint manifests describing the data used to train that checkpoint: source URLs, licenses, hashes, sampling weights. Manifests are small (kilobytes); the data they reference is large.
- Bulk training data. The actual bytes referenced by manifests: text, image, video, embodied demonstrations. Held in content-addressed storage.
- Inference logs. A sampled subset of inference requests and outputs, retained for evaluation, regression testing, and auditing.
- Embodied episode contributions. New demonstrations submitted by validators or users; the unique data source for the embodiment surfaces.
Content addressing
All bulk data is referenced by its hash. The hash function is BLAKE3 for new content; legacy SHA-256 references are accepted for backwards compatibility with imported datasets.
A reference to data is a tuple (hash, format, size). Resolution is the responsibility of the data layer: any validator holding a chunk with that hash can serve it. There is no global directory; resolution proceeds by:
- Asking the requesting validator's peers.
- Falling back to the network's distributed hash table.
- Falling back to public mirrors (Arweave, IPFS, public S3 buckets), in that order.
The protocol does not require any specific storage backend. It requires only that any participant can produce the bytes corresponding to a hash when paid to do so.
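The resolution fallbacks above can be sketched as an ordered list of lookup functions, tried until one produces the bytes. The `DataRef` tuple, the lookup callables, and the length check are illustrative; the spec defines only the (hash, format, size) reference, not this API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class DataRef:
    """A reference to bulk data: the (hash, format, size) tuple."""
    hash: bytes      # BLAKE3 digest (legacy SHA-256 also accepted)
    format: str      # "text" | "image" | "video" | "audio" | "episode"
    size_bytes: int

def resolve(ref: DataRef,
            lookups: list[Callable[[bytes], Optional[bytes]]]) -> bytes:
    """Try each lookup in order: peers, then the DHT, then public mirrors.
    A result only counts if its size matches the reference."""
    for lookup in lookups:
        data = lookup(ref.hash)
        if data is not None and len(data) == ref.size_bytes:
            return data
    raise KeyError(f"no provider produced bytes for {ref.hash.hex()}")
```

Because resolution is backend-agnostic, a mirror lookup and a peer lookup are interchangeable here: each is just a function from hash to bytes-or-nothing.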
Manifests
A training manifest is a structured document with the following schema (simplified):
manifest {
  version: string
  produced_by: validator_id
  produced_at: timestamp
  signature: bytes
  sources: list[source]
  splits: list[split]
}

source {
  hash: bytes32
  format: enum{text, image, video, audio, episode}
  size_bytes: u64
  license: license_id
  origin: optional[url]
  filter_chain: list[filter_id]
}

split {
  name: string  // pretrain | finetune | eval | replay | ...
  weight: float
  source_indices: list[u32]
}
Manifests are themselves content-addressed; a checkpoint references its manifest by hash, and the manifest references its sources by hash. The chain of hashes makes every byte in the training pipeline traceable.
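The hash chain above (checkpoint → manifest → sources) can be sketched with any canonical serialization plus a content hash. The field values, the JSON canonicalization, and the use of `hashlib.sha256` are illustrative stand-ins; the spec prescribes BLAKE3 for new content, not this serialization.

```python
import hashlib
import json

def content_hash(obj) -> str:
    """Canonical JSON, then hash. sha256 stands in for the real digest."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

source = {"hash": "ab" * 32, "format": "text", "size_bytes": 1024,
          "license": "cc-by-4.0", "filter_chain": ["dedupe", "pii"]}
manifest = {"version": "1", "produced_by": "validator-7",
            "sources": [source],
            "splits": [{"name": "pretrain", "weight": 1.0,
                        "source_indices": [0]}]}

manifest_hash = content_hash(manifest)
checkpoint = {"weights_hash": "cd" * 32, "manifest": manifest_hash}
# Trace any training byte: checkpoint -> manifest_hash -> manifest -> source["hash"]
```

Any change to a source changes the manifest hash, which changes what the checkpoint references; that is the traceability property in miniature.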
License enforcement
Every source carries a license tag. A small set of approved licenses is whitelisted by governance; sources with unapproved licenses are excluded by the build pipeline. The list of approved licenses is published and revisable.
This is the only point in the network where the protocol takes a position on intellectual property. Outside the build pipeline, the network is license-neutral; inside it, the protocol refuses to train on incompatible data.
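The build-pipeline exclusion is a straightforward partition of sources by license tag. The license identifiers in the whitelist below are hypothetical placeholders for the governance-published list.

```python
# Hypothetical stand-in for the governance-approved license whitelist.
APPROVED_LICENSES = {"cc0", "cc-by-4.0", "mit", "apache-2.0"}

def partition_sources(sources: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split manifest sources into (included, excluded) by license tag.
    Only included sources enter the build pipeline."""
    included, excluded = [], []
    for src in sources:
        if src["license"] in APPROVED_LICENSES:
            included.append(src)
        else:
            excluded.append(src)
    return included, excluded
```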
Embodied contributions
The unique value of the network's data layer is the continuous stream of embodied episodes contributed by validators and users. An episode is a recorded trajectory of observations, actions, and outcomes from a physical or simulated body.
Contributions are submitted by:
- Recording an episode with a registered embodiment.
- Encoding the episode in the LeRobot-compatible format.
- Computing the content hash and submitting a contribution claim on chain.
- Hosting the bytes (the contributor's responsibility) or providing them to a pinning service.
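The hash-and-claim steps above can be sketched as follows. The claim's field names are illustrative, not the wire format, and `hashlib.blake2b` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib

def make_claim(episode_bytes: bytes, contributor: str) -> dict:
    """Hash an encoded episode and build a contribution claim for
    on-chain submission. Hosting the bytes remains the contributor's job."""
    digest = hashlib.blake2b(episode_bytes, digest_size=32).hexdigest()
    return {
        "contributor": contributor,
        "hash": digest,          # content address of the episode
        "size_bytes": len(episode_bytes),
        "format": "episode",
    }
```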
Contributions are paid from the treasury at a per-byte and per-episode rate set by governance. Payment is contingent on the episode passing a quality filter: the episode must be playable, the action sequence must be consistent with the observations, and the outcome must be one of the recognized labels.
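The payment rule is a linear combination of the two governance-set rates, gated on the quality filter. The rate values below are hypothetical placeholders.

```python
def payout(size_bytes: int, passed_filter: bool,
           per_byte: float = 1e-6,     # hypothetical governance rate
           per_episode: float = 0.5    # hypothetical governance rate
           ) -> float:
    """Treasury payment for one contributed episode: zero unless the
    episode passed the quality filter, else per-byte plus per-episode."""
    if not passed_filter:
        return 0.0
    return per_byte * size_bytes + per_episode
```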
Quality filtering
Contributed data is filtered before entering the training corpus:
- Format check. The episode parses correctly and contains all required fields.
- Playback check. The episode can be re-rendered from its recorded form without error.
- Action-observation consistency. A small learned model checks whether the observed transitions are plausible given the recorded actions. Implausible episodes (likely fabricated or corrupted) are rejected.
- Duplicate check. Hash-based deduplication against the existing corpus.
- Sensitivity filter. Episodes containing personally identifiable information or other restricted content are excluded.
Episodes that pass all five checks enter the corpus; episodes that fail are reported back to the contributor with the failing check as the reason.
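The five checks run in order, short-circuiting on the first failure so the contributor gets a single reason. Each predicate below is a stand-in: the real playback, consistency, and sensitivity checks are substantial systems (the consistency check is a learned model), reduced here to fields on the episode dict.

```python
def quality_filter(episode: dict, corpus_hashes: set[str]) -> tuple[bool, str]:
    """Run the five quality checks in order; return (passed, reason)."""
    checks = [
        # 1. Format: required fields present and parseable.
        ("format", lambda e: all(k in e for k in
                                 ("obs", "actions", "outcome", "hash"))),
        # 2. Playback: re-renders from its recorded form without error.
        ("playback", lambda e: e.get("playable", False)),
        # 3. Action-observation consistency: learned plausibility score.
        ("consistency", lambda e: e.get("plausibility", 0.0) >= 0.5),
        # 4. Duplicate: hash-based dedup against the existing corpus.
        ("duplicate", lambda e: e["hash"] not in corpus_hashes),
        # 5. Sensitivity: no PII or other restricted content.
        ("sensitivity", lambda e: not e.get("contains_pii", False)),
    ]
    for name, ok in checks:
        if not ok(episode):
            return False, name
    return True, "ok"
```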
Right-to-be-forgotten
A user-bound write path includes a forget tag. Tagged entries are removable through a governance-approved removal request, which:
- Removes the entry from active manifests.
- Triggers a small fine-tuning job to reduce the trained model's memorization of the entry.
- Records the removal on chain.
Full unlearning is an unsolved problem. The codex's response is best-effort: removal is tractable for explicit references and approximate for diffuse influence, with no claim of cryptographic guarantee.
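The three removal steps can be sketched as one transaction-shaped function. All structures here are illustrative; in particular, the fine-tuning job is represented only as a queued record, since the unlearning itself is the best-effort part.

```python
def process_removal(entry_hash: str, manifests: list[dict]) -> dict:
    """Approved right-to-be-forgotten request: drop the tagged entry from
    active manifests, queue an unlearning fine-tune, record on chain."""
    # 1. Remove the entry from active manifests.
    for m in manifests:
        m["sources"] = [s for s in m["sources"] if s["hash"] != entry_hash]
    # 2. Queue a small fine-tuning job to reduce memorization (best-effort).
    finetune_job = {"objective": "reduce-memorization", "target": entry_hash}
    # 3. Record the removal on chain.
    chain_record = {"type": "removal", "hash": entry_hash}
    return {"removed": entry_hash,
            "finetune_job": finetune_job,
            "chain_record": chain_record}
```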
Storage incentives
Validators serving the storage role earn storage rewards proportional to:
- Bytes held (linear).
- Bytes served (linear, with per-request cap).
- Cache freshness (a learned policy that pays more for hot chunks).
The total storage budget is a fraction of treasury revenue set by governance. Reward distribution is via the standard attestation flow: a storage validator signs a proof of holding (a Merkle path against a random challenge) and the chain credits its account.
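One plausible reading of the three reward components is sketched below: linear in bytes held, linear in bytes served with each request capped, with the learned freshness policy collapsed to a single hot-chunk multiplier on the served term. The rates, the cap, and the placement of the multiplier are all assumptions, not the spec's formula.

```python
def storage_reward(bytes_held: int,
                   request_sizes: list[int],      # bytes served, per request
                   hotness_multiplier: float,     # learned freshness policy
                   rate_held: float = 1e-9,       # hypothetical rate
                   rate_served: float = 5e-9,     # hypothetical rate
                   per_request_cap: int = 64 * 1024 * 1024) -> float:
    """Storage reward for one attestation period: bytes held (linear)
    plus bytes served (linear, each request capped), the served term
    scaled by the hot-chunk multiplier."""
    served = sum(min(b, per_request_cap) for b in request_sizes)
    return rate_held * bytes_held + rate_served * served * hotness_multiplier
```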