Data Ingestion
The data ingestion layer extracts raw blockchain data from its sources and loads it into ClickHouse. Each indexer is purpose-built for a specific data source and runs as an independent containerized service.
This section covers all ingestion components: the execution-layer and consensus-layer indexers, the era file parser for historical backfills, the click-runner for external data sources, and the network crawlers that capture P2P topology.
Pipeline Architecture
graph LR
subgraph Sources
EL[Execution Layer<br/>RPC Node]
CL[Consensus Layer<br/>Beacon Node]
P2P[P2P Network<br/>DHT Peers]
EXT[External<br/>Ember, ProbeLab]
end
subgraph Indexers
CRYO[cryo-indexer]
BEACON[beacon-indexer]
ERA[era-parser]
CR[click-runner]
NEB[nebula]
IPC[ip-crawler]
end
subgraph Storage
CH[(ClickHouse Cloud)]
end
subgraph Transformation
DBT[dbt-cerebro<br/>~400 models]
end
subgraph Serving
API[REST API]
MCP[MCP / AI Tools]
DASH[Dashboards]
end
EL --> CRYO
CL --> BEACON
CL --> ERA
EXT --> CR
P2P --> NEB
NEB --> IPC
CRYO --> CH
BEACON --> CH
ERA --> CH
CR --> CH
NEB --> CH
IPC --> CH
CH --> DBT
DBT --> CH
CH --> API
API --> MCP
API --> DASH
Indexer Overview
| Indexer | Source | Target Database | Language | Key Capability |
|---|---|---|---|---|
| cryo-indexer | Execution layer RPC | execution | Python + Cryo (Rust) | Blocks, transactions, logs, traces, state diffs |
| beacon-indexer | Beacon node REST API | consensus | Python | Validators, attestations, sync committees |
| era-parser | Era archive files | consensus | Python | Historical beacon chain bulk loading |
| click-runner | CSV/Parquet/SQL | Various | Python | External data ingestion (Ember, ProbeLab) |
Supporting Components
| Component | Purpose |
|---|---|
| cryo-base | Docker base image with pre-compiled Cryo binary and custom patches |
Design Principles
All indexers in this layer follow common design principles:
Atomic processing -- Data is loaded in complete chunks. A range of blocks is either fully committed or not committed at all. Partial writes are avoided.
State tracking -- Each indexer maintains a state table in ClickHouse that records which ranges have been processed, enabling resumability and failure recovery.
Incremental operation -- Indexers support both historical backfill (bulk loading of past data) and continuous mode (following the chain tip in real time).
Containerized deployment -- Every indexer ships as a Docker image with Docker Compose configurations for straightforward deployment and orchestration.
Idempotency -- Reprocessing a range that has already been loaded produces the same result. This relies on ReplacingMergeTree engines and deduplication strategies in ClickHouse.
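The principles above can be illustrated with a minimal, self-contained sketch. It is not the indexers' actual code: `InMemoryStore`, `plan_ranges`, `run_indexer`, and `make_flaky_fetch` are hypothetical names, and an in-memory dictionary stands in for ClickHouse. The keyed upsert only approximates the end result of deduplicating on a key with ReplacingMergeTree. Block ranges are assumed to be half-open (`start` inclusive, `end` exclusive).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Set, Tuple

Range = Tuple[int, int]  # (start inclusive, end exclusive); an assumption of this sketch


def plan_ranges(start: int, end: int, chunk_size: int) -> Iterator[Range]:
    """Split [start, end) into fixed-size chunks; the last one may be shorter."""
    for lo in range(start, end, chunk_size):
        yield (lo, min(lo + chunk_size, end))


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for ClickHouse: a data table plus a state table."""
    rows: Dict[int, dict] = field(default_factory=dict)
    completed: Set[Range] = field(default_factory=set)

    def commit_chunk(self, rng: Range, rows: List[dict]) -> None:
        # Upsert by block number, so reloading a range cannot create duplicates.
        for row in rows:
            self.rows[row["block_number"]] = row
        # State tracking: record the range only after its rows are stored.
        self.completed.add(rng)


def run_indexer(store: InMemoryStore, fetch: Callable[[int], dict],
                start: int, end: int, chunk_size: int) -> int:
    """Process every range not yet recorded as completed; return chunks committed."""
    committed = 0
    for rng in plan_ranges(start, end, chunk_size):
        if rng in store.completed:
            continue  # resumability: skip work finished in an earlier run
        try:
            # Atomic processing: fetch the whole chunk before writing anything.
            rows = [fetch(n) for n in range(*rng)]
        except Exception:
            continue  # nothing was written for this chunk; a later run retries it
        store.commit_chunk(rng, rows)
        committed += 1
    return committed


def make_flaky_fetch(fail_once_on: Set[int]) -> Callable[[int], dict]:
    """Fake data source that fails once for each listed block, then succeeds."""
    pending = set(fail_once_on)

    def fetch(n: int) -> dict:
        if n in pending:
            pending.discard(n)
            raise ConnectionError(f"simulated failure at block {n}")
        return {"block_number": n, "hash": f"0x{n:064x}"}

    return fetch


if __name__ == "__main__":
    store = InMemoryStore()
    fetch = make_flaky_fetch({25})
    print(run_indexer(store, fetch, 0, 40, 10))  # 3: chunk (20, 30) failed as a whole
    print(run_indexer(store, fetch, 0, 40, 10))  # 1: only the failed chunk is retried
    print(len(store.rows))                       # 40
```

In this sketch a failure at block 25 leaves blocks 20 to 24 unwritten as well, because rows are committed only once the whole chunk has been fetched. A real indexer would additionally need the data write and the state-table update to be consistent with each other, which this in-memory model does not address.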