Data Ingestion

The data ingestion layer extracts raw blockchain data from execution-layer RPC nodes, consensus-layer beacon nodes, the P2P network, and external datasets, and loads it into ClickHouse. Each indexer is purpose-built for a specific data source and runs as an independent containerized service.

This section covers all ingestion components: the execution-layer and consensus-layer indexers, the era file parser for historical backfills, the click-runner for external data sources, and the network crawlers that capture P2P topology.

Pipeline Architecture

graph LR
    subgraph Sources
        EL[Execution Layer<br/>RPC Node]
        CL[Consensus Layer<br/>Beacon Node]
        P2P[P2P Network<br/>DHT Peers]
        EXT[External<br/>Ember, ProbeLab]
    end

    subgraph Indexers
        CRYO[cryo-indexer]
        BEACON[beacon-indexer]
        ERA[era-parser]
        CR[click-runner]
        NEB[nebula]
        IPC[ip-crawler]
    end

    subgraph Storage
        CH[(ClickHouse Cloud)]
    end

    subgraph Transformation
        DBT[dbt-cerebro<br/>~400 models]
    end

    subgraph Serving
        API[REST API]
        MCP[MCP / AI Tools]
        DASH[Dashboards]
    end

    EL --> CRYO
    CL --> BEACON
    CL --> ERA
    EXT --> CR
    P2P --> NEB
    NEB --> IPC

    CRYO --> CH
    BEACON --> CH
    ERA --> CH
    CR --> CH
    NEB --> CH
    IPC --> CH

    CH --> DBT
    DBT --> CH

    CH --> API
    API --> MCP
    API --> DASH

Indexer Overview

Indexer	Source	Target Database	Language	Key Capability
cryo-indexer	Execution layer RPC	`execution`	Python + Cryo (Rust)	Blocks, transactions, logs, traces, state diffs
beacon-indexer	Beacon node REST API	`consensus`	Python	Validators, attestations, sync committees
era-parser	Era archive files	`consensus`	Python	Historical beacon chain bulk loading
click-runner	CSV/Parquet/SQL	Various	Python	External data ingestion (Ember, ProbeLab)

Supporting Components

Component	Purpose
cryo-base	Docker base image with pre-compiled Cryo binary and custom patches

Design Principles

All indexers in this layer follow common design principles:

Atomic processing -- Data is loaded in complete chunks. A range of blocks is either fully committed or not committed at all. Partial writes are avoided.

State tracking -- Each indexer maintains a state table in ClickHouse that records which ranges have been processed, enabling resumability and failure recovery.

Incremental operation -- Indexers support both historical backfill (bulk loading of past data) and continuous mode (following the chain tip in real time).

Containerized deployment -- Every indexer ships as a Docker image with Docker Compose configurations for straightforward deployment and orchestration.

Idempotency -- Reprocessing a range that has already been loaded produces the same result. This relies on ReplacingMergeTree engines and deduplication strategies in ClickHouse.
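
The atomic-processing, state-tracking, and idempotency principles above can be illustrated with a minimal Python sketch. It is not taken from any of the indexers: `InMemoryStore`, `fetch_block`, and `process_range` are hypothetical names, and an in-memory dictionary stands in for ClickHouse so the example runs without a database. Keying rows by block number only loosely imitates the end result of ReplacingMergeTree deduplication.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for ClickHouse: a data table plus a state table."""

    rows: dict[int, dict] = field(default_factory=dict)
    completed_ranges: set[tuple[int, int]] = field(default_factory=set)

    def commit_chunk(self, start: int, end: int, staged: list[dict]) -> None:
        # Rows are keyed by block number, so writing a key twice replaces the
        # earlier row instead of duplicating it (assumed dedup behaviour).
        for row in staged:
            self.rows[row["block_number"]] = row
        # The range is recorded as done only after every row has been written.
        self.completed_ranges.add((start, end))


def fetch_block(number: int) -> dict:
    """Hypothetical source call; a real indexer would query a node here."""
    return {"block_number": number, "hash": f"0x{number:064x}"}


def process_range(store: InMemoryStore, start: int, end: int, fetch=fetch_block) -> bool:
    """Load blocks in [start, end) as one chunk.

    Returns False when the state table says the range is already processed.
    """
    if (start, end) in store.completed_ranges:
        return False  # state tracking: finished work is skipped on restart

    # Stage the whole chunk before touching the store. If any fetch raises,
    # nothing has been written and the range stays unrecorded.
    staged = [fetch(n) for n in range(start, end)]
    store.commit_chunk(start, end, staged)
    return True


store = InMemoryStore()
process_range(store, 0, 100)   # first run loads the chunk
process_range(store, 0, 100)   # second run is a no-op thanks to the state table
```

Staging a chunk in memory and recording the range last is one simple way to keep a failed chunk invisible to later runs. How each indexer actually stages and commits data is not described here, so treat this only as an illustration of the principles.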