Pipeline Overview¶

This document describes the end-to-end data pipeline that powers the Gnosis Analytics platform. The pipeline collects raw blockchain data from multiple sources, stores it in ClickHouse Cloud, transforms it through layered dbt models, and serves it through a REST API.

Data Sources¶

The pipeline ingests data from four categories of sources.

Execution Layer¶

The execution layer provides transaction-level blockchain data. The cryo-indexer connects to a Gnosis Chain RPC node and extracts:

Blocks -- headers, timestamps, gas usage, withdrawals
Transactions -- sender, receiver, value, gas, input data, status
Logs -- smart contract event emissions
Traces -- internal call trees and execution traces
Contracts -- contract creation events
Native transfers -- xDAI transfers (including internal)
State diffs -- balance, nonce, code, and storage changes

Consensus Layer¶

The consensus layer provides validator and attestation data. Two tools handle this:

beacon-indexer -- fetches live data from the beacon node REST API, covering blocks, validators, attestations, sync committees, and rewards
era-parser -- parses compressed historical era archive files (8192 slots per era), extracting all beacon chain data types across all forks from Phase 0 through Electra

Peer-to-Peer Network¶

Network topology data is collected by:

nebula -- a DHT crawler that discovers peers, records reachability, and captures client metadata (agent versions, protocols, fork digests)
ip-crawler -- enriches discovered peer IPs with geolocation data from ipinfo.io (city, country, ASN, organization)

External Sources¶

Additional datasets are ingested via click-runner:

Ember -- global electricity generation data (for ESG carbon footprint calculations)
ProbeLab -- daily P2P network metrics (agent versions, peer counts, crawl statistics)
Administrative queries -- schema migrations and maintenance SQL

Storage Architecture¶

All data lands in ClickHouse Cloud, organized across dedicated databases:

Database	Contents	Primary Indexer
`execution`	Blocks, transactions, logs, traces, contracts, state diffs	cryo-indexer
`consensus`	Beacon blocks, validators, attestations, sync committees, rewards	beacon-indexer, era-parser
`nebula`	Peer visits, peer metadata, crawl sessions	nebula
`crawlers_data`	IP geolocation, ProbeLab metrics	ip-crawler, click-runner
`dbt`	All transformed models (staging, intermediate, facts, API views)	dbt-cerebro

ClickHouse was chosen for its columnar storage, high compression ratios, and fast analytical query performance. Tables use ReplacingMergeTree engines with monthly partitioning for efficient incremental processing and deduplication.

Transformation Layer¶

Raw data is transformed by dbt-cerebro, a dbt project containing approximately 400 SQL models organized into eight domain modules. The transformation follows a strict layered architecture:

Raw Tables (execution.blocks, consensus.blocks, ...)
    |
    v
Staging (stg_*) -- Light cleanup, type casting, column renaming. Materialized as views.
    |
    v
Intermediate (int_*) -- Business logic, joins, aggregations. Materialized as incremental tables.
    |
    v
Facts (fct_*) -- Business-ready metrics and KPIs. Materialized as views.
    |
    v
API (api_*) -- Optimized for REST API consumption. Materialized as views.

Key transformation capabilities include:

Incremental processing using delete+insert strategy with monthly partitions
Contract ABI decoding that converts raw transaction input data and event logs into human-readable function calls and events
Cross-layer joins linking execution transactions to consensus proposers
Time-series aggregation at daily, weekly, and monthly grains

Serving Layer¶

Transformed data is consumed through:

REST API (cerebro-api) -- serves api_* model data as HTTP endpoints, with auto-discovery from the dbt manifest
MCP Tools -- AI-powered natural language interface for querying Gnosis Chain analytics
Dashboards -- visualization layer built on the API endpoints

Data Freshness¶

Data Type	Typical Latency	Update Mechanism
Execution layer (blocks, txs)	~2 minutes	cryo-indexer continuous mode (10s poll, 12 block confirmations)
Consensus layer (validators)	~5 minutes	beacon-indexer real-time sync
P2P network topology	~30 minutes	nebula crawl cycles
IP geolocation	~1 hour	ip-crawler continuous mode (60s batch intervals)
dbt transformations	Varies by model	Scheduled dbt runs
External sources (Ember, ProbeLab)	Daily	click-runner cron jobs