Data Modeling
This section covers everything that turns raw blockchain data into analytics-ready datasets:
- Data Transformation -- The dbt projects that build ~400 SQL models from raw ClickHouse tables
- dbt Model Catalog -- Reference for all models across execution, consensus, bridges, P2P, contracts, ESG, crawlers, and ProbeLab
Data Transformation
The transformation layer converts raw blockchain data in ClickHouse into analytics-ready datasets using dbt (data build tool). Two projects handle this:
- dbt-cerebro -- the core transformation project containing ~400 SQL models organized across eight domain modules
- dbt-schema-gen -- an LLM-powered tool that generates and maintains schema.yml documentation files for dbt models
Components
| Component | Purpose |
|---|---|
| dbt-cerebro | Core dbt project: ~400 models, 8 modules, incremental processing |
| Model Layers | Explanation of the stg_* / int_* / fct_* / api_* naming convention and materialization strategy |
| Modules Reference | The 8 domain modules with model counts, key models, and descriptions |
| ABI Decoding | Contract ABI decoding system for converting raw transaction data into human-readable function calls and events |
| dbt-schema-gen | LLM-powered schema documentation generator |
Transformation Philosophy
The transformation layer follows several key principles:
Layered architecture -- Data flows through four distinct layers (staging, intermediate, facts, API), each with a clear purpose and materialization strategy. This separation enables independent testing and selective rebuilds.
Incremental processing -- Large tables are materialized as incremental models using ClickHouse's ReplacingMergeTree engine with the delete+insert strategy and monthly partitioning. This avoids scanning and rebuilding full tables on every run.
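The effect of a delete+insert run can be illustrated with a small in-memory sketch. This is only a simplified model of the semantics, not how dbt or ClickHouse implement it; the row shape (`day`, `tx_count`), the helper names, and the sample values are all hypothetical.

```python
from datetime import date


def month_partition(d: date) -> str:
    """Partition key in YYYYMM form, mimicking monthly partitioning."""
    return f"{d.year:04d}{d.month:02d}"


def delete_insert(table, batch, unique_key):
    """Return the table contents after a delete+insert of `batch`.

    Rows whose unique key appears in the batch are dropped first, then the
    batch is appended, so reprocessed rows replace their older versions.
    """
    incoming = {row[unique_key] for row in batch}
    kept = [row for row in table if row[unique_key] not in incoming]
    return kept + batch


# Hypothetical daily aggregate already present in the target table.
table = [
    {"day": date(2024, 1, 30), "tx_count": 100},
    {"day": date(2024, 1, 31), "tx_count": 90},
]

# A new run reprocesses 31 January and adds 1 February.
batch = [
    {"day": date(2024, 1, 31), "tx_count": 95},
    {"day": date(2024, 2, 1), "tx_count": 120},
]

table = delete_insert(table, batch, unique_key="day")
partitions = sorted({month_partition(row["day"]) for row in table})
```

Only the rows covered by the batch are rewritten; the 30 January row is never touched, which is the property that keeps incremental runs cheap on large tables.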
Source of truth -- Raw tables in the execution, consensus, nebula, and crawlers_data databases remain untouched. All transformations produce new tables in the dbt database.
Convention over configuration -- Strict naming conventions (stg_, int_, fct_, api_) make models self-documenting and enable automated API endpoint discovery.
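As a rough sketch of how prefix conventions can drive automation, the snippet below maps model names to layers and picks out the `api_` models as endpoint candidates. The function names and the example model names are hypothetical and are not taken from dbt-cerebro; the actual discovery mechanism may work differently.

```python
from typing import Iterable, List, Optional

# Prefix-to-layer mapping based on the four documented prefixes.
LAYER_PREFIXES = {
    "stg_": "staging",
    "int_": "intermediate",
    "fct_": "facts",
    "api_": "api",
}


def layer_of(model_name: str) -> Optional[str]:
    """Return the layer implied by the model's prefix, or None if it has none."""
    for prefix, layer in LAYER_PREFIXES.items():
        if model_name.startswith(prefix):
            return layer
    return None


def api_models(model_names: Iterable[str]) -> List[str]:
    """Return the models that follow the api_ convention, sorted by name."""
    return sorted(name for name in model_names if layer_of(name) == "api")


# Hypothetical model names used only for illustration.
models = [
    "stg_execution__blocks",
    "int_execution_blocks_daily",
    "fct_execution_blocks_daily",
    "api_execution_blocks_daily",
    "helper_table",
]

endpoint_candidates = api_models(models)
```

Because the layer is encoded in the name itself, a tool like this needs no extra metadata to decide which models to expose.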