Data Modeling

This section covers everything that turns raw blockchain data into analytics-ready datasets:

  • Data Transformation -- The dbt projects that build ~400 SQL models from raw ClickHouse tables
  • dbt Model Catalog -- Reference for all models across execution, consensus, bridges, P2P, contracts, ESG, crawlers, and ProbeLab

Data Transformation

The transformation layer converts raw blockchain data in ClickHouse into analytics-ready datasets using dbt (data build tool). Two projects handle this:

  • dbt-cerebro -- the core transformation project containing ~400 SQL models organized across eight domain modules
  • dbt-schema-gen -- an LLM-powered tool that generates and maintains schema.yml documentation files for dbt models

Components

Component          Purpose
dbt-cerebro        Core dbt project: ~400 models, 8 modules, incremental processing
Model Layers       Explanation of the stg_* / int_* / fct_* / api_* naming convention and materialization strategy
Modules Reference  The 8 domain modules with model counts, key models, and descriptions
ABI Decoding       Contract ABI decoding system for converting raw transaction data into human-readable function calls and events
dbt-schema-gen     LLM-powered schema documentation generator

Transformation Philosophy

The transformation layer follows several key principles:

Layered architecture -- Data flows through four distinct layers (staging, intermediate, facts, API), each with a clear purpose and materialization strategy. This separation enables independent testing and selective rebuilds.

Incremental processing -- Large tables are materialized as incremental models on ClickHouse's ReplacingMergeTree engine, using the delete+insert strategy and monthly partitioning. Each run therefore processes only new or changed data instead of rebuilding the whole table.
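
The delete+insert idea can be illustrated with a minimal sketch in plain Python. This is not how dbt or ClickHouse implement it; the function names, the row layout, the `id` unique key, and the `YYYY-MM` partition label are all assumptions made for illustration. It only shows the semantics: rows whose key reappears in the new batch are dropped, the batch is appended, and only the months present in the batch are touched.

```python
from datetime import date


def month_partition(day: date) -> str:
    """Return a monthly partition label such as '2024-02' (illustrative format)."""
    return f"{day.year:04d}-{day.month:02d}"


def delete_insert(table, batch, unique_key="id"):
    """Apply one incremental run to an in-memory table.

    Rows whose unique key reappears in the batch are deleted,
    then every row of the batch is inserted.
    """
    incoming = {row[unique_key] for row in batch}
    kept = [row for row in table if row[unique_key] not in incoming]
    return kept + list(batch)


# Hypothetical existing table contents.
table = [
    {"id": 1, "day": date(2024, 1, 15), "value": 10},
    {"id": 2, "day": date(2024, 2, 3), "value": 20},
]

# Hypothetical batch produced by the latest run.
batch = [
    {"id": 2, "day": date(2024, 2, 3), "value": 25},  # corrected row
    {"id": 3, "day": date(2024, 2, 9), "value": 30},  # new row
]

table = delete_insert(table, batch)

# Months affected by this run; older partitions are left alone.
touched = sorted({month_partition(row["day"]) for row in batch})
```

In this sketch the January row is never rewritten, which is the property that keeps incremental runs cheap on large tables.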

Source of truth -- Raw tables in the execution, consensus, nebula, and crawlers_data databases remain untouched. All transformations produce new tables in the dbt database.

Convention over configuration -- Strict naming conventions (stg_, int_, fct_, api_) make models self-documenting and enable automated API endpoint discovery.
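
As a rough illustration of why the prefixes help automation, the sketch below maps a model name to its layer and picks out the API-layer models. The helper names and the example model names are hypothetical, and the actual endpoint discovery mechanism may work differently.

```python
# Prefix-to-layer mapping taken from the naming convention described above.
LAYERS = {
    "stg_": "staging",
    "int_": "intermediate",
    "fct_": "facts",
    "api_": "API",
}


def layer_of(model_name):
    """Return the layer implied by the model's prefix, or None if it has none."""
    for prefix, layer in LAYERS.items():
        if model_name.startswith(prefix):
            return layer
    return None


def api_models(model_names):
    """Return the sorted names of models that belong to the API layer."""
    return sorted(name for name in model_names if layer_of(name) == "API")


# Hypothetical model names, for illustration only.
models = [
    "stg_example_blocks",
    "int_example_blocks_daily",
    "fct_example_blocks_daily",
    "api_example_blocks_daily",
    "scratch_table",
]

discovered = api_models(models)
```

Because the layer is encoded in the name, a tool only needs the list of model names to decide which ones to expose.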