Adding dbt Models¶
All analytics transformations in the Gnosis Analytics platform happen in dbt-cerebro, a dbt project containing approximately 400 SQL models. This guide covers everything you need to create new models: naming conventions, layer selection, materialization strategies, testing, and the PR workflow.
Naming Convention¶
Every model name follows a structured pattern:
| Segment | Description | Examples |
|---|---|---|
layer | Processing layer prefix | stg, int, fct, api |
module | Domain module | execution, consensus, p2p, bridges, esg, contracts, probelab, crawlers |
entity | What the model represents | blocks, transactions, validators, peers, transfers |
granularity | Time aggregation level (optional) | daily, weekly, monthly |
Examples:
| Model Name | Layer | Module | Entity | Granularity |
|---|---|---|---|---|
stg_execution__blocks | Staging | Execution | Blocks | -- |
int_execution_blocks_clients_version_daily | Intermediate | Execution | Block clients | Daily |
fct_consensus_attestation_rates | Facts | Consensus | Attestation rates | -- |
api_p2p_client_distribution_daily | API | P2P | Client distribution | Daily |
Staging model naming
Staging models use a double underscore __ to separate the source name from the table name: stg_{source}__{table}. This distinguishes the source system boundary from the rest of the name.
Model Layers¶
Models are organized into four layers, each with a specific purpose and materialization strategy. Data flows from staging through to the API layer.
Staging Layer (stg_*)¶
Staging models provide a clean interface over raw source tables. They apply minimal transformations: type casting, column renaming, and format standardization. No business logic belongs in this layer.
| Property | Value |
|---|---|
| Materialization | view |
| Purpose | Clean raw data, normalize types and formats |
| References | source() only |
| Ratio | 1:1 with source tables |
-- models/execution/staging/stg_execution__blocks.sql
{{ config(materialized='view') }}
SELECT
block_number,
CONCAT('0x', author) AS author,
block_timestamp,
gas_used,
gas_limit,
base_fee_per_gas
FROM {{ source('execution', 'blocks') }}
Intermediate Layer (int_*)¶
Intermediate models contain the heavy transformation logic: joins, aggregations, window functions, and business calculations. This layer does the heavy lifting and produces the datasets that downstream layers reference.
| Property | Value |
|---|---|
| Materialization | incremental (table) |
| Purpose | Business logic, joins, aggregations |
| References | stg_* and other int_* models |
| Strategy | delete+insert with monthly partitioning |
-- models/consensus/intermediate/int_consensus_attestation_participation_daily.sql
{{
config(
materialized='incremental',
incremental_strategy='delete+insert',
engine='ReplacingMergeTree()',
order_by='(date)',
unique_key='(date)',
partition_by='toStartOfMonth(date)'
)
}}
SELECT
toDate(slot_timestamp) AS date,
countIf(included = true) AS included_attestations,
count() AS total_attestations,
round(countIf(included = true) / count() * 100, 2) AS participation_rate
FROM {{ ref('stg_consensus__attestations') }}
{{ apply_monthly_incremental_filter('slot_timestamp', 'date') }}
GROUP BY date
Facts Layer (fct_*)¶
Fact models produce business-ready metrics and KPIs. They combine intermediate models and apply final calculations like percentages, rankings, or rolling averages.
| Property | Value |
|---|---|
| Materialization | view |
| Purpose | Business-ready metrics, calculated KPIs |
| References | int_* models |
-- models/consensus/marts/fct_consensus_attestation_rates.sql
{{ config(materialized='view') }}
SELECT
date,
included_attestations,
total_attestations,
participation_rate,
avg(participation_rate) OVER (
ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS participation_rate_7d_avg
FROM {{ ref('int_consensus_attestation_participation_daily') }}
API Layer (api_*)¶
API models are the final presentation layer designed for consumption by the REST API. They project a flat, denormalized structure optimized for JSON serialization. These models include the production and api:* tags that trigger auto-discovery by the API server.
| Property | Value |
|---|---|
| Materialization | view |
| Purpose | REST API endpoint data |
| References | fct_* and int_* models |
| Tags | production, api:*, category, optional tier and granularity |
-- models/consensus/marts/api_consensus_attestation_rates_daily.sql
{{
config(
materialized='view',
tags=['production', 'consensus', 'tier1', 'api:attestation_rates', 'granularity:daily'],
meta={
"api": {
"methods": ["GET"],
"allow_unfiltered": false,
"parameters": [
{"name": "start_date", "column": "date", "operator": ">=", "type": "date"},
{"name": "end_date", "column": "date", "operator": "<=", "type": "date"}
],
"pagination": {"enabled": true, "default_limit": 100, "max_limit": 5000},
"sort": [{"column": "date", "direction": "DESC"}]
}
}
)
}}
SELECT
date,
included_attestations,
total_attestations,
participation_rate,
participation_rate_7d_avg
FROM {{ ref('fct_consensus_attestation_rates') }}
Layer Dependencies¶
Models must only reference models in their own layer or the layer directly above:
| Model Layer | Can Reference |
|---|---|
stg_* | source() only |
int_* | stg_* and other int_* models |
fct_* | int_* models |
api_* | fct_* and int_* models |
No skip-layer references
An api_* model should not reference a stg_* model directly. If you need staging data in the API layer, create an intermediate model to bridge the gap. This keeps the dependency graph clean and ensures data passes through appropriate transformation steps.
Module Placement¶
Place your model files in the correct module directory:
models/
├── execution/ # Blocks, transactions, gas, contracts, tokens
│ ├── staging/
│ ├── intermediate/
│ └── marts/
├── consensus/ # Validators, attestations, proposals, blobs
│ ├── staging/
│ ├── intermediate/
│ └── marts/
├── bridges/ # Cross-chain bridge transfers and volume
│ ├── staging/
│ ├── intermediate/
│ └── marts/
├── p2p/ # Peer discovery, client diversity, geography
│ ├── staging/
│ ├── intermediate/
│ └── marts/
├── contracts/ # Decoded smart contract data, protocol-specific
│ └── {protocol}/ # One subdirectory per protocol
├── ESG/ # Energy, carbon, sustainability metrics
│ ├── staging/
│ ├── intermediate/
│ └── marts/
├── probelab/ # ProbeLab network performance metrics
│ └── ...
└── crawlers/ # Crawler data transformations
└── ...
Choose the module based on the data domain. If your model bridges multiple domains (e.g., validator data with geographic enrichment), place it in the primary domain and reference models from other modules.
Incremental Strategy¶
Most intermediate models use ClickHouse's incremental materialization with the delete+insert strategy. This is the recommended approach for all time-series data.
Configuration Template¶
{{
config(
materialized='incremental',
incremental_strategy='delete+insert',
engine='ReplacingMergeTree()',
order_by='(date, dimension_column)',
unique_key='(date, dimension_column)',
partition_by='toStartOfMonth(date)'
)
}}
SELECT
toDate(timestamp_column) AS date,
dimension_column,
count() AS metric_value
FROM {{ ref('stg_source__table') }}
{{ apply_monthly_incremental_filter('timestamp_column', 'date') }}
GROUP BY date, dimension_column
How Incremental Processing Works¶
The apply_monthly_incremental_filter macro limits processing to only the months containing new data:
- First run (full refresh): All historical data is processed
- Incremental runs: Only months where new source data exists are reprocessed
- Delete+Insert: Existing rows in affected partitions are deleted, then fresh rows are inserted
This approach is:
- Idempotent -- Safe to re-run without creating duplicates
- Efficient -- Only processes changed partitions, not the entire table
- Correct -- Handles late-arriving data by reprocessing entire monthly partitions
Key Configuration Fields¶
| Field | Purpose | Example |
|---|---|---|
engine | ClickHouse table engine | ReplacingMergeTree() |
order_by | Primary sort key (must match unique_key) | (date, client_name) |
unique_key | Deduplication key | (date, client_name) |
partition_by | Partitioning expression | toStartOfMonth(date) |
Always use ReplacingMergeTree
Even though delete+insert handles deduplication at the dbt level, ReplacingMergeTree provides an additional safety net by deduplicating rows during ClickHouse background merges. This guards against edge cases where a run is interrupted mid-partition.
Testing¶
Schema Tests¶
Add tests in the schema.yml file alongside your model:
version: 2
models:
- name: int_consensus_attestation_participation_daily
description: Daily attestation participation rates
columns:
- name: date
description: Calendar date (UTC)
data_type: Date
tests:
- not_null
- unique
- name: participation_rate
description: Percentage of attestations included in blocks
data_type: Float64
tests:
- not_null
- name: total_attestations
description: Total number of attestations expected
data_type: UInt64
tests:
- not_null
Common Test Types¶
| Test | Purpose | Example |
|---|---|---|
not_null | Column must not contain NULL values | Primary keys, dates, metrics |
unique | Column values must be unique | Date columns in daily aggregations |
accepted_values | Column must contain only specified values | Status fields, categories |
relationships | Foreign key integrity | ref() columns linking to dimension tables |
Data Tests¶
For more complex validation, create SQL-based data tests in tests/:
-- tests/assert_no_future_dates.sql
SELECT date
FROM {{ ref('int_consensus_attestation_participation_daily') }}
WHERE date > today()
This test passes if the query returns zero rows.
Running Tests¶
# Test a specific model
dbt test --select int_consensus_attestation_participation_daily
# Test all models in a module
dbt test --select consensus
# Test with upstream dependencies
dbt test --select +api_consensus_attestation_rates_daily
Documentation¶
Every model must have a schema.yml entry with:
- Model description -- What data this model contains and its business purpose
- Column descriptions -- Human-readable description for every column
- Data types -- ClickHouse data type for each column
- Tests -- At minimum
not_nullon primary key columns
version: 2
models:
- name: api_consensus_attestation_rates_daily
description: >
Daily attestation participation rates for the Gnosis Chain consensus layer.
Includes raw counts and rolling averages for trend analysis.
columns:
- name: date
description: Calendar date (UTC)
data_type: Date
tests:
- not_null
- name: included_attestations
description: Number of attestations successfully included in blocks
data_type: UInt64
- name: total_attestations
description: Total expected attestations based on active validator count
data_type: UInt64
- name: participation_rate
description: Percentage of attestations included (0-100)
data_type: Float64
- name: participation_rate_7d_avg
description: 7-day rolling average of participation rate
data_type: Float64
Use dbt-schema-gen
The dbt-schema-gen tool can automatically generate schema.yml files using LLM analysis of your SQL models. Run it against your new model to get a starting point, then refine the descriptions manually.
Tags¶
Required Tags for API Models¶
| Tag | Purpose |
|---|---|
production | Marks the model for API exposure |
api:{resource_name} | Defines the URL resource segment |
Category tag (e.g., consensus) | URL prefix and Swagger UI grouping |
Optional Tags¶
| Tag | Purpose | Default |
|---|---|---|
tier0 through tier3 | Access control level | tier0 |
granularity:{period} | URL suffix for time dimension | None |
Tags for Non-API Models¶
Intermediate and staging models do not need production or api:* tags. However, adding the module name as a tag helps with selective dbt runs:
This allows running all consensus models with dbt run --select tag:consensus.
PR Workflow¶
1. Create a Branch¶
2. Add Your Model Files¶
- SQL model file in the appropriate
models/{module}/{layer}/directory schema.ymlentry with descriptions and tests
3. Test Locally¶
# Compile to check SQL syntax
dbt compile --select api_consensus_attestation_rates_daily
# Run the model
dbt run --select +api_consensus_attestation_rates_daily
# Run tests
dbt test --select api_consensus_attestation_rates_daily
# Preview results
dbt show --select api_consensus_attestation_rates_daily --limit 10
# Generate documentation
dbt docs generate
4. Submit a Pull Request¶
Include in your PR description:
- What data the model exposes and its business purpose
- The resulting API endpoint path and access tier
- Sample query results (output from
dbt show) - Any upstream model dependencies that were added or modified
5. After Merge¶
Once your PR is merged:
- CI runs
dbt buildto deploy the model to ClickHouse - An updated
manifest.jsonis published - The API auto-discovers the new endpoint within 5 minutes
- The endpoint appears in the Swagger UI documentation
Next Steps¶
- Adding API Endpoints -- Focus specifically on the
meta.apiconfiguration - meta.api Contract -- Full specification for API metadata
- Model Layers -- Detailed layer reference
- Conventions -- Naming and code style standards