Adding dbt Models¶

All analytics transformations in the Gnosis Analytics platform happen in dbt-cerebro, a dbt project containing approximately 400 SQL models. This guide covers everything you need to create new models: naming conventions, layer selection, materialization strategies, testing, and the PR workflow.

Naming Convention¶

Every model name follows a structured pattern:

{layer}_{module}_{entity}_{granularity}

Segment	Description	Examples
`layer`	Processing layer prefix	`stg`, `int`, `fct`, `api`
`module`	Domain module	`execution`, `consensus`, `p2p`, `bridges`, `esg`, `contracts`, `probelab`, `crawlers`
`entity`	What the model represents	`blocks`, `transactions`, `validators`, `peers`, `transfers`
`granularity`	Time aggregation level (optional)	`daily`, `weekly`, `monthly`

Examples:

Model Name	Layer	Module	Entity	Granularity
`stg_execution__blocks`	Staging	Execution	Blocks	--
`int_execution_blocks_clients_version_daily`	Intermediate	Execution	Block clients	Daily
`fct_consensus_attestation_rates`	Facts	Consensus	Attestation rates	--
`api_p2p_client_distribution_daily`	API	P2P	Client distribution	Daily

Staging model naming

Staging models use a double underscore __ to separate the source name from the table name: stg_{source}__{table}. This distinguishes the source system boundary from the rest of the name.

Model Layers¶

Models are organized into four layers, each with a specific purpose and materialization strategy. Data flows from staging through to the API layer.

Source Tables --> stg_* (Views) --> int_* (Incremental) --> fct_* (Views) --> api_* (Views)

Staging Layer (`stg_*`)¶

Staging models provide a clean interface over raw source tables. They apply minimal transformations: type casting, column renaming, and format standardization. No business logic belongs in this layer.

Property	Value
Materialization	`view`
Purpose	Clean raw data, normalize types and formats
References	`source()` only
Ratio	1:1 with source tables

-- models/execution/staging/stg_execution__blocks.sql
{{ config(materialized='view') }}

SELECT
    block_number,
    CONCAT('0x', author) AS author,
    block_timestamp,
    gas_used,
    gas_limit,
    base_fee_per_gas
FROM {{ source('execution', 'blocks') }}

Intermediate Layer (`int_*`)¶

Intermediate models contain the heavy transformation logic: joins, aggregations, window functions, and business calculations. This layer does the heavy lifting and produces the datasets that downstream layers reference.

Property	Value
Materialization	`incremental` (table)
Purpose	Business logic, joins, aggregations
References	`stg_` and other `int_` models
Strategy	`delete+insert` with monthly partitioning

-- models/consensus/intermediate/int_consensus_attestation_participation_daily.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        engine='ReplacingMergeTree()',
        order_by='(date)',
        unique_key='(date)',
        partition_by='toStartOfMonth(date)'
    )
}}

SELECT
    toDate(slot_timestamp) AS date,
    countIf(included = true) AS included_attestations,
    count() AS total_attestations,
    round(countIf(included = true) / count() * 100, 2) AS participation_rate
FROM {{ ref('stg_consensus__attestations') }}

{{ apply_monthly_incremental_filter('slot_timestamp', 'date') }}

GROUP BY date

Facts Layer (`fct_*`)¶

Fact models produce business-ready metrics and KPIs. They combine intermediate models and apply final calculations like percentages, rankings, or rolling averages.

Property	Value
Materialization	`view`
Purpose	Business-ready metrics, calculated KPIs
References	`int_*` models

-- models/consensus/marts/fct_consensus_attestation_rates.sql
{{ config(materialized='view') }}

SELECT
    date,
    included_attestations,
    total_attestations,
    participation_rate,
    avg(participation_rate) OVER (
        ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS participation_rate_7d_avg
FROM {{ ref('int_consensus_attestation_participation_daily') }}

API Layer (`api_*`)¶

API models are the final presentation layer designed for consumption by the REST API. They project a flat, denormalized structure optimized for JSON serialization. These models include the production and api:* tags that trigger auto-discovery by the API server.

Property	Value
Materialization	`view`
Purpose	REST API endpoint data
References	`fct_` and `int_` models
Tags	`production`, `api:*`, category, optional tier and granularity

-- models/consensus/marts/api_consensus_attestation_rates_daily.sql
{{
    config(
        materialized='view',
        tags=['production', 'consensus', 'tier1', 'api:attestation_rates', 'granularity:daily'],
        meta={
            "api": {
                "methods": ["GET"],
                "allow_unfiltered": false,
                "parameters": [
                    {"name": "start_date", "column": "date", "operator": ">=", "type": "date"},
                    {"name": "end_date", "column": "date", "operator": "<=", "type": "date"}
                ],
                "pagination": {"enabled": true, "default_limit": 100, "max_limit": 5000},
                "sort": [{"column": "date", "direction": "DESC"}]
            }
        }
    )
}}

SELECT
    date,
    included_attestations,
    total_attestations,
    participation_rate,
    participation_rate_7d_avg
FROM {{ ref('fct_consensus_attestation_rates') }}

Layer Dependencies¶

Models must only reference models in their own layer or the layer directly above:

Model Layer	Can Reference
`stg_*`	`source()` only
`int_*`	`stg_` and other `int_` models
`fct_*`	`int_*` models
`api_*`	`fct_` and `int_` models

No skip-layer references

An api_* model should not reference a stg_* model directly. If you need staging data in the API layer, create an intermediate model to bridge the gap. This keeps the dependency graph clean and ensures data passes through appropriate transformation steps.

Module Placement¶

Place your model files in the correct module directory:

models/
├── execution/           # Blocks, transactions, gas, contracts, tokens
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── consensus/           # Validators, attestations, proposals, blobs
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── bridges/             # Cross-chain bridge transfers and volume
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── p2p/                 # Peer discovery, client diversity, geography
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── contracts/           # Decoded smart contract data, protocol-specific
│   └── {protocol}/      # One subdirectory per protocol
├── ESG/                 # Energy, carbon, sustainability metrics
│   ├── staging/
│   ├── intermediate/
│   └── marts/
├── probelab/            # ProbeLab network performance metrics
│   └── ...
└── crawlers/            # Crawler data transformations
    └── ...

Choose the module based on the data domain. If your model bridges multiple domains (e.g., validator data with geographic enrichment), place it in the primary domain and reference models from other modules.

Incremental Strategy¶

Most intermediate models use ClickHouse's incremental materialization with the delete+insert strategy. This is the recommended approach for all time-series data.

Configuration Template¶

{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        engine='ReplacingMergeTree()',
        order_by='(date, dimension_column)',
        unique_key='(date, dimension_column)',
        partition_by='toStartOfMonth(date)'
    )
}}

SELECT
    toDate(timestamp_column) AS date,
    dimension_column,
    count() AS metric_value
FROM {{ ref('stg_source__table') }}

{{ apply_monthly_incremental_filter('timestamp_column', 'date') }}

GROUP BY date, dimension_column

How Incremental Processing Works¶

The apply_monthly_incremental_filter macro limits processing to only the months containing new data:

First run (full refresh): All historical data is processed
Incremental runs: Only months where new source data exists are reprocessed
Delete+Insert: Existing rows in affected partitions are deleted, then fresh rows are inserted

This approach is:

Idempotent -- Safe to re-run without creating duplicates
Efficient -- Only processes changed partitions, not the entire table
Correct -- Handles late-arriving data by reprocessing entire monthly partitions

Key Configuration Fields¶

Field	Purpose	Example
`engine`	ClickHouse table engine	`ReplacingMergeTree()`
`order_by`	Primary sort key (must match `unique_key`)	`(date, client_name)`
`unique_key`	Deduplication key	`(date, client_name)`
`partition_by`	Partitioning expression	`toStartOfMonth(date)`

Always use ReplacingMergeTree

Even though delete+insert handles deduplication at the dbt level, ReplacingMergeTree provides an additional safety net by deduplicating rows during ClickHouse background merges. This guards against edge cases where a run is interrupted mid-partition.

Testing¶

Schema Tests¶

Add tests in the schema.yml file alongside your model:

version: 2

models:
  - name: int_consensus_attestation_participation_daily
    description: Daily attestation participation rates
    columns:
      - name: date
        description: Calendar date (UTC)
        data_type: Date
        tests:
          - not_null
          - unique
      - name: participation_rate
        description: Percentage of attestations included in blocks
        data_type: Float64
        tests:
          - not_null
      - name: total_attestations
        description: Total number of attestations expected
        data_type: UInt64
        tests:
          - not_null

Common Test Types¶

Test	Purpose	Example
`not_null`	Column must not contain NULL values	Primary keys, dates, metrics
`unique`	Column values must be unique	Date columns in daily aggregations
`accepted_values`	Column must contain only specified values	Status fields, categories
`relationships`	Foreign key integrity	`ref()` columns linking to dimension tables

Data Tests¶

For more complex validation, create SQL-based data tests in tests/:

-- tests/assert_no_future_dates.sql
SELECT date
FROM {{ ref('int_consensus_attestation_participation_daily') }}
WHERE date > today()

This test passes if the query returns zero rows.

Running Tests¶

# Test a specific model
dbt test --select int_consensus_attestation_participation_daily

# Test all models in a module
dbt test --select consensus

# Test with upstream dependencies
dbt test --select +api_consensus_attestation_rates_daily

Documentation¶

Every model must have a schema.yml entry with:

Model description -- What data this model contains and its business purpose
Column descriptions -- Human-readable description for every column
Data types -- ClickHouse data type for each column
Tests -- At minimum not_null on primary key columns

version: 2

models:
  - name: api_consensus_attestation_rates_daily
    description: >
      Daily attestation participation rates for the Gnosis Chain consensus layer.
      Includes raw counts and rolling averages for trend analysis.
    columns:
      - name: date
        description: Calendar date (UTC)
        data_type: Date
        tests:
          - not_null
      - name: included_attestations
        description: Number of attestations successfully included in blocks
        data_type: UInt64
      - name: total_attestations
        description: Total expected attestations based on active validator count
        data_type: UInt64
      - name: participation_rate
        description: Percentage of attestations included (0-100)
        data_type: Float64
      - name: participation_rate_7d_avg
        description: 7-day rolling average of participation rate
        data_type: Float64

Use dbt-schema-gen

The dbt-schema-gen tool can automatically generate schema.yml files using LLM analysis of your SQL models. Run it against your new model to get a starting point, then refine the descriptions manually.

Tags¶

Required Tags for API Models¶

Tag	Purpose
`production`	Marks the model for API exposure
`api:{resource_name}`	Defines the URL resource segment
Category tag (e.g., `consensus`)	URL prefix and Swagger UI grouping

Optional Tags¶

Tag	Purpose	Default
`tier0` through `tier3`	Access control level	`tier0`
`granularity:{period}`	URL suffix for time dimension	None

Tags for Non-API Models¶

Intermediate and staging models do not need production or api:* tags. However, adding the module name as a tag helps with selective dbt runs:

{{ config(
    materialized='incremental',
    tags=['consensus']
) }}

This allows running all consensus models with dbt run --select tag:consensus.

PR Workflow¶

1. Create a Branch¶

git checkout -b feat/add-attestation-rates-model

2. Add Your Model Files¶

SQL model file in the appropriate models/{module}/{layer}/ directory
schema.yml entry with descriptions and tests

3. Test Locally¶

# Compile to check SQL syntax
dbt compile --select api_consensus_attestation_rates_daily

# Run the model
dbt run --select +api_consensus_attestation_rates_daily

# Run tests
dbt test --select api_consensus_attestation_rates_daily

# Preview results
dbt show --select api_consensus_attestation_rates_daily --limit 10

# Generate documentation
dbt docs generate

4. Submit a Pull Request¶

Include in your PR description:

What data the model exposes and its business purpose
The resulting API endpoint path and access tier
Sample query results (output from dbt show)
Any upstream model dependencies that were added or modified

5. After Merge¶

Once your PR is merged:

CI runs dbt build to deploy the model to ClickHouse
An updated manifest.json is published
The API auto-discovers the new endpoint within 5 minutes
The endpoint appears in the Swagger UI documentation

Next Steps¶

Adding API Endpoints -- Focus specifically on the meta.api configuration
meta.api Contract -- Full specification for API metadata
Model Layers -- Detailed layer reference
Conventions -- Naming and code style standards