Skip to content

Architecture

File layout

The semantic layer lives in two places in the dbt-cerebro repo:

dbt-cerebro/
├── models/                     -- dbt models (sources, staging, int_, fct_, api_)
│   ├── shared/marts/dim_time_spine_{daily,weekly,monthly}.sql
│   ├── revenue/marts/fct_revenue_per_user_{weekly,monthly}.sql
│   ├── execution/gpay/marts/fct_execution_gpay_users_distinct.sql
│   ├── execution/gnosis_app/marts/fct_execution_gnosis_app_users_distinct.sql
│   └── execution/Circles/marts/fct_execution_circles_human_avatars_distinct.sql
└── semantic/                   -- authoring + relationships, NOT dbt models
    ├── authoring/
    │   ├── consensus/semantic_models.yml
    │   ├── execution/{cow,gpay,gnosis_app,lending,Circles,...}/semantic_models.yml
    │   ├── revenue/semantic_models.yml
    │   ├── bridges/semantic_models.yml
    │   ├── mixpanel_ga/semantic_models.yml
    │   └── shared/semantic_models.yml
    ├── relationships/
    │   ├── time_spines.yml          -- cross-grain bridges
    │   ├── user_pseudonym.yml       -- cross-sector user joins
    │   ├── execution_graph.yml      -- entity-graph joins (Circles, Safe, validator)
    │   └── execution_transactions.yml
    └── overrides/
        └── defaults.yml             -- metric aliases, docs enrichments

The compiled output lives in target/semantic_registry.json (one of the artifacts produced by dbt run / dbt compile + the post-build script). The MCP server reads this artifact remotely from GitHub Pages.

What each piece does

models/ — dbt SQL

The actual SQL that materializes data in ClickHouse. The semantic layer adds zero SQL — it only references models that already exist. The exception is the small set of user-keyed marts and time spines that were created specifically as cross-sector entry points:

  • dim_time_spine_{daily,weekly,monthly} — reference time dimensions
  • fct_revenue_per_user_{weekly,monthly} — pseudonym-keyed per-user projections of int_revenue_fees_{weekly,monthly}_per_user
  • fct_execution_gpay_users_distinct — deduped int_execution_gpay_safe_identities
  • fct_execution_gnosis_app_users_distinct — flag-enriched projection of int_execution_gnosis_app_user_identities
  • fct_execution_circles_human_avatars_distinct — Human-filtered projection of api_execution_circles_v2_avatar_metadata

Everything else in models/ predates the semantic-layer work.

semantic/authoring/<module>/semantic_models.yml

One file per module. Two top-level blocks:

semantic_models:
  - name: <model_or_alias>
    model: ref('<dbt_model>')
    entities:
      - name: user_pseudonym       # for cross-sector joins
        type: primary
        expr: toString(user_pseudonym)
    dimensions:
      - name: week
        type: time
        expr: week                 # may include derivations: addDays(week, 1) etc.
        type_params:
          time_granularity: week
      - name: protocol
        type: categorical
        expr: label                # the underlying SQL column
    measures:
      - name: lending_deposits_volume_value    # MUST be globally unique
        agg: sum                                # see `_AGG_TO_CLICKHOUSE` in cerebro-mcp
        expr: deposits_volume_weekly
    config:
      meta:
        cerebro:
          owner: analytics_team
          quality_tier: approved   # approved / candidate / blocked
          grain: week
          question_synonyms:
            - lending deposits weekly
            - aave deposits

metrics:
  - name: lending_deposits_volume_weekly       # the metric ID callers use
    label: Lending Deposits Volume (Weekly)
    description: >
      Sum of weekly deposit volume per (protocol, token).
    type: simple
    type_params:
      measure: lending_deposits_volume_value
    config:
      meta:
        cerebro:
          quality_tier: approved
          grain: week
          owner: analytics_team
          allowed_dimensions: [week, symbol, token_class, protocol]
          supported_time_grains: [week, month]
          question_synonyms:
            - lending deposits weekly
            - aave deposits weekly

A few invariants the build enforces (see maintenance):

  • Measure names are globally unique. Two value_value measures in two different semantic_models are an ambiguous_measure_binding error.
  • Metrics resolve deterministically. build_metrics() keys measure_to_models as a sorted-first dict so the registry is reproducible even when an ambiguity exists.
  • A metric's root model must be semantic_status: approved for the metric to be executable. Common gotcha: you add a candidate semantic_model and an approved metric pointing at it — the metric appears in discover_metrics but query_metrics rejects it.

semantic/relationships/*.yml

Declare cross-sector joins. The MCP planner reads these to find paths between metric roots. Three kinds in production today:

File Axis Purpose
time_spines.yml day / week / month Cross-grain composition. Every weekly mart joins to dim_time_spine_weekly.
user_pseudonym.yml user_pseudonym Cross-sector user-overlap. 4-node graph (revenue × gpay × gnosis_app × circles).
execution_graph.yml circles_avatar, safe, validator, ... Entity-specific joins (Circles trust graph, GP wallet ↔ Safe owner, validator withdrawal address ↔ Safe).

Relationship shape:

relationships:
  - name: revenue_per_user_weekly_to_gpay_identity
    left_model: fct_revenue_per_user_weekly
    right_model: fct_execution_gpay_users_distinct
    left_keys: [user_pseudonym]
    right_keys: [user_pseudonym]
    cardinality: many_to_many
    join_semantics: inner
    via_entity: user_pseudonym
    preferred_bridge: true
    safe_for_dimension_enrichment: true
    aggregate_then_join_only: false
    allow_any_join: false
    quality_tier: approved

The fields the planner actually uses:

  • left_model / right_model / left_keys / right_keys — join condition.
  • via_entity — the entity axis. Edges sharing a via_entity form a reachability graph.
  • cardinality + join_semantics — affects which join type the planner emits (INNER vs LEFT).
  • preferred_bridge — used for cost weighting in find_safest_path (cerebro-mcp semantic_graph.py).
  • quality_tier — only approved relationships are eligible for the planner's path search. Lets you stage new joins as candidate.

target/semantic_registry.json

The build artifact. Schema:

{
  "metadata": { "manifest_hash": "...", "catalog_hash": "..." },
  "models": {
    "<model_name>": {
      "name": "...",
      "semantic_status": "approved" | "candidate" | "docs_only",
      "dimensions": [...],
      "measures": [...],
      "module": "execution",
      "tags": [...]
    }
  },
  "metrics": {
    "<metric_name>": {
      "measure": "...",
      "root_model": "...",
      "quality_tier": "approved" | "candidate" | "blocked",
      "semantic_status": "approved" | "candidate",
      "allowed_dimensions": [...],
      "supported_time_grains": [...],
      "question_synonyms": [...]
    }
  },
  "relationships": [...],
  "coverage_summary": {...}
}

The MCP server reads this whole file at startup and on each refresh (default poll: 300s; force via reload_semantic_registry).

Build flow

                 ┌──────────────────────────────────────┐
                 │   Author writes / edits files in:    │
                 │   models/**/*.sql                    │
                 │   models/**/schema.yml               │
                 │   semantic/authoring/**/*.yml        │
                 │   semantic/relationships/**/*.yml    │
                 └──────────────────────────────────────┘
                 ┌──────────────────────────────────────┐
                 │  dbt run / dbt compile               │
                 │  → target/manifest.json              │
                 │  → target/catalog.json               │
                 │  → target/semantic_manifest.json     │
                 └──────────────────────────────────────┘
                 ┌──────────────────────────────────────┐
                 │  python3 scripts/semantic/           │
                 │    build_registry.py                 │
                 │  → target/semantic_registry.json     │
                 │  → target/semantic_validation_       │
                 │     report.json                      │
                 └──────────────────────────────────────┘
                 ┌──────────────────────────────────────┐
                 │  Publish to GitHub Pages             │
                 │  (https://gnosischain.github.io/     │
                 │   dbt-cerebro/semantic_registry.json)│
                 └──────────────────────────────────────┘
                 ┌──────────────────────────────────────┐
                 │  cerebro-mcp polls every 300s        │
                 │  (or force-refresh via               │
                 │   reload_semantic_registry tool)     │
                 └──────────────────────────────────────┘
                 ┌──────────────────────────────────────┐
                 │  AI agent / dashboard / analyst:     │
                 │  discover_metrics → query_metrics    │
                 └──────────────────────────────────────┘

Validation pass

build_registry.py --validate runs all of these checks. Each maps to a specific error code in target/semantic_validation_report.json:

Error code Cause Fix
ambiguous_measure_binding Two semantic_models declare a measure with the same name; the metric pointing at it is non-deterministic. Rename the measure on the intended source to be unique (convention: <metric_name>_value).
missing_measure A metric references a measure name that no semantic_model declares. Typo or stale reference. Either remove the metric or add the measure.
metric_missing_root_model The measure binding produced no root_model. Almost always a downstream effect of missing_measure or ambiguous_measure_binding. Fix the underlying measure binding.
unknown_left_model / unknown_right_model A relationship references a dbt model name not in target/manifest.json. Either build the missing model or fix the typo in the relationship file.
approved_model_missing_measures A semantic_model tagged quality_tier: approved has no measures: block. Either add measures or demote to candidate.
allow_any_join_not_approved (warning) An allow_any_join: true relationship is not approved. Joins that collapse duplicates need explicit approval. Either set quality_tier: approved and review the join, or set allow_any_join: false.

The validation is non-fatal by default; CI should run it with --validate and fail the build on errors.

Where the planner code lives

Bug fixes and feature work for the planner happen in the cerebro-mcp repo (separate from dbt-cerebro). Key files:

  • src/cerebro_mcp/semantic_sql_compiler.py — emits ClickHouse SQL from the metric plan. The _AGG_TO_CLICKHOUSE map lives here (translates count_distinctuniqExact, etc.).
  • src/cerebro_mcp/semantic_planner.py — resolves dimension bindings, including the time-spine upcast (_try_time_spine_upcast).
  • src/cerebro_mcp/semantic_graph.py — builds the reachability graph from relationships and runs find_safest_path.
  • src/cerebro_mcp/semantic_loader.py — registry refresh, ETag-based poll, and the force-reload path used by reload_semantic_registry.
  • src/cerebro_mcp/tools/semantic.py — the MCP-tool surface (discover_metrics, query_metrics, explain_metric_query, reload_semantic_registry).

Changes to the planner or the SQL compiler should land as small PRs with unit-test coverage (see tests/test_semantic_*.py in cerebro-mcp).