Architecture¶
File layout¶
The semantic layer lives in two places in the dbt-cerebro repo:
dbt-cerebro/
├── models/ -- dbt models (sources, staging, int_, fct_, api_)
│ ├── shared/marts/dim_time_spine_{daily,weekly,monthly}.sql
│ ├── revenue/marts/fct_revenue_per_user_{weekly,monthly}.sql
│ ├── execution/gpay/marts/fct_execution_gpay_users_distinct.sql
│ ├── execution/gnosis_app/marts/fct_execution_gnosis_app_users_distinct.sql
│ └── execution/Circles/marts/fct_execution_circles_human_avatars_distinct.sql
│
└── semantic/ -- authoring + relationships, NOT dbt models
├── authoring/
│ ├── consensus/semantic_models.yml
│ ├── execution/{cow,gpay,gnosis_app,lending,Circles,...}/semantic_models.yml
│ ├── revenue/semantic_models.yml
│ ├── bridges/semantic_models.yml
│ ├── mixpanel_ga/semantic_models.yml
│ └── shared/semantic_models.yml
├── relationships/
│ ├── time_spines.yml -- cross-grain bridges
│ ├── user_pseudonym.yml -- cross-sector user joins
│ ├── execution_graph.yml -- entity-graph joins (Circles, Safe, validator)
│ └── execution_transactions.yml
└── overrides/
└── defaults.yml -- metric aliases, docs enrichments
The compiled output lives in target/semantic_registry.json (one of the artifacts produced by dbt run / dbt compile + the post-build script). The MCP server reads this artifact remotely from GitHub Pages.
What each piece does¶
models/ — dbt SQL¶
The actual SQL that materializes data in ClickHouse. The semantic layer adds zero SQL — it only references models that already exist. The exception is the small set of user-keyed marts and time spines that were created specifically as cross-sector entry points:
dim_time_spine_{daily,weekly,monthly}— reference time dimensionsfct_revenue_per_user_{weekly,monthly}— pseudonym-keyed per-user projections ofint_revenue_fees_{weekly,monthly}_per_userfct_execution_gpay_users_distinct— dedupedint_execution_gpay_safe_identitiesfct_execution_gnosis_app_users_distinct— flag-enriched projection ofint_execution_gnosis_app_user_identitiesfct_execution_circles_human_avatars_distinct— Human-filtered projection ofapi_execution_circles_v2_avatar_metadata
Everything else in models/ predates the semantic-layer work.
semantic/authoring/<module>/semantic_models.yml¶
One file per module. Two top-level blocks:
semantic_models:
- name: <model_or_alias>
model: ref('<dbt_model>')
entities:
- name: user_pseudonym # for cross-sector joins
type: primary
expr: toString(user_pseudonym)
dimensions:
- name: week
type: time
expr: week # may include derivations: addDays(week, 1) etc.
type_params:
time_granularity: week
- name: protocol
type: categorical
expr: label # the underlying SQL column
measures:
- name: lending_deposits_volume_value # MUST be globally unique
agg: sum # see `_AGG_TO_CLICKHOUSE` in cerebro-mcp
expr: deposits_volume_weekly
config:
meta:
cerebro:
owner: analytics_team
quality_tier: approved # approved / candidate / blocked
grain: week
question_synonyms:
- lending deposits weekly
- aave deposits
metrics:
- name: lending_deposits_volume_weekly # the metric ID callers use
label: Lending Deposits Volume (Weekly)
description: >
Sum of weekly deposit volume per (protocol, token).
type: simple
type_params:
measure: lending_deposits_volume_value
config:
meta:
cerebro:
quality_tier: approved
grain: week
owner: analytics_team
allowed_dimensions: [week, symbol, token_class, protocol]
supported_time_grains: [week, month]
question_synonyms:
- lending deposits weekly
- aave deposits weekly
A few invariants the build enforces (see maintenance):
- Measure names are globally unique. Two
value_valuemeasures in two differentsemantic_modelsare anambiguous_measure_bindingerror. - Metrics resolve deterministically.
build_metrics()keysmeasure_to_modelsas a sorted-first dict so the registry is reproducible even when an ambiguity exists. - A metric's root model must be
semantic_status: approvedfor the metric to be executable. Common gotcha: you add a candidate semantic_model and an approved metric pointing at it — the metric appears indiscover_metricsbutquery_metricsrejects it.
semantic/relationships/*.yml¶
Declare cross-sector joins. The MCP planner reads these to find paths between metric roots. Three kinds in production today:
| File | Axis | Purpose |
|---|---|---|
time_spines.yml | day / week / month | Cross-grain composition. Every weekly mart joins to dim_time_spine_weekly. |
user_pseudonym.yml | user_pseudonym | Cross-sector user-overlap. 4-node graph (revenue × gpay × gnosis_app × circles). |
execution_graph.yml | circles_avatar, safe, validator, ... | Entity-specific joins (Circles trust graph, GP wallet ↔ Safe owner, validator withdrawal address ↔ Safe). |
Relationship shape:
relationships:
- name: revenue_per_user_weekly_to_gpay_identity
left_model: fct_revenue_per_user_weekly
right_model: fct_execution_gpay_users_distinct
left_keys: [user_pseudonym]
right_keys: [user_pseudonym]
cardinality: many_to_many
join_semantics: inner
via_entity: user_pseudonym
preferred_bridge: true
safe_for_dimension_enrichment: true
aggregate_then_join_only: false
allow_any_join: false
quality_tier: approved
The fields the planner actually uses:
left_model/right_model/left_keys/right_keys— join condition.via_entity— the entity axis. Edges sharing avia_entityform a reachability graph.cardinality+join_semantics— affects which join type the planner emits (INNERvsLEFT).preferred_bridge— used for cost weighting infind_safest_path(cerebro-mcpsemantic_graph.py).quality_tier— onlyapprovedrelationships are eligible for the planner's path search. Lets you stage new joins ascandidate.
target/semantic_registry.json¶
The build artifact. Schema:
{
"metadata": { "manifest_hash": "...", "catalog_hash": "..." },
"models": {
"<model_name>": {
"name": "...",
"semantic_status": "approved" | "candidate" | "docs_only",
"dimensions": [...],
"measures": [...],
"module": "execution",
"tags": [...]
}
},
"metrics": {
"<metric_name>": {
"measure": "...",
"root_model": "...",
"quality_tier": "approved" | "candidate" | "blocked",
"semantic_status": "approved" | "candidate",
"allowed_dimensions": [...],
"supported_time_grains": [...],
"question_synonyms": [...]
}
},
"relationships": [...],
"coverage_summary": {...}
}
The MCP server reads this whole file at startup and on each refresh (default poll: 300s; force via reload_semantic_registry).
Build flow¶
┌──────────────────────────────────────┐
│ Author writes / edits files in: │
│ models/**/*.sql │
│ models/**/schema.yml │
│ semantic/authoring/**/*.yml │
│ semantic/relationships/**/*.yml │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ dbt run / dbt compile │
│ → target/manifest.json │
│ → target/catalog.json │
│ → target/semantic_manifest.json │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ python3 scripts/semantic/ │
│ build_registry.py │
│ → target/semantic_registry.json │
│ → target/semantic_validation_ │
│ report.json │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Publish to GitHub Pages │
│ (https://gnosischain.github.io/ │
│ dbt-cerebro/semantic_registry.json)│
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ cerebro-mcp polls every 300s │
│ (or force-refresh via │
│ reload_semantic_registry tool) │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ AI agent / dashboard / analyst: │
│ discover_metrics → query_metrics │
└──────────────────────────────────────┘
Validation pass¶
build_registry.py --validate runs all of these checks. Each maps to a specific error code in target/semantic_validation_report.json:
| Error code | Cause | Fix |
|---|---|---|
ambiguous_measure_binding | Two semantic_models declare a measure with the same name; the metric pointing at it is non-deterministic. | Rename the measure on the intended source to be unique (convention: <metric_name>_value). |
missing_measure | A metric references a measure name that no semantic_model declares. | Typo or stale reference. Either remove the metric or add the measure. |
metric_missing_root_model | The measure binding produced no root_model. Almost always a downstream effect of missing_measure or ambiguous_measure_binding. | Fix the underlying measure binding. |
unknown_left_model / unknown_right_model | A relationship references a dbt model name not in target/manifest.json. | Either build the missing model or fix the typo in the relationship file. |
approved_model_missing_measures | A semantic_model tagged quality_tier: approved has no measures: block. | Either add measures or demote to candidate. |
allow_any_join_not_approved (warning) | An allow_any_join: true relationship is not approved. Joins that collapse duplicates need explicit approval. | Either set quality_tier: approved and review the join, or set allow_any_join: false. |
The validation is non-fatal by default; CI should run it with --validate and fail the build on errors.
Where the planner code lives¶
Bug fixes and feature work for the planner happen in the cerebro-mcp repo (separate from dbt-cerebro). Key files:
src/cerebro_mcp/semantic_sql_compiler.py— emits ClickHouse SQL from the metric plan. The_AGG_TO_CLICKHOUSEmap lives here (translatescount_distinct→uniqExact, etc.).src/cerebro_mcp/semantic_planner.py— resolves dimension bindings, including the time-spine upcast (_try_time_spine_upcast).src/cerebro_mcp/semantic_graph.py— builds the reachability graph from relationships and runsfind_safest_path.src/cerebro_mcp/semantic_loader.py— registry refresh, ETag-based poll, and the force-reload path used byreload_semantic_registry.src/cerebro_mcp/tools/semantic.py— the MCP-tool surface (discover_metrics,query_metrics,explain_metric_query,reload_semantic_registry).
Changes to the planner or the SQL compiler should land as small PRs with unit-test coverage (see tests/test_semantic_*.py in cerebro-mcp).