Maintenance & invariants
The semantic-layer registry is a stable contract that downstream consumers (AI agents, dashboards, BI tools) rely on. Keeping that contract honest requires disciplined authoring and an understanding of the invariants the build enforces — and the ones it doesn't.
This page is the operational playbook.
The five invariants
These hold across the project. Breaking any one is a bug.
1. Measure names are globally unique
Every name: under a measures: block must be unique across the entire semantic/authoring/ tree. The build collects all measure names into a measure_name → [model_names] map; when a metric's type_params.measure resolves to two or more models, the binding is ambiguous and the validator emits ambiguous_measure_binding.
Convention: derive the measure name from the metric name (<metric_name>_value, optionally with a further prefix). A metric cow_volume_usd should be backed by a measure such as execution_cow_volume_usd_value; what matters is uniqueness.
The build picks a deterministic winner (sorted(candidates)[0]) so the registry is reproducible even when collisions exist. But the validator will flag every collision as an error if you run with --validate.
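The collision handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_registry.py: the function names, the dictionary shape of a semantic model, the example model names, and the missing_measure label are all assumptions made for the example. Only the measure_name → [model_names] map, the sorted(candidates)[0] tie-break, and the ambiguous_measure_binding code come from the text.

```python
from collections import defaultdict


def collect_measure_owners(semantic_models):
    """Map each measure name to the sorted list of models declaring it."""
    owners = defaultdict(set)
    for model in semantic_models:
        for measure in model.get("measures", []):
            owners[measure["name"]].add(model["name"])
    return {name: sorted(models) for name, models in owners.items()}


def resolve_measure(measure_name, owners):
    """Return (winning_model, errors) for a metric's type_params.measure."""
    candidates = owners.get(measure_name, [])
    if not candidates:
        # Hypothetical label; the real validator's code may differ.
        return None, [f"missing_measure: {measure_name}"]
    errors = []
    if len(candidates) > 1:
        errors.append(f"ambiguous_measure_binding: {measure_name} -> {candidates}")
    # Candidates are sorted, so index 0 is the same on every build.
    return candidates[0], errors


# Hypothetical models: two of them declare the same measure name.
models = [
    {"name": "execution_cow_trades", "measures": [{"name": "volume_usd_value"}]},
    {"name": "execution_amm_swaps", "measures": [{"name": "volume_usd_value"}]},
    {"name": "consensus_validators", "measures": [{"name": "validator_count_value"}]},
]
owners = collect_measure_owners(models)
winner, errors = resolve_measure("volume_usd_value", owners)
print(winner, errors)
```

The build still succeeds here and picks execution_amm_swaps, which is exactly why the collision is dangerous without --validate: the registry is reproducible but may bind the metric to the wrong root.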
2. Root semantic_model's quality_tier matches the metric's
If a metric is approved but its root semantic_model is candidate, the metric appears in the registry but fails at execution time, because _metric_is_executable checks both tiers. This is a common gotcha: authors promote a metric without realising the root model also needs promotion.
Promotion sequence:
1. Set the semantic_model's config.meta.cerebro.quality_tier: approved.
2. Set every metric pointing at it to approved as well.
3. Validate that target/semantic_registry.json shows both as approved and the metric resolves to an approved root.
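The "both must be approved" rule can be illustrated with a small check. This is a sketch under assumptions: the flat quality_tier and root_model keys and the function names are hypothetical, and the real _metric_is_executable may look different.

```python
APPROVED = "approved"


def is_executable(metric, semantic_models):
    """True only when the metric and its root semantic_model are both approved."""
    root = semantic_models.get(metric["root_model"])
    if root is None:
        return False
    return metric["quality_tier"] == APPROVED and root["quality_tier"] == APPROVED


def half_promoted(metrics, semantic_models):
    """Names of approved metrics whose root model has not been promoted."""
    return sorted(
        name
        for name, metric in metrics.items()
        if metric["quality_tier"] == APPROVED
        and not is_executable(metric, semantic_models)
    )


# Hypothetical registry fragments.
semantic_models = {
    "cow_trades": {"quality_tier": "candidate"},
    "validators": {"quality_tier": "approved"},
}
metrics = {
    "cow_volume_usd": {"root_model": "cow_trades", "quality_tier": "approved"},
    "validator_count": {"root_model": "validators", "quality_tier": "approved"},
}
print(half_promoted(metrics, semantic_models))
```

Running a check like this after step 2 of the promotion sequence would surface any metric that was promoted ahead of its root.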
3. Monday-anchored weeks everywhere
dim_time_spine_weekly uses toMonday(day). Every weekly mart must use toStartOfWeek(date, 1) or toMonday(date) to match. Bare toStartOfWeek(date) (default Sunday) silently produces misaligned data; cross-sector composition through dim_time_spine_weekly will join the wrong 7-day windows.
This invariant is not yet enforced by validation. A check that scans every *_weekly.sql for toStartOfWeek(...) calls missing the explicit , 1 mode argument is on the open-improvements list.
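One possible shape for such a scan is sketched below. It is not an existing script: the function names are hypothetical, and the scanner is deliberately naive (it does not skip SQL comments or string literals). It walks each toStartOfWeek( call to its matching parenthesis so that nested calls in the first argument are handled, then flags calls whose second argument is missing or is not 1.

```python
import re

CALL = re.compile(r"\btoStartOfWeek\s*\(")


def top_level_args(sql, start):
    """Split the arguments of a call whose opening '(' ends just before start."""
    depth, args, current = 1, [], []
    for ch in sql[start:]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                args.append("".join(current).strip())
                return args
        if ch == "," and depth == 1:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    return None  # unbalanced parentheses


def misanchored_week_calls(sql):
    """Argument lists of toStartOfWeek calls that are not explicitly mode 1."""
    offenders = []
    for match in CALL.finditer(sql):
        args = top_level_args(sql, match.end())
        if args is not None and (len(args) < 2 or args[1] != "1"):
            offenders.append(args)
    return offenders


sql = """
SELECT
    toStartOfWeek(toDate(block_timestamp)) AS bad_default,
    toStartOfWeek(d, 0) AS bad_sunday,
    toStartOfWeek(d, 1) AS ok_mode,
    toMonday(d) AS ok_monday
FROM t
"""
print(misanchored_week_calls(sql))
```

A CI wrapper would apply misanchored_week_calls to the contents of every *_weekly.sql file and fail when the result is non-empty.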
4. user_pseudonym hash space is project-wide
The pseudonymize_address macro hashes lowercased addresses with a secret CEREBRO_PII_SALT. Every model that joins on user_pseudonym implicitly trusts that this salt has been stable forever. Rotating the salt is theoretically possible but practically catastrophic: every materialised pseudonym in every mart would need a full refresh, and every relationship in the user-pseudonym graph would point at mismatched IDs in the interim.
Authoring rules:
- Never call sipHash64(...) directly on an address; always go through {{ pseudonymize_address('col') }}.
- Never store raw addresses in a model that also stores user_pseudonym. (Use the identity-bridge pattern: raw + pseudonym in an internal-only mart, with a pseudonym-only mart-tier projection.)
- See Privacy & Pseudonyms.
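To make the cost of a salt rotation concrete, the sketch below mimics the idea of a salted hash over a lowercased address. It is an illustration only: keyed BLAKE2b from Python's hashlib stands in for whatever the pseudonymize_address macro actually computes, and the salts and address are made up.

```python
import hashlib


def pseudonymize(address: str, salt: bytes) -> str:
    """Keyed hash of the lowercased address (BLAKE2b is a stand-in here)."""
    data = address.lower().encode("utf-8")
    return hashlib.blake2b(data, key=salt, digest_size=8).hexdigest()


old_salt = b"example-salt-v1"  # hypothetical values, not real secrets
new_salt = b"example-salt-v2"
address = "0xAbCdEf0123456789aBcDeF0123456789abcdef01"

# Lowercasing makes the pseudonym independent of address casing.
same_user = pseudonymize(address, old_salt) == pseudonymize(address.lower(), old_salt)

# After a rotation the same address maps to a different pseudonym, so a mart
# built with the old salt no longer joins to one built with the new salt.
still_joins = pseudonymize(address, old_salt) == pseudonymize(address, new_salt)

print(same_user, still_joins)
```

This is the failure mode behind "every relationship would point at mismatched IDs in the interim": until every mart is refreshed, old and new pseudonyms coexist and never match.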
5. Relationships only reference materialised models
A relationship in semantic/relationships/*.yml is a CI-checked claim that both left_model and right_model exist in target/manifest.json. Renaming a dbt model without updating its relationship references emits unknown_left_model / unknown_right_model errors.
This is why relationships are kept in semantic/relationships/ rather than scattered inside individual semantic_models.yml blocks — they need a holistic refresh whenever models are renamed.
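A rough sketch of the relationship check follows. It assumes relationships have already been loaded from YAML into dictionaries with a hypothetical name key alongside left_model and right_model, and it relies on the dbt manifest layout in which nodes is a mapping whose entries carry name and resource_type. The error tuples are an illustrative shape, not the validator's real output format.

```python
def check_relationships(relationships, manifest):
    """Return (code, relationship_name, model_name) for each unknown model."""
    known = {
        node["name"]
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    }
    errors = []
    for rel in relationships:
        if rel["left_model"] not in known:
            errors.append(("unknown_left_model", rel["name"], rel["left_model"]))
        if rel["right_model"] not in known:
            errors.append(("unknown_right_model", rel["name"], rel["right_model"]))
    return errors


# Hypothetical manifest after a model was renamed to fct_user_trades_daily.
manifest = {
    "nodes": {
        "model.proj.fct_user_trades_daily": {
            "name": "fct_user_trades_daily",
            "resource_type": "model",
        },
        "model.proj.dim_time_spine_weekly": {
            "name": "dim_time_spine_weekly",
            "resource_type": "model",
        },
    }
}
relationships = [
    {
        "name": "trades_to_week_spine",
        "left_model": "fct_user_trades",  # stale reference to the old name
        "right_model": "dim_time_spine_weekly",
    }
]
print(check_relationships(relationships, manifest))
```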
Privacy gates
The semantic-layer quality_tier is NOT the same as the cerebro-api exclude_from_api flag or the MCP expose_to_mcp flag. These are three independent gates:
| Gate | Where | What it blocks |
|---|---|---|
| quality_tier: blocked | semantic_models.yml | The metric never enters the registry. Neither discover_metrics nor query_metrics can reach it. |
| meta.expose_to_mcp: false | models/**/schema.yml (or dbt_project.yml) | The MCP execute_query tool refuses to query this model directly. Raw SQL access is blocked. |
| meta.api.exclude_from_api: true | models/**/schema.yml (or dbt_project.yml) | cerebro-api refuses to expose the model as a REST endpoint. |
The Mixpanel privacy policy uses all three together:
- All Mixpanel marts: meta.api.exclude_from_api: true, so no cerebro-api exposure.
- Per-user grain (api_mixpanel_ga_users_daily): also meta.expose_to_mcp: false AND not registered in semantic/authoring/mixpanel_ga/semantic_models.yml, so completely invisible to MCP consumers.
- Aggregate Mixpanel views (overview, modals, funnel, etc.): MCP-accessible, registered as quality_tier: approved for the few promoted entries.
See Privacy & Pseudonyms for the full policy.
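Because the three gates are independent, it can help to evaluate them side by side. The sketch below is a hypothetical model of the table above, not code from any of the three systems: the ModelFlags class, its defaults for unset flags, and the surface names in the result are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelFlags:
    quality_tier: Optional[str] = None  # None means "not registered at all"
    expose_to_mcp: bool = True          # assumed default when the flag is unset
    exclude_from_api: bool = False      # assumed default when the flag is unset


def visibility(flags: ModelFlags) -> dict:
    """Evaluate each gate on its own; no gate implies another."""
    return {
        "semantic_registry": flags.quality_tier not in (None, "blocked"),
        "mcp_raw_sql": flags.expose_to_mcp,
        "rest_api": not flags.exclude_from_api,
    }


# The two Mixpanel cases described in the policy above.
per_user = ModelFlags(quality_tier=None, expose_to_mcp=False, exclude_from_api=True)
aggregate = ModelFlags(quality_tier="approved", expose_to_mcp=True, exclude_from_api=True)

print(visibility(per_user))
print(visibility(aggregate))
```

The per-user mart is closed on all three surfaces, while the aggregate view stays reachable through the registry and MCP but not through the REST API.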
Authoring checklist
When adding a new metric:
- dbt model exists and is materialised. dbt build --select <model> passes.
- Schema doc in models/<module>/marts/schema.yml covers every column the metric will surface.
- Semantic_model entry in semantic/authoring/<module>/semantic_models.yml:
  - model: ref('<the_dbt_model>') matches an actual model
  - Dimensions enumerated (every column the metric exposes for grouping or filtering)
  - Measures have globally-unique names (convention: <metric_name>_value)
  - config.meta.cerebro.quality_tier matches the intended metric tier (or is candidate if you're still iterating)
  - question_synonyms populated; these drive discover_metrics
- Metric entry in the same file's metrics: block:
  - type_params.measure matches one of the measure names you declared
  - allowed_dimensions enumerates every dimension a caller may pass
  - supported_time_grains lists the natural time grains (day, week, month typically)
  - quality_tier matches the root semantic_model's tier
- If cross-sector: relationship declared in semantic/relationships/*.yml (user_pseudonym.yml, time_spines.yml, or execution_graph.yml)
- Build + validate (scripts/semantic/build_registry.py --validate): zero error_count for the new metric.
- Force-reload the MCP runtime cache (reload_semantic_registry).
- Sanity-check via discover_metrics and query_metrics.
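Several of the metric-entry items in this checklist are mechanical and could be linted before running the full build. The following is a sketch of such a pre-flight check, assuming the YAML has already been parsed into dictionaries; the lint_metric function, the example entries, and the message wording are hypothetical and not part of the existing tooling.

```python
def lint_metric(metric, semantic_model):
    """Return a list of human-readable problems for one metric entry."""
    problems = []
    measures = {m["name"] for m in semantic_model.get("measures", [])}
    dimensions = {d["name"] for d in semantic_model.get("dimensions", [])}

    # type_params.measure must match a declared measure name.
    measure = metric.get("type_params", {}).get("measure")
    if measure not in measures:
        problems.append(f"measure {measure!r} is not declared on the semantic_model")

    # allowed_dimensions may only list dimensions the model enumerates.
    unknown = sorted(set(metric.get("allowed_dimensions", [])) - dimensions)
    if unknown:
        problems.append(f"allowed_dimensions not declared as dimensions: {unknown}")

    # supported_time_grains should not be left empty.
    if not metric.get("supported_time_grains"):
        problems.append("supported_time_grains is empty")
    return problems


# Hypothetical entries: the metric allows a dimension the model never declares.
semantic_model = {
    "measures": [{"name": "cow_volume_usd_value"}],
    "dimensions": [{"name": "date"}, {"name": "token"}],
}
metric = {
    "type_params": {"measure": "cow_volume_usd_value"},
    "allowed_dimensions": ["token", "chain"],
    "supported_time_grains": ["day", "week"],
}
print(lint_metric(metric, semantic_model))
```

A lint like this does not replace --validate, which also sees cross-file problems such as measure-name collisions; it only shortens the feedback loop for single-file mistakes.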
Promotion checklist (candidate → approved)
- Underlying SQL is stable; no known data-quality issues open.
- Column types are stable (no in-flight migration).
- Cross-sector relationships referencing this metric's root are approved (quality_tier: approved).
- Real-world test query via query_metrics returns sensible data.
- If this metric will be composed cross-grain: time-spine relationship in place.
- Documentation updated: at minimum a 1-line description on the metric, ideally a short paragraph on the dbt model's description field.
- Flip quality_tier: candidate → approved on both the semantic_model and the metric. Rebuild + reload.
- Verify with discover_metrics: the metric shows up.
- Verify with query_metrics: real data comes back (no "not approved" error).
Drift modes: what to watch for
| Drift mode | How it presents | Mitigation |
|---|---|---|
| Measure-name collision | New metric resolves to wrong root_model; discover_metrics returns confusing scores. | validate_registry catches at build time. Naming convention: <metric_name>_value. |
| Root semantic_model still candidate | Metric appears in discover_metrics but query_metrics rejects with "not approved". | Promotion checklist above — promote both. |
| Week-anchor outlier | Cross-sector joins return misaligned rows; counts look weirdly low. | Audit *_weekly.sql for bare toStartOfWeek(date). CI check pending. |
| Renamed dbt model | unknown_left_model / unknown_right_model validation error. | Update every semantic/relationships/*.yml that referenced the old name. |
| Stale registry on MCP side | query_metrics returns "metric not found" or pre-promotion state. | Call reload_semantic_registry. If still stale, check CDN cache headers on the GitHub Pages publish. |
| Salt rotation (hypothetical) | All user_pseudonym joins return empty. | Don't rotate. If absolutely necessary, full refresh of every pseudonymized mart in a single transaction. |
Ongoing-maintenance time budget
Realistic projection based on the current shape (50 metrics, 34 relationships, ~150 semantic_models, ~1000 dbt models):
- New cross-sector metric: 30 min (write YAML + validate + smoke test).
- New user-keyed mart: 1-2 h (write SQL + schema doc + semantic_model + relationships + build/test cycle).
- Renaming a dbt model: 15-30 min depending on how many relationships reference it.
- Failed validation in CI: 5-15 min to fix per error code (most are YAML typos or stale references).
- MCP planner bug surfacing in production: variable; expect 1-4 h including PR review on cerebro-mcp.
Recurring time: ~1-2 hours / month if the team scopes properly — i.e. doesn't register every new mart, only the analyst-facing ones.
Tooling
| Tool | Purpose | Where |
|---|---|---|
| scripts/semantic/build_registry.py | Compile semantic/authoring/ + dbt artifacts into target/semantic_registry.json. | dbt-cerebro |
| scripts/semantic/build_registry.py --validate | Add invariant checks (measure uniqueness, missing measures, unknown relationship models, etc.). | dbt-cerebro |
| scripts/semantic/build_semantic_docs.py | Generate semantic_docs_index.json for MCP's gnosis://semantic-model/{name} resource. | dbt-cerebro |
| scripts/semantic/scaffold_candidates.py | Auto-generate candidate semantic_models from new dbt models (rough starting point only; review before committing). | dbt-cerebro |
| scripts/semantic/generate_graph_diagram.py | Auto-generate the Mermaid graph in graph.md. | dbt-cerebro |
| mcp__cerebro-dev__reload_semantic_registry | Force MCP runtime refresh, bypassing the 300s poll. | cerebro-mcp MCP tool |
| tests/test_semantic_registry.py | pytest suite for build_registry.py + validation. | dbt-cerebro |
Open improvements (CI / tooling)
Tracked here so authoring debt stays visible:
- Week-anchor enforcement: scan every *_weekly.sql for toStartOfWeek(*) without explicit , 1 mode. Currently a convention; should be a build check.
- Hash-space rotation guard: alert if CEREBRO_PII_SALT is set to a different value than the one used by any existing pseudonymized mart. Currently nothing prevents an accidental rotation in a non-prod environment.
- Auto-published graph.md: wire scripts/semantic/generate_graph_diagram.py into the registry-build CI step so the graph stays current automatically.
- Cross-grain enforcement: flag a metric in discover_metrics results when its supported_time_grains declare a grain that's not reachable from its root model (i.e. no spine bridge or upcast template exists). Currently surfaces as a runtime error.
- Set-intersection metric type: planner enhancement to support query_metrics([userX_metric, userY_metric]) with no shared dimensions, returning intersection cardinalities. Today this is a raw-query pattern.