Maintenance & invariants¶

The semantic-layer registry is a stable contract that downstream consumers (AI agents, dashboards, BI tools) rely on. Keeping that contract honest requires disciplined authoring and an understanding of the invariants the build enforces — and the ones it doesn't.

This page is the operational playbook.

The five invariants¶

These hold across the project. Breaking any one is a bug.

1. Measure names are globally unique¶

Every name: under a measures: block must be unique across the entire semantic/authoring/ tree. The build collects all measure names into a measure_name → [model_names] map; when a metric's type_params.measure resolves to two or more models, the binding is ambiguous and the validator emits ambiguous_measure_binding.

Convention: prefix the measure with the metric name. A metric cow_volume_usd should be backed by a measure execution_cow_volume_usd_value (or similar — what matters is uniqueness).

The build picks a deterministic winner (sorted(candidates)[0]) so the registry is reproducible even when collisions exist. But the validator will flag every collision as an error if you run with --validate.

2. Root semantic_model's `quality_tier` matches the metric's¶

If a metric is approved but its root semantic_model is candidate, the metric appears in the registry but fails at execution time: _metric_is_executable checks both. Common gotcha when authors promote a metric without realising the root model also needs promotion.

Promotion sequence: 1. Set the semantic_model's config.meta.cerebro.quality_tier: approved. 2. Set every metric pointing at it to approved as well. 3. Validate that target/semantic_registry.json shows both as approved and the metric resolves to an approved root.

3. Monday-anchored weeks everywhere¶

dim_time_spine_weekly uses toMonday(day). Every weekly mart must use toStartOfWeek(date, 1) or toMonday(date) to match. Bare toStartOfWeek(date) (default Sunday) silently produces misaligned data; cross-sector composition through dim_time_spine_weekly will join the wrong 7-day windows.

This invariant is not yet enforced by validation. Adding a check that scans every *_weekly.sql for toStartOfWeek(*, *) without , 1 is on the open-improvements list.

4. `user_pseudonym` hash space is project-wide¶

The pseudonymize_address macro hashes lowercased addresses with a secret CEREBRO_PII_SALT. Every model that joins on user_pseudonym implicitly trusts that this salt has been stable forever. Rotating the salt is theoretically possible but practically catastrophic: every materialised pseudonym in every mart would need a full refresh, and every relationship in the user-pseudonym graph would point at mismatched IDs in the interim.

Authoring rules: - Never call sipHash64(...) directly on an address — always go through {{ pseudonymize_address('col') }}. - Never store raw addresses in a model that also stores user_pseudonym. (Use the identity-bridge pattern: raw + pseudonym in an internal-only mart, with a pseudonym-only mart-tier projection.) - See Privacy & Pseudonyms.

5. Relationships only reference materialised models¶

A relationship in semantic/relationships/*.yml is a CI-checked claim that both left_model and right_model exist in target/manifest.json. Renaming a dbt model without updating its relationship references emits unknown_left_model / unknown_right_model errors.

This is why relationships are kept in semantic/relationships/ rather than scattered inside individual semantic_models.yml blocks — they need a holistic refresh whenever models are renamed.

Privacy gates¶

The semantic-layer quality_tier is NOT the same as the cerebro-api exclude_from_api flag or the MCP expose_to_mcp flag. These are three independent gates:

Gate	Where	What it blocks
`quality_tier: blocked`	`semantic_models.yml`	The metric never enters the registry. Both `discover_metrics` and `query_metrics` can't reach it.
`meta.expose_to_mcp: false`	`models/**/schema.yml` (or `dbt_project.yml`)	The MCP `execute_query` tool refuses to query this model directly. Raw SQL access is blocked.
`meta.api.exclude_from_api: true`	`models/**/schema.yml` (or `dbt_project.yml`)	cerebro-api refuses to expose the model as a REST endpoint.

The Mixpanel privacy policy uses all three together:

All Mixpanel marts: meta.api.exclude_from_api: true — no cerebro-api exposure.
Per-user grain (api_mixpanel_ga_users_daily): also meta.expose_to_mcp: false AND not registered in semantic/authoring/mixpanel_ga/semantic_models.yml — completely invisible to MCP consumers.
Aggregate Mixpanel views (overview, modals, funnel, etc.): MCP-accessible, registered as quality_tier: approved for the few promoted entries.

See Privacy & Pseudonyms for the full policy.

Authoring checklist¶

When adding a new metric:

dbt model exists and is materialised. dbt build --select <model> passes.
Schema doc in models/<module>/marts/schema.yml covers every column the metric will surface.
Semantic_model entry in semantic/authoring/<module>/semantic_models.yml:
- model: ref('<the_dbt_model>') matches an actual model
- Dimensions enumerated (every column the metric exposes for grouping or filtering)
- Measures have globally-unique names (convention: <metric_name>_value)
- config.meta.cerebro.quality_tier matches the intended metric tier (or is candidate if you're still iterating)
- question_synonyms populated — these drive discover_metrics
Metric entry in the same file's metrics: block:
- type_params.measure matches one of the measure names you declared
- allowed_dimensions enumerates every dimension a caller may pass
- supported_time_grains lists the natural time grains (day, week, month typically)
- quality_tier matches the root semantic_model's tier
If cross-sector: relationship declared in semantic/relationships/*.yml (user_pseudonym.yml, time_spines.yml, or execution_graph.yml)

Build + validate:

python3 scripts/semantic/build_registry.py --target-dir target --validate

Zero error_count for the new metric.

Force-reload the MCP runtime cache:

mcp__cerebro-dev__reload_semantic_registry()

Sanity-check via discover_metrics and query_metrics.

Promotion checklist (candidate → approved)¶

Underlying SQL is stable; no known data-quality issues open.
Column types are stable (no in-flight migration).
Cross-sector relationships referencing this metric's root are approved (quality_tier: approved).
Real-world test query via query_metrics returns sensible data.
If this metric will be composed cross-grain: time-spine relationship in place.
Documentation updated — at minimum a 1-line description on the metric, ideally a short paragraph on the dbt model's description field.
Flip quality_tier: candidate → approved on both the semantic_model and the metric. Rebuild + reload.
Verify with discover_metrics: the metric shows up.
Verify with query_metrics: real data comes back (no "not approved" error).

Drift modes — what to watch for¶

Drift mode	How it presents	Mitigation
Measure-name collision	New metric resolves to wrong root_model; `discover_metrics` returns confusing scores.	`validate_registry` catches at build time. Naming convention: `<metric_name>_value`.
Root semantic_model still candidate	Metric appears in `discover_metrics` but `query_metrics` rejects with "not approved".	Promotion checklist above — promote both.
Week-anchor outlier	Cross-sector joins return misaligned rows; counts look weirdly low.	Audit `*_weekly.sql` for bare `toStartOfWeek(date)`. CI check pending.
Renamed dbt model	`unknown_left_model` validation error.	Update every `semantic/relationships/*.yml` that referenced the old name.
Stale registry on MCP side	`query_metrics` returns "metric not found" or pre-promotion state.	Call `reload_semantic_registry`. If still stale, check CDN cache headers on the GitHub Pages publish.
Salt rotation (hypothetical)	All `user_pseudonym` joins return empty.	Don't rotate. If absolutely necessary, full refresh of every pseudonymized mart in a single transaction.

Ongoing-maintenance time budget¶

Realistic projection based on the current shape (50 metrics, 34 relationships, ~150 semantic_models, ~1000 dbt models):

New cross-sector metric: 30 min (write YAML + validate + smoke test).
New user-keyed mart: 1-2 h (write SQL + schema doc + semantic_model
relationships + build/test cycle).
Renaming a dbt model: 15-30 min depending on how many relationships reference it.
Failed validation in CI: 5-15 min to fix per error code (most are YAML typos or stale references).
MCP planner bug surfacing in production: variable; expect 1-4 h including PR review on cerebro-mcp.

Recurring time: ~1-2 hours / month if the team scopes properly — i.e. doesn't register every new mart, only the analyst-facing ones.

Tooling¶

Tool	Purpose	Where
`scripts/semantic/build_registry.py`	Compile `semantic/authoring/` + dbt artifacts into `target/semantic_registry.json`.	dbt-cerebro
`scripts/semantic/build_registry.py --validate`	Add invariant checks (measure uniqueness, missing measures, unknown relationship models, etc.).	dbt-cerebro
`scripts/semantic/build_semantic_docs.py`	Generate `semantic_docs_index.json` for MCP's `gnosis://semantic-model/{name}` resource.	dbt-cerebro
`scripts/semantic/scaffold_candidates.py`	Auto-generate candidate `semantic_models` from new dbt models (rough starting point only — review before committing).	dbt-cerebro
`scripts/semantic/generate_graph_diagram.py`	Auto-generate the Mermaid graph in graph.md.	dbt-cerebro
`mcp__cerebro-dev__reload_semantic_registry`	Force MCP runtime refresh, bypassing the 300s poll.	cerebro-mcp MCP tool
`tests/test_semantic_registry.py`	pytest suite for `build_registry.py` + validation.	dbt-cerebro

Open improvements (CI / tooling)¶

Tracked here so authoring debt stays visible:

Week-anchor enforcement — scan every *_weekly.sql for toStartOfWeek(*) without explicit , 1 mode. Currently a convention; should be a build check.
Hash-space rotation guard — alert if CEREBRO_PII_SALT is set to a different value than the one used by any existing pseudonymized mart. Currently nothing prevents an accidental rotation in a non-prod environment.
Auto-published graph.md — wire scripts/semantic/generate_graph_diagram.py into the registry-build CI step so the graph stays current automatically.
Cross-grain enforcement — flag a metric in discover_metrics results when its supported_time_grains declare a grain that's not reachable from its root model (i.e. no spine bridge or upcast template exists). Currently surfaces as a runtime error.
Set-intersection metric type — planner enhancement to support query_metrics([userX_metric, userY_metric]) with no shared dimensions, returning intersection cardinalities. Today this is a raw-query pattern.