Measurement Stack (MTA + MMM)¶
This page is the implementation reference for the 24 dbt models that back Multi-Touch Attribution and Marketing Mix Modeling on Gnosis sectors. For the conceptual framing of MTA see the MTA Foundations page; for MMM see the MMM Foundations page.
Why this is a dbt stack and not a semantic-layer query¶
The Cerebro semantic layer covers most analytics with composable metric / dimension definitions. The measurement stack lives in dbt because each layer breaks an assumption the semantic layer makes:
- The pseudonym salt is build-time only.
pseudonymize_addressreadsCEREBRO_PII_SALTfrom the dbt environment. The salt is never exposed at query time, so a hash from a raw address can only be produced inside a dbt run. Identity bridges therefore cannot be expressed as a semantic metric — they must be materialised tables. - Journey spines are relational, not aggregate. The semantic layer composes aggregate measures over dimensional cuts. A journey is a per-conversion ordered list of prior touchpoints — a JOIN, not an aggregate. Materialising the JOIN once (with leakage guards baked in) saves every downstream query from re-doing it.
- The MMM weekly spine needs continuous-time fill. The regression pipeline must see every week even when no upstream source emitted a row. The dbt intermediates use the
weekly_spinemacro plus registry-drivenLEFT JOINto manufacture the missing weeks. A semantic-layer query over the raw sources would silently truncate. - Per-row credit math depends on per-conversion aggregates. The attribution mart computes
td_raw / sum(td_raw) per conversionfor the time-decay rule and1 / count(*) per conversionfor linear. These are window-style aggregates that the v3 implementation rewrote as pre-aggregate-then-JOIN to fit the 10 GiB memory cap (see the attribution mart description below). - OOM constraints. The role fan-out × 30-day lookback for GP journeys OOM'd at the cluster's per-query memory cap on a single pass; the dbt model uses
meta.full_refresh.batch_months: 1soscripts/full_refresh/refresh.pyrebuilds it month-by-month. Semantic-layer queries can't express that.
The internal_only boundary¶
Two models in the stack are tagged internal_only + privacy:tier_internal and configured with meta.expose_to_mcp = false:
int_execution_gnosis_app_user_identity_bridgeint_execution_gpay_user_identity_bridge
These bridges are the only place in the warehouse where raw addresses live alongside pseudonyms. cerebro-mcp's discover_models filter hides them from agents, the INTERNAL_ONLY_TABLES runtime guard denies them from execute_query, and cerebro-api never registers a route for them (no api:<resource> tag). Every other model in the stack reads from the bridge once at build time and keys on user_pseudonym thereafter.
See Privacy & Pseudonyms for the full pseudonymization approach.
The 24-model layout¶
Gnosis App (10 models)¶
| Layer | Model | Purpose |
|---|---|---|
| Identity | int_execution_gnosis_app_user_identity_bridge | Raw address ↔ user_pseudonym pairing. Internal only. |
| Touchpoints | int_execution_gnosis_app_events_chain_unified | Long-form chain event log keyed by pseudonym. UNIONs onboard, six heuristic kinds, swap_signed, swap_filled, topup, marketplace_buy, token_offer_claim. |
| Touchpoints | int_execution_gnosis_app_events_mixpanel_unified | Long-form Mixpanel event log (existing model, referenced for completeness). |
| Touchpoints | int_execution_gnosis_app_user_events_unified | UNION ALL of chain + Mixpanel — the touchpoint table the MTA persona reads. |
| Conversions | int_execution_gnosis_app_conversions | One row per conversion with conversion_kind as a column (topup / swap_filled / token_offer_claim / marketplace_buy). |
| Diagnostics | int_execution_gnosis_app_coverage_daily | Per-(date, kind) tracked-coverage stats with the leakage guard pre-applied. |
| Marts | fct_execution_gnosis_app_journeys_{7,30,60}d | Pre-joined journey spine via build_journey_lookback(N, 'gnosis_app'). The 30d cut is the default; 7d and 60d are sensitivity-sweep variants emitted from the same macro. |
| Marts | fct_execution_gnosis_app_funnel_daily | windowFunnel diagnostics driven by seed mta_funnels.csv. |
| Marts | fct_execution_gnosis_app_attribution_{7,30,60}d | First / last / linear / time-decay credit per (kind, event_kind) over the 180-day window. Time-decay half-life is fixed at 7 days regardless of lookback (it is a customer-decay parameter, not a window parameter). |
| API | api_execution_gnosis_app_attribution_{7,30,60}d | View passthrough, Tier1, granularity:rolling_180d. |
Gnosis Pay (7 models)¶
The GP stack mirrors the GA stack with one extra column on every model: identity_role (LowCardinality, values safe_self / initial_owner / delegate).
| Layer | Model | Purpose |
|---|---|---|
| Identity | int_execution_gpay_user_identity_bridge | Raw address × identity_role ↔ user_pseudonym pairing. Internal only. A 2-owner Safe with 1 delegate yields 4 rows. |
| Touchpoints | int_execution_gpay_user_events_unified | Chain card / safe activity, role-fanned-out. event_kind ∈ gp.payment / gp.cashback_claim / gp.deposit / gp.withdrawal / gp.action_other. |
| Conversions | int_execution_gpay_conversions | gpay_payment / gpay_cashback_claim / gpay_funded (first-ever inflow, computed across full history). |
| Diagnostics | int_execution_gpay_coverage_daily | Per-(date, kind, role) tracked coverage. |
| Marts | fct_execution_gpay_journeys_{7,30,60}d | build_journey_lookback(N, 'gpay') — adds identity_role to the JOIN. Microbatched in 1-month chunks (3-role × Nd lookback OOM'd single-pass on the 10 GiB cluster cap). The 60d variant is the heaviest of the three: monthly batches occasionally push the 10 GiB cap and refresh.py auto-retries the rare batch that does. |
| Marts | fct_execution_gpay_attribution_{7,30,60}d | First / last / linear / time-decay per (kind, role, event_kind). |
| API | api_execution_gpay_attribution_{7,30,60}d | View passthrough, Tier1. |
MMM (7 models)¶
| Layer | Model | Purpose |
|---|---|---|
| Intermediate | int_execution_mmm_kpis_weekly | Long-form weekly KPI registry. Continuous spine via weekly_spine macro. |
| Intermediate | int_execution_mmm_media_weekly | Long-form weekly media (incentive / reward / outlay) registry. |
| Intermediate | int_execution_mmm_controls_weekly | Long-form weekly controls (gas, prices, week-of-year, holiday, hardfork step). |
| Marts | fct_execution_mmm_spine_weekly | Wide-pivot spine — one row per week, one column per (kpi | media | control). The MMM analyst's regression input. |
| Marts | fct_execution_mmm_baseline_latest | Per-(KPI, media) baseline median over weeks where the media's adstocked spend (λ=0.5) is in the bottom decile. |
| Marts | fct_execution_mmm_collinearity_latest | Pairwise Pearson correlation matrix between media columns over the trailing 730 days. |
| API | api_execution_mmm_spine_weekly | View passthrough, Tier1, granularity:weekly. |
Macros that bind the stack¶
pseudonymize_address(expr)— keyedsipHash64over the salted, lowercased address. Deterministic; the salt comes fromCEREBRO_PII_SALT. Same macro is used at Mixpanel ingestion so the hashes line up.build_journey_lookback(lookback_days, sector)— generates the body of the per-sector journey spine mart. GP variant adds anidentity_roleequality to the conversion ↔ events join. Used by the live 30d marts and any future 7d / 14d / 60d sensitivity-sweep variants.conversion_kind_to_event_kind(expr)—multiIfmapping eachconversion_kindto its corresponding event_kind. Used by the journey spine and the coverage_daily models to apply the self-exclusion leakage guard. Mirrors seedsmta_conversion_to_event_kind.csvandmta_gp_conversion_to_event_kind.csv— when a new conversion kind is added, the macro and the seed must both be updated.weekly_spine(start, end)— emits a continuous Monday-start week spine. Backs every MMM intermediate so missing-source-week rows are manufactured rather than truncated.apply_monthly_incremental_filter(...)— standard incremental watermark pattern; not new but used pervasively here.
Seeds that drive the stack¶
| Seed | Drives |
|---|---|
mta_event_kinds.csv | Validates GA event_kind values (relationship test). |
mta_conversion_kinds.csv | Validates GA conversion_kind values. |
mta_conversion_to_event_kind.csv | Documents the GA self-exclusion mapping. |
mta_gp_event_kinds.csv | Validates GP event_kind values. |
mta_gp_conversion_kinds.csv | Validates GP conversion_kind values. |
mta_gp_identity_roles.csv | Validates identity_role values (safe_self / initial_owner / delegate). |
mta_gp_conversion_to_event_kind.csv | Documents the GP self-exclusion mapping. |
mta_funnels.csv | One row per funnel; drives fct_execution_gnosis_app_funnel_daily. |
mmm_kpi_registry.csv | One row per KPI — name, units, value method, source model, is_dedup_safe. Drives the weekly spine fill. |
mmm_media_registry.csv | One row per media — name, units, is_outlay, value method, source model. |
mmm_control_registry.csv | One row per control — name, value method, source model. |
mmm_holiday_weeks.csv | Hand-curated holiday weeks (drives ctrl_is_holiday_week). |
mmm_hardfork_steps.csv | Hardfork dates (drives ctrl_hardfork_step). |
OOM mitigations worth knowing about¶
- GP journeys 30d. The 3-role fan-out × all GP activity × 30-day lookback OOM'd at the 10 GiB cluster cap on a single 2-year full-refresh. The model is configured with
meta.full_refresh.batch_months: 1;scripts/full_refresh/refresh.pydrives the per-month batches and auto-retries the rare batch that pushes the limit. - Attribution mart v3 rewrite. The first attempt used
groupArray + arraySort + ARRAY JOIN; the second used window functions partitioned by (user, conversion). Both OOM'd because the partition key forced a 12 M-row sort. The current implementation pre-aggregates per conversion (~22 k rows), then JOINs back to score per touch. Sort space drops from 12 M to ~22 k. Pure GROUP BY + LEFT JOIN, no window functions, no array materialisation. - DEX volume dedup. A per-tx dedup CTE on
int_execution_pools_dex_tradesOOM'd at the 10 GiB cap. Thedex_volume_usd_dedupKPI accepts the known multi-hop overcount (first-hop-only) and surfacesis_dedup_safe = falseon every row. Downstream consumers (and the analyst persona) see the flag and caveat any conclusions. - Grace-hash join algorithm. Models that JOIN the bridge to wide upstreams set
pre_hook=["SET join_algorithm = 'grace_hash'"]and reset inpost_hookto keep peak memory bounded across the UNIONs.
Cross-references¶
- Privacy & Pseudonyms — the keyed-hash design.
- Incremental Strategies —
delete+insertvsappend, howstart_month/end_monthvars drive batched runs. - MTA Foundations — the conceptual layer.
- MMM Foundations — the conceptual layer.
- Measurement Flow (MCP) — how agents consume these models.