Hybrid Search¶

How search_models, discover_models, and get_relevant_columns rank dbt models and columns.

What it is¶

Cerebro replaces naive token-overlap search with a hybrid ranker:

BM25 over an enriched per-model blob (name + description + tags + owner + column names + column descriptions + path tokens + meta.inference_notes).
Token-overlap legacy ranker (substring counting, hand-tuned tie-breakers).
Reciprocal Rank Fusion (RRF) combines both rankings: items ranked highly by both rise to the top.

A separate Column BM25 index ranks columns within a single model — this is what powers column-scoped schema injection on wide tables.

A networkx DAG of model + source nodes provides deterministic lineage: get_upstream_lineage, get_downstream_impact.

All deterministic. No embeddings, no model loading. Builds in <100ms on the live ~862-model manifest.

Every search_models call uses this transparently.
get_relevant_columns(model, query, top_k) ranks columns by BM25 against the query and always includes join keys + time grains.
Column-scoped schema injection (schema_context.py) keeps prompts compact on wide tables (100+ columns) without losing the columns the SQL actually needs.

On the live manifest (862 models / 75 sources / 5,136 column docs):

Metric	Hybrid	Legacy	Δ
`search_models` median latency	2.3 ms	1.4 ms	+0.9 ms
`hit@1` (12-query eval)	4/12 (33%)	1/12 (8%)	+25 pp
`hit@3`	9/12 (75%)	5/12 (42%)	+33 pp
`hit@5`	11/12 (92%)	10/12 (83%)	+9 pp
Lineage query latency	14 µs median	n/a	—
Memory total	~7 MB	~4 MB	+3 MB

The +0.9 ms search overhead is well below typical MCP serialization overhead. hit@1 improved 4× absolute.

Distinctive name tokens dominate. BM25 weights rare tokens (high IDF) — a model whose name contains the query terms gets a strong signal that token-overlap counts the same as any other match.
Column descriptions add direct query→data signal. A model with a column effective_balance matches "validator balance" even if its description is generic.

schema_context.py produces a markdown schema block per model:

Tables narrower than SQL_COMPILER_FULL_SCHEMA_THRESHOLD (default 30) → inject every column.
Wider tables → BM25-rank columns against the query, keep top-K + an allowlist of join keys / partition columns / time grains (_ALWAYS_KEEP_NAMES).
Anaemic floor: if BM25 + always-keep < top_k, pad with the first K columns so an off-topic query against a wide staging table still gets a usable schema.
The block ends with a comment naming the omitted columns and how to request them via get_relevant_columns.

Variable	Default	Description
`SQL_COMPILER_FULL_SCHEMA_THRESHOLD`	30	Below this column count, full schema is injected
`SQL_COMPILER_TOP_COLUMNS`	20	Cap for wide-table scoping

Tool	What hybrid search does
`search_models`	Returns BM25+overlap fused ranking
`discover_models`	Same, with filters applied first
`get_relevant_columns`	Column BM25, plus join-key allowlist
`get_model_details`	No ranking; uses lineage walk
`get_upstream_lineage` / `get_downstream_impact`	networkx ancestors / descendants
`preflight_analytics_request`	Search + lineage to gate the request

dex pool fees — column descriptions mentioning "fees" or "pool" promote unrelated models. Classic length-norm vs recall trade-off; deferred fix is multi-field BM25.
*_latest vs *_dist_* snapshot models competing with _daily time-series models. Deferred fix is a name-pattern boost layer.

Query for what you want to know, not what you think the table is called. "validator withdrawals last 30d" beats "consensus_layer_withdrawals_table".
Don't stop at first match. The dispatcher names model tiers but the catalog has more — see Quality Gates.
Use get_relevant_columns(model, query) for wide tables before writing SQL — keeps the prompt compact without losing the right columns.