Hybrid Search¶
How search_models, discover_models, and get_relevant_columns rank dbt models and columns.
What it is¶
Cerebro replaces naive token-overlap search with a hybrid ranker:
- BM25 over an enriched per-model blob (name + description + tags + owner + column names + column descriptions + path tokens +
meta.inference_notes). - Token-overlap legacy ranker (substring counting, hand-tuned tie-breakers).
- Reciprocal Rank Fusion (RRF) combines both rankings: items ranked highly by both rise to the top.
A separate Column BM25 index ranks columns within a single model — this is what powers column-scoped schema injection on wide tables.
A networkx DAG of model + source nodes provides deterministic lineage: get_upstream_lineage, get_downstream_impact.
All deterministic. No embeddings, no model loading. Builds in <100ms on the live ~862-model manifest.
When it matters¶
- Every
search_modelscall uses this transparently. get_relevant_columns(model, query, top_k)ranks columns by BM25 against the query and always includes join keys + time grains.- Column-scoped schema injection (
schema_context.py) keeps prompts compact on wide tables (100+ columns) without losing the columns the SQL actually needs.
Measured outcomes¶
On the live manifest (862 models / 75 sources / 5,136 column docs):
| Metric | Hybrid | Legacy | Δ |
|---|---|---|---|
search_models median latency | 2.3 ms | 1.4 ms | +0.9 ms |
hit@1 (12-query eval) | 4/12 (33%) | 1/12 (8%) | +25 pp |
hit@3 | 9/12 (75%) | 5/12 (42%) | +33 pp |
hit@5 | 11/12 (92%) | 10/12 (83%) | +9 pp |
| Lineage query latency | 14 µs median | n/a | — |
| Memory total | ~7 MB | ~4 MB | +3 MB |
The +0.9 ms search overhead is well below typical MCP serialization overhead. hit@1 improved 4× absolute.
Why it improved¶
- Distinctive name tokens dominate. BM25 weights rare tokens (high IDF) — a model whose name contains the query terms gets a strong signal that token-overlap counts the same as any other match.
- Column descriptions add direct query→data signal. A model with a column
effective_balancematches"validator balance"even if its description is generic.
Column-scoped schema injection¶
schema_context.py produces a markdown schema block per model:
- Tables narrower than
SQL_COMPILER_FULL_SCHEMA_THRESHOLD(default 30) → inject every column. - Wider tables → BM25-rank columns against the query, keep top-K + an allowlist of join keys / partition columns / time grains (
_ALWAYS_KEEP_NAMES). - Anaemic floor: if BM25 + always-keep <
top_k, pad with the first K columns so an off-topic query against a wide staging table still gets a usable schema. - The block ends with a comment naming the omitted columns and how to request them via
get_relevant_columns.
Configuration¶
| Variable | Default | Description |
|---|---|---|
SQL_COMPILER_FULL_SCHEMA_THRESHOLD | 30 | Below this column count, full schema is injected |
SQL_COMPILER_TOP_COLUMNS | 20 | Cap for wide-table scoping |
Tools that use this transparently¶
| Tool | What hybrid search does |
|---|---|
search_models | Returns BM25+overlap fused ranking |
discover_models | Same, with filters applied first |
get_relevant_columns | Column BM25, plus join-key allowlist |
get_model_details | No ranking; uses lineage walk |
get_upstream_lineage / get_downstream_impact | networkx ancestors / descendants |
preflight_analytics_request | Search + lineage to gate the request |
Known residual failure modes¶
dex pool fees— column descriptions mentioning "fees" or "pool" promote unrelated models. Classic length-norm vs recall trade-off; deferred fix is multi-field BM25.*_latestvs*_dist_*snapshot models competing with_dailytime-series models. Deferred fix is a name-pattern boost layer.
Best practices¶
- Query for what you want to know, not what you think the table is called. "validator withdrawals last 30d" beats "consensus_layer_withdrawals_table".
- Don't stop at first match. The dispatcher names model tiers but the catalog has more — see Quality Gates.
- Use
get_relevant_columns(model, query)for wide tables before writing SQL — keeps the prompt compact without losing the right columns.
See also¶
- Tools §1
- Quality Gates —
discovered_model_coverageenforces that you actually used the search results - Phase 1 design doc (cerebro-mcp repo) — full sprint write-up