dbt-schema-gen¶
dbt-schema-gen is an LLM-powered tool that generates and maintains schema.yml files for dbt models. It reads SQL model files, infers column metadata, and produces rich documentation with descriptions, data types, and tests.
Purpose¶
Maintaining schema.yml files manually is tedious and often neglected. dbt-schema-gen automates this by:
- Parsing SQL model files to extract column names, config blocks, and references
- Using an LLM to generate meaningful descriptions for models and columns
- Applying rule-based test generation for primary keys, timestamps, and composite keys
- Smart-merging generated content with existing manual edits
LLM Providers¶
dbt-schema-gen supports three LLM providers:
| Provider | Environment Variable | Default Model |
|---|---|---|
| OpenAI | OPENAI_API_KEY | gpt-4o-mini |
| Anthropic Claude | ANTHROPIC_API_KEY | claude-3-sonnet-20240229 |
| Google Gemini | GEMINI_API_KEY | gemini-1.5-flash |
Select the provider via the LLM_PROVIDER environment variable or the dbt_schema_gen.yml configuration file.
Generation Modes¶
| Mode | What It Generates | Speed |
|---|---|---|
full (default) | Descriptions + data types + tests | Standard |
descriptions | Model and column descriptions only | Fastest |
types | Data type inference only | Fast |
tests | Test blocks only | Fast |
# Full generation
dbt-schema-gen --mode full /path/to/dbt/project
# Descriptions only (fastest for documentation)
dbt-schema-gen --mode descriptions /path/to/dbt/project
Smart Merge¶
When schema.yml files already exist, dbt-schema-gen uses a smart merge strategy that distinguishes between three categories of fields:
Fields Synced from SQL (Always Updated)¶
Config values from the SQL {{ config() }} block are always kept in sync:
config.materializedconfig.tags
If you change tags in your SQL model, the schema file updates automatically on the next run.
Preserved Fields (Your Manual Edits)¶
Fields you manually add to the YAML are preserved across regenerations:
meta.ownermeta.authoritative- Any custom
metafields you define
Regenerated Fields (LLM Output)¶
These fields are regenerated on each run:
description(model-level and column-level)columns.data_type(inferred from SQL)
The tool tracks which fields it generated via _generated_fields metadata, so it knows the difference between your edits and its own previous output.
Test Generation¶
Tests are generated using rule-based pattern matching, not by the LLM. This ensures only valid dbt tests are produced.
Column-Level Tests¶
| Column Pattern | Tests Added |
|---|---|
*_id, *_pk, *_index | unique, not_null |
*_timestamp, *_at, date | not_null |
transaction_hash, block_hash | unique, not_null |
Model-Level Tests¶
| Pattern | Test Added |
|---|---|
date + dimension column | dbt_utils.unique_combination_of_columns |
Invalid tests (ones that do not exist in dbt or dbt_utils) are automatically removed during validation.
CLI Usage¶
# Generate schemas for all models in a project
dbt-schema-gen /path/to/dbt/project
# Generate for specific models
dbt-schema-gen -m fct_daily_sales,stg_transactions /path/to/project
# Generate from a subdirectory
dbt-schema-gen /path/to/dbt/project/models/staging
# Force regeneration (overwrite even if columns unchanged)
dbt-schema-gen -o /path/to/project
# Dry run (preview without writing)
dbt-schema-gen --dry-run -v /path/to/project
# Skip test generation
dbt-schema-gen --skip-tests /path/to/project
# Validate existing schemas without regenerating
dbt-schema-gen --validate-only /path/to/project
CLI Options¶
| Option | Description |
|---|---|
-m, --models TEXT | Only process specific models (comma-separated) |
-o, --overwrite | Force regeneration even if columns are unchanged |
--skip-tests | Do not generate test blocks |
--mode | Generation mode: full, descriptions, types, tests |
--dry-run | Preview changes without writing files |
--preserve-config | Preserve user-maintained fields (default: true) |
-v, --verbose | Show detailed output and warnings |
--validate-only | Validate existing schemas only |
--config PATH | Path to custom configuration file |
Configuration¶
Environment Variables¶
# Provider selection
LLM_PROVIDER=openai # openai | anthropic | gemini
# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_TEMPERATURE=0.3
# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-3-sonnet-20240229
ANTHROPIC_TEMPERATURE=0.3
# Google Gemini
GEMINI_API_KEY=AIza...
GEMINI_MODEL=gemini-1.5-flash
GEMINI_TEMPERATURE=0.3
# Rate limiting
GLOBAL_MAX_RPM=10 # Max requests per minute
Project Configuration File¶
Create dbt_schema_gen.yml in the dbt project root for persistent settings:
llm:
provider: openai
model: gpt-4o-mini
temperature: 0.3
generation:
mode: full
skip_tests: false
merge:
preserve_user_fields:
- meta.owner
- meta.authoritative
always_regenerate:
- description
- columns.description
- columns.data_type
- config.tags
- config.materialized
tests:
enabled: true
packages:
- dbt
- dbt_utils
auto_tests:
primary_key: true
not_null_timestamps: true
relationships: false
max_tests_per_column: 3
parsing:
use_sqlglot: true
fallback_to_regex: true
Column Comments¶
You can add hints in your SQL files to improve the quality of LLM-generated descriptions:
-- @column date: The calendar date for this aggregation (UTC timezone)
-- @column total_value: Sum of all transaction values in wei
SELECT
toStartOfDay(block_timestamp) AS date,
SUM(value) AS total_value
FROM {{ ref('stg_execution__transactions') }}
GROUP BY 1
Jinja comment syntax is also supported:
How It Works¶
- Parse SQL -- Uses sqlglot (with regex fallback) to extract column names,
{{ config() }}blocks, and{{ ref() }}/{{ source() }}references - Build context -- Gathers upstream model schemas and source definitions for richer LLM prompts
- Call LLM -- Sends focused prompts to generate model and column descriptions
- Generate tests -- Applies rule-based pattern matching to produce valid tests
- Smart merge -- Merges generated content with existing
schema.yml, preserving manual edits - Write files -- Outputs
schema.ymlto each model's directory
Integration with dbt-cerebro¶
dbt-schema-gen is used alongside dbt-cerebro to maintain documentation for the ~400 models in the project. A typical workflow:
# After adding or modifying models
dbt-schema-gen /path/to/dbt-cerebro
# Review changes
git diff models/
# Commit updated schemas
git add models/*/schema.yml
For CI/CD pipelines, use --validate-only to check that schemas are up to date without modifying files.