Audit and harden text-to-SQL pipelines (2026) / Schema Grounding Strategies

Schema Grounding is a Stack

Five layered interventions — linking, representation, retrieval, validation, semantic layer — each catching a failure mode the one before it misses. Build in order.

10.1% GPT-4o on Spider 2.0 ^[13]

73% Best agent, Spider 2.0-Lite ^[13]

83% With semantic layer ^[41]

~40% Raw SQL, undecorated schema ^[41]

42 Sources · expedition depth

Pipeline overview

Linking & Selection

Align the question to the tables and columns it actually needs.

String match → neural → LLM selection

Representation

Serialize selected schema so the model reads types, keys, and values correctly.

DDL / M-Schema +2.03% over raw DDL

Retrieval & Pruning

Fit large schemas via embedding retrieval and dynamic pruning.

Pruning oracle: 79.3% → 86.3%

Validation & Repair

Execute, capture errors, loop until the database accepts the query.

DIN-SQL self-correction +1.73 pts

Semantic Layer

Define business metrics and domain rules once, outside the prompt.

dbt semantic layer: 83% vs ~40%

Layer details

Schema Linking & Selection

94.62% full-schema EX (strong model)

Techniques

RAT-SQL — n-gram string matching, subsequence partial match, relation-aware encoder ^[1]
Neural alignment — deep models learn question↔schema attention ^[4]
LLM in-context selection — LLM picks relevant subset before generation ^[4]
CRUSH4SQL — hallucinate minimal schema, dense-retrieve real match; scales to 17,844 elements ^[3]
Bidirectional retrieval — table-first ∪ column-first to balance recall/FPR ^[2]

Recall vs. false-positive tradeoff

CHESS recall 97.12%

CHESS FPR 30.57%

RSL-SQL recall 93.28%

RSL-SQL FPR 68.23%

Full schema (strong model) 94.62% EX ^[29]

The twist. "The Death of Schema Linking?" — strong models like Gemini 1.5 Pro see a net EX reduction when schema linking is applied. Full-schema prompting reaches 94.62% EX. Schema linking is a context-management tool, not an accuracy tool. Skip it if your schema fits and your model reasons well. ^[29]

Schema Representation

+2.03% M-Schema vs DDL

Formats ranked

M-Schema — name, type, description, PK flag, sample values per column; special tokens for DB/table/FK ^[8]
Code Representation (DDL) — CREATE TABLE with types, PK, FK; preferred in DAIL-SQL study ^[6]
OpenAI Demonstration — second-best in zero-shot evaluation ^[6]
Sample rows per table — reveals actual values and semantics; truncate long strings ^[12]
External knowledge / evidence — domain rules the schema can't express ^[10]

Evidence impact

ChatGPT, no evidence 37.22%

ChatGPT, with evidence 42.24% ^[10]

GPT-4, no evidence 46.35%

GPT-4, with evidence 49.15% ^[10]

M-Schema vs raw DDL +2.03% avg ^[27]

Catch. BIRD's evidence is hand-written per query by domain experts — "not a realistic deployment scenario." A deployed agent must retrieve or infer domain context itself. This is the gap the semantic layer exists to fill. ^[11]

Retrieval & Pruning at Scale

79.3% → 86.3% oracle pruning gain

Systems

LitE-SQL — per-column embeddings in ChromaDB, top-25 retrieval, ~25s vs CHESS 84s, 72.1% BIRD ^[16]
CHESS — LSH + vector DB for value/catalog search, adaptive pruning (column filter → table select → final select) ^[5]
CRED-SQL — K-means column clustering, inverse-cluster-size weighting; recall 0.09 → 0.40 @1 ^[15]
LinkAlign — multi-DB routing: query rewrite → argmax LLM → multi-agent; 33.09% Spider 2.0-Lite ^[14]
Hierarchical clustering RAG — dynamic schema sizing vs brittle fixed-k ^[18]

Pruning numbers

Full oracle schema 79.3% EX

Pruned 71.3% of columns 86.3% EX ^[17]

CRED-SQL table recall @1 0.40 (vs CRUSH 0.09) ^[15]

CRED-SQL end-to-end lift +22 pts over CRUSH ^[15]

Small models, chunk ≤64 cols value literal retrieval ^[17]

Validation, Execution Feedback & Self-Correction

most reliable lever grounds against live DB

Mechanisms (stack these)

PICARD — grammar decoding: incremental SQL parser rejects invalid tokens during generation; T5-3B 74.4%→79.3% EX ^[20]
Execution-guided decoding — conditions on partial execution, prunes error-ing candidates; 83.8% WikiSQL ^[25]
MAC-SQL Refiner — execute → capture error → re-prompt to fix; 59.59% BIRD test, 86.75% Spider dev ^[23]
DIN-SQL self-correction +1.73 pts; MAGIC auto-derives guideline from failure analysis → 85.66% ^[22]
CSC-SQL voting — merge-revises top-2 sampled SQLs; 73.67% BIRD private test (32B) ^[24]

Accuracy gains

PICARD: T5-3B Spider dev 74.4% → 79.3% ^[20]

DIN-SQL self-correction 78.62% → 80.35% ^[22]

MAGIC vs manual guideline 85.66% Spider ^[22]

CHASE-SQL oracle selection 82.79% upper bound ^[28]

SSEV weighted voting 86.4% Spider test ^[26]

Semantic Layer

83% vs ~40% semantic layer vs raw SQL

What it fixes

Raw schemas lack business process definitions, metric handling, synonym resolution ^[39]
Define "total sales," dimensions, and filters once; reuse rather than re-infer per query ^[34]
BIRD's hand-crafted evidence approximates a semantic layer, but can't scale — this does ^[11]
OSI (Jan 2026) — vendor-neutral YAML standard, 40+ partners (Snowflake, dbt, Cube, Databricks) ^[42]

Warehouse-native tools

Snowflake Cortex Analyst YAML semantic model ^[39]

Databricks Genie Unity Catalog metadata ^[40]

BigQuery Gemini Dataform views ^[40]

dbt Semantic Layer 83% vs ~40% ^[41]

WrenAI MDL ⭐ 15k open-source ^[33]

Grounding recipe

Always
Serialize as DDL or M-Schema with types, PK/FK, column descriptions, and a few sample values. ^[6][8]
Small / medium DB + strong model
Feed the full schema. Skip explicit linking filters — they can hurt top models. Full-schema prompting reaches 94.62% EX. ^[29]
Large DB (>~hundreds of columns)
Embedding retrieval + dynamic (not fixed-k) pruning + literal value retrieval. Static top-k is brittle — too small drops tables, too large injects noise. ^[16][18][17]
Always
Execute the candidate, repair on error, and vote across samples. Execution feedback grounds against the actual database, not a representation of it. ^[23][24]
Enterprise reliability
Ground on a semantic layer / metadata catalog so domain rules and metric definitions live outside the prompt. ^[39][41]

Open-source tools

LangChain ⭐ 139k

SQLDatabase / SQLDatabaseChain. Manual prompt + schema filtering. Requires careful engineering for large DBs. ^[38]

manual filtering

∑

LlamaIndex ⭐ 50k

NLSQLTableQueryEngine + ObjectIndex — retrieves only relevant tables to avoid context overflow. ^[37]

table retrieval

Vanna ⭐ 24k

RAG model trained on DDL + docs + example SQL; retrieves top-10 per query into the LLM prompt. ^[31]

RAG trained

WrenAI ⭐ 15k

Open context layer. MDL encodes schema, metrics, and relationships. Defines business meaning once; reuses across queries. ^[33]

semantic MDL

⚙

Dataherald ⭐ 3.6k

Agentic: schema tools + Context Store + dynamic golden Q/SQL retrieval via GetFewShotExamples. ^[35]

agentic + golden SQL

⊕

DAIL-SQL ⭐ 635

Research baseline: Code Representation prompt + similarity-masked demonstrations. 83.5% EX Spider-dev with GPT-4. ^[7]

research baseline

Benchmark landscape

System	Benchmark	Score	Note
AskData + GPT-4o ^[10]	BIRD test	81.95%	oracle evidence
Human baseline ^[10]	BIRD test	92.96%	ceiling
CHASE-SQL ^[28]	BIRD test	73.0%	oracle upper bound 82.79%
DAIL-SQL + GPT-4 ^[7]	Spider leaderboard	86.6%	Code Representation prompt
DivSkill-SQL ^[13]	Spider 2.0-Lite	73.13%	enterprise DBs (>1k cols)
GPT-4o ^[13]	Spider 2.0	10.1%	vs 86.6% on Spider 1.0
Full-schema prompting ^[29]	Spider dev	94.62%	strong model — linking hurts

In this expedition

expedition

Naive text-to-SQL failure modes

~97% of wrong SQL executes cleanly and returns a plausible, wrong number.

expedition · this page

Schema grounding strategies

Layered stack: linking, representation, retrieval, feedback.

survey

Read-only enforcement & least-privilege roles

Purpose-built roles, column/row filtering, read replicas as write boundaries.

survey

Query validation & safety gates

Five-layer defense: allowlists, AST checks, schema binding, policy injection.

expedition

pgvector semantic row search

Nearest-neighbor row retrieval, indexing, and grounding text-to-SQL.

recon

Observability & audit logging

Debug production, prove compliance and accountability.

recon

Multi-tenant & row-level security

PostgreSQL RLS — cost-efficient isolation with session variables and non-owner roles.