← Default view
Audit and harden text-to-SQL pipelines (2026)  /  Schema Grounding Strategies

Schema Grounding is a Stack

Five layered interventions — linking, representation, retrieval, validation, semantic layer — each catching a failure mode the one before it misses. Build in order.

10.1% GPT-4o on Spider 2.0 [13]
73% Best agent, Spider 2.0-Lite [13]
83% With semantic layer [41]
~40% Raw SQL, undecorated schema [41]
42 Sources · expedition depth

Pipeline overview

L1

Linking & Selection

Align the question to the tables and columns it actually needs.

String match → neural → LLM selection
L2

Representation

Serialize selected schema so the model reads types, keys, and values correctly.

DDL / M-Schema +2.03% over raw DDL
L3

Retrieval & Pruning

Fit large schemas via embedding retrieval and dynamic pruning.

Pruning oracle: 79.3% → 86.3%
L4

Validation & Repair

Execute, capture errors, loop until the database accepts the query.

DIN-SQL self-correction +1.73 pts
L5

Semantic Layer

Define business metrics and domain rules once, outside the prompt.

dbt semantic layer: 83% vs ~40%

Layer details

L1

Schema Linking & Selection

94.62% full-schema EX (strong model)

Techniques

  • RAT-SQL — n-gram string matching, subsequence partial match, relation-aware encoder [1]
  • Neural alignment — deep models learn question↔schema attention [4]
  • LLM in-context selection — LLM picks relevant subset before generation [4]
  • CRUSH4SQL — hallucinate minimal schema, dense-retrieve real match; scales to 17,844 elements [3]
  • Bidirectional retrieval — table-first ∪ column-first to balance recall/FPR [2]

Recall vs. false-positive tradeoff

CHESS recall 97.12%
CHESS FPR 30.57%
RSL-SQL recall 93.28%
RSL-SQL FPR 68.23%
Full schema (strong model) 94.62% EX [29]
The twist. "The Death of Schema Linking?" — strong models like Gemini 1.5 Pro see a net EX reduction when schema linking is applied. Full-schema prompting reaches 94.62% EX. Schema linking is a context-management tool, not an accuracy tool. Skip it if your schema fits and your model reasons well. [29]
L2

Schema Representation

+2.03% M-Schema vs DDL

Formats ranked

  • M-Schema — name, type, description, PK flag, sample values per column; special tokens for DB/table/FK [8]
  • Code Representation (DDL)CREATE TABLE with types, PK, FK; preferred in DAIL-SQL study [6]
  • OpenAI Demonstration — second-best in zero-shot evaluation [6]
  • Sample rows per table — reveals actual values and semantics; truncate long strings [12]
  • External knowledge / evidence — domain rules the schema can't express [10]

Evidence impact

ChatGPT, no evidence 37.22%
ChatGPT, with evidence 42.24% [10]
GPT-4, no evidence 46.35%
GPT-4, with evidence 49.15% [10]
M-Schema vs raw DDL +2.03% avg [27]
Catch. BIRD's evidence is hand-written per query by domain experts — "not a realistic deployment scenario." A deployed agent must retrieve or infer domain context itself. This is the gap the semantic layer exists to fill. [11]
L3

Retrieval & Pruning at Scale

79.3% → 86.3% oracle pruning gain

Systems

  • LitE-SQL — per-column embeddings in ChromaDB, top-25 retrieval, ~25s vs CHESS 84s, 72.1% BIRD [16]
  • CHESS — LSH + vector DB for value/catalog search, adaptive pruning (column filter → table select → final select) [5]
  • CRED-SQL — K-means column clustering, inverse-cluster-size weighting; recall 0.09 → 0.40 @1 [15]
  • LinkAlign — multi-DB routing: query rewrite → argmax LLM → multi-agent; 33.09% Spider 2.0-Lite [14]
  • Hierarchical clustering RAG — dynamic schema sizing vs brittle fixed-k [18]

Pruning numbers

Full oracle schema 79.3% EX
Pruned 71.3% of columns 86.3% EX [17]
CRED-SQL table recall @1 0.40 (vs CRUSH 0.09) [15]
CRED-SQL end-to-end lift +22 pts over CRUSH [15]
Small models, chunk ≤64 cols value literal retrieval [17]
L4

Validation, Execution Feedback & Self-Correction

most reliable lever grounds against live DB

Mechanisms (stack these)

  • PICARD — grammar decoding: incremental SQL parser rejects invalid tokens during generation; T5-3B 74.4%→79.3% EX [20]
  • Execution-guided decoding — conditions on partial execution, prunes error-ing candidates; 83.8% WikiSQL [25]
  • MAC-SQL Refiner — execute → capture error → re-prompt to fix; 59.59% BIRD test, 86.75% Spider dev [23]
  • DIN-SQL self-correction +1.73 pts; MAGIC auto-derives guideline from failure analysis → 85.66% [22]
  • CSC-SQL voting — merge-revises top-2 sampled SQLs; 73.67% BIRD private test (32B) [24]

Accuracy gains

PICARD: T5-3B Spider dev 74.4% → 79.3% [20]
DIN-SQL self-correction 78.62% → 80.35% [22]
MAGIC vs manual guideline 85.66% Spider [22]
CHASE-SQL oracle selection 82.79% upper bound [28]
SSEV weighted voting 86.4% Spider test [26]
L5

Semantic Layer

83% vs ~40% semantic layer vs raw SQL

What it fixes

  • Raw schemas lack business process definitions, metric handling, synonym resolution [39]
  • Define "total sales," dimensions, and filters once; reuse rather than re-infer per query [34]
  • BIRD's hand-crafted evidence approximates a semantic layer, but can't scale — this does [11]
  • OSI (Jan 2026) — vendor-neutral YAML standard, 40+ partners (Snowflake, dbt, Cube, Databricks) [42]

Warehouse-native tools

Snowflake Cortex Analyst YAML semantic model [39]
Databricks Genie Unity Catalog metadata [40]
BigQuery Gemini Dataform views [40]
dbt Semantic Layer 83% vs ~40% [41]
WrenAI MDL ⭐ 15k open-source [33]

Grounding recipe

  1. Always
    Serialize as DDL or M-Schema with types, PK/FK, column descriptions, and a few sample values. [6][8]
  2. Small / medium DB + strong model
    Feed the full schema. Skip explicit linking filters — they can hurt top models. Full-schema prompting reaches 94.62% EX. [29]
  3. Large DB (>~hundreds of columns)
    Embedding retrieval + dynamic (not fixed-k) pruning + literal value retrieval. Static top-k is brittle — too small drops tables, too large injects noise. [16][18][17]
  4. Always
    Execute the candidate, repair on error, and vote across samples. Execution feedback grounds against the actual database, not a representation of it. [23][24]
  5. Enterprise reliability
    Ground on a semantic layer / metadata catalog so domain rules and metric definitions live outside the prompt. [39][41]

Open-source tools

LangChain repository banner
LangChain ⭐ 139k

SQLDatabase / SQLDatabaseChain. Manual prompt + schema filtering. Requires careful engineering for large DBs. [38]

manual filtering
LlamaIndex ⭐ 50k

NLSQLTableQueryEngine + ObjectIndex — retrieves only relevant tables to avoid context overflow. [37]

table retrieval
Vanna AI repository banner
Vanna ⭐ 24k

RAG model trained on DDL + docs + example SQL; retrieves top-10 per query into the LLM prompt. [31]

RAG trained
WrenAI repository banner
WrenAI ⭐ 15k

Open context layer. MDL encodes schema, metrics, and relationships. Defines business meaning once; reuses across queries. [33]

semantic MDL
Dataherald ⭐ 3.6k

Agentic: schema tools + Context Store + dynamic golden Q/SQL retrieval via GetFewShotExamples. [35]

agentic + golden SQL
DAIL-SQL ⭐ 635

Research baseline: Code Representation prompt + similarity-masked demonstrations. 83.5% EX Spider-dev with GPT-4. [7]

research baseline

Benchmark landscape

System Benchmark Score Note
AskData + GPT-4o [10] BIRD test 81.95%
oracle evidence
Human baseline [10] BIRD test 92.96%
ceiling
CHASE-SQL [28] BIRD test 73.0%
oracle upper bound 82.79%
DAIL-SQL + GPT-4 [7] Spider leaderboard 86.6%
Code Representation prompt
DivSkill-SQL [13] Spider 2.0-Lite 73.13%
enterprise DBs (>1k cols)
GPT-4o [13] Spider 2.0 10.1%
vs 86.6% on Spider 1.0
Full-schema prompting [29] Spider dev 94.62%
strong model — linking hurts

In this expedition