Five layered interventions — linking, representation, retrieval, validation, semantic layer — each catching a failure mode the one before it misses. Build in order.
Align the question to the tables and columns it actually needs.
Serialize selected schema so the model reads types, keys, and values correctly.
Fit large schemas via embedding retrieval and dynamic pruning.
Execute, capture errors, loop until the database accepts the query.
Define business metrics and domain rules once, outside the prompt.
Techniques
Recall vs. false-positive tradeoff
Formats ranked
CREATE TABLE with types, PK, FK; preferred in DAIL-SQL study [6]Evidence impact
Systems
Mechanisms (stack these)
What it fixes
Warehouse-native tools
NLSQLTableQueryEngine + ObjectIndex — retrieves only relevant tables to avoid context overflow. [37]
Agentic: schema tools + Context Store + dynamic golden Q/SQL retrieval via GetFewShotExamples. [35]
| System | Benchmark | Score | Note |
|---|---|---|---|
| AskData + GPT-4o [10] | BIRD test | 81.95% | oracle evidence |
| Human baseline [10] | BIRD test | 92.96% | ceiling |
| CHASE-SQL [28] | BIRD test | 73.0% | oracle upper bound 82.79% |
| DAIL-SQL + GPT-4 [7] | Spider leaderboard | 86.6% | Code Representation prompt |
| DivSkill-SQL [13] | Spider 2.0-Lite | 73.13% | enterprise DBs (>1k cols) |
| GPT-4o [13] | Spider 2.0 | 10.1% | vs 86.6% on Spider 1.0 |
| Full-schema prompting [29] | Spider dev | 94.62% | strong model — linking hurts |