The dangerous failures are silent. Research into failure modes established that ~97% of wrong LLM-generated SQL executes cleanly and returns a plausible but incorrect number — crashes are not the threat model. This shapes everything downstream: every safeguard layer must operate at the semantic level, not just the syntactic.
The hardening stack has a natural ordering. Read-only role enforcement is the cheapest, most reliable boundary and should be deployed first. Schema grounding — preventing hallucinated tables and columns — must precede query validation (AST structural checks, allowlists) because you cannot validate references you haven’t resolved. Row-level security and observability anchor the ends: RLS is a database-layer contract that no application code can override; observability is the only mechanism for detecting silent correctness failures in production.
Schema grounding and query validation pull in opposite directions philosophically. Schema grounding research favors LLM-based methods: function-calling schema injection, RAG over schema embeddings, and LLM self-reflection as a correction pass. Query validation research favors deterministic methods: SQL parsers, AST analysis, statement allowlists. The tension is real — LLM-based correction is more flexible but non-auditable; AST checks are auditable but cannot catch semantic hallucinations. A production pipeline needs both: deterministic AST checks for structural safety, LLM-mediated grounding for semantic accuracy.
Prompt injection is the residual threat that deterministic layers cannot close. If a user embeds SQL fragments or schema-altering phrases in their natural-language input, function-calling injection and AST validation may both pass without flagging anything. The only hard containment is the database layer — read-only roles (no DML regardless of query content) and RLS (no cross-tenant rows regardless of WHERE clause). These are not optional optimizations; they are the backstop for everything above them.
pgvector extends the paradigm, it doesn’t replace it. Semantic row search — embedding row content and querying by cosine similarity — answers questions that SQL cannot express compactly. The pgvector research shows that hybrid SQL+vector queries (WHERE-clause filtering combined with ANN search using HNSW or IVFFlat) are the production pattern, not pure vector retrieval. Index parameter tuning (ef_construction, m, lists) has a direct recall/throughput tradeoff that must be measured per workload; defaults are rarely appropriate at scale.
Coverage gaps in this expedition. Observability and audit logging were researched at a shallow depth (2-minute read, 7 citations). The multi-tenant/RLS angle received similarly light treatment (3-minute read, 5 citations). Both areas warrant dedicated deep-research runs — particularly around structured logging schemas for LLM query pipelines, anomaly detection over query patterns, and RLS policy testing frameworks.
The open question after all guards are in place: semantic correctness is unverifiable without ground-truth labels. A well-grounded, structurally valid query can silently aggregate on the wrong time window, join on the wrong key, or misinterpret an ambiguous column name. No current automated technique reliably detects this class of error at inference time — and whether LLM self-grading of result plausibility is reliable enough for production quality gates remains the most consequential unresolved question in the text-to-SQL safety literature.