arXiv cs.AI·9 June 2026

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

In three lines: UniQL is a benchmark of 24,544 SQL queries across 16 dialects (MySQL, PostgreSQL, T-SQL, etc.) to evaluate LLM generalization in text-to-SQL tasks. Experiments show current LLMs fail to generalize across dialects, with substantial performance variation across database systems.

## UniQL: Why SQLite-centric text-to-SQL benchmarking is a competence illusion

### 1. The prior state

The dominant text-to-SQL benchmarks — Spider, BIRD, WikiSQL — run almost exclusively on SQLite. This convergence has a practical explanation: SQLite is syntactically permissive, serverless, and trivial to embed in evaluation pipelines. The consequence: since 2018, text-to-SQL leaderboards have measured the ability to produce valid SQLite on academic schemas. Scores climbed accordingly (GPT-4 exceeds 85% on Spider), creating the impression that the problem is nearly solved.

UniQL breaks that illusion with a direct question: does a model that succeeds on SQLite also produce correct SQL for MySQL, PostgreSQL, T-SQL, and the other dialects the benchmark covers?

### 2. What UniQL actually measures

The benchmark aligns **1,534 natural language questions** with executable SQL annotations across **16 dialects**, yielding **24,544 dialect-specific queries** (1,534 × 16). All dialects share identical intents, aligned schemas, and identical database contents, so a difference in accuracy between two dialects can be attributed to the dialect itself rather than to harder questions or different data.

Construction uses a hybrid pipeline: database migration, automated SQL translation, execution-guided verification, iterative rule summarization, and final human validation. That last step matters: on permissive dialects, a translated query can return the expected results for the wrong reasons, for example because the test data happens to hide a semantic difference. Human validation filters those false positives.
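The execution-guided verification step can be pictured with a minimal sketch. This is not the authors' pipeline: the function names, the `orders` table, and the use of two in-memory SQLite databases as stand-ins for a source and a target DBMS are all assumptions made for illustration. The idea is only that a translated query is accepted when it returns the same multiset of rows as the reference query on identical data.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as an order-insensitive multiset."""
    return Counter(conn.execute(sql).fetchall())


def execution_match(source_conn, source_sql, target_conn, target_sql):
    """Return True when both queries yield the same multiset of rows.

    A target query that fails to execute counts as a mismatch.
    """
    expected = run_query(source_conn, source_sql)
    try:
        actual = run_query(target_conn, target_sql)
    except sqlite3.Error:
        return False
    return expected == actual


def make_db():
    """Build a tiny hypothetical database; both sides get identical contents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "ada", 10.0), (2, "bob", 25.5), (3, "ada", 4.5)],
    )
    return conn


source_db, target_db = make_db(), make_db()
reference = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
good_translation = (
    "SELECT customer, SUM(total) AS s FROM orders "
    "GROUP BY customer ORDER BY customer"
)
bad_translation = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

good_ok = execution_match(source_db, reference, target_db, good_translation)
bad_ok = execution_match(source_db, reference, target_db, bad_translation)
print(good_ok, bad_ok)
```

Note the limitation this sketch shares with any execution-based check: if every customer had exactly one order, `MAX(total)` and `SUM(total)` would return the same rows and the wrong translation would pass. That is the kind of false positive a later human review is meant to catch.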

The 16 dialects span systems that differ substantially in type systems, aggregation functions, date handling, window function syntax, and implicit cast behavior. T-SQL (SQL Server) diverges from PostgreSQL on basics as fundamental as string concatenation (`+` vs `||`) or row limiting (`TOP` vs `LIMIT`).
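To make the scale of these differences concrete, here is a small sketch that renders one intent ("full names of the first n employees") for four dialects. It covers only two well-known divergences, string concatenation and row limiting. The `employees` table, the dialect keys, and the helper names are hypothetical, and the sketch is not taken from the UniQL translation rules.

```python
def concat_expr(dialect, left, right):
    """Render 'left + space + right' string concatenation for one dialect."""
    if dialect in ("postgresql", "sqlite"):
        return f"{left} || ' ' || {right}"
    if dialect == "tsql":
        return f"{left} + ' ' + {right}"
    if dialect == "mysql":
        # By default MySQL treats || as logical OR, so CONCAT() is the safe form.
        return f"CONCAT({left}, ' ', {right})"
    raise ValueError(f"unsupported dialect: {dialect}")


def full_name_query(dialect, limit):
    """Render the same intent as dialect-specific SQL text."""
    name = concat_expr(dialect, "first_name", "last_name")
    if dialect == "tsql":
        # SQL Server limits rows with TOP in the SELECT clause.
        return (
            f"SELECT TOP ({limit}) {name} AS full_name "
            "FROM employees ORDER BY last_name"
        )
    return (
        f"SELECT {name} AS full_name "
        f"FROM employees ORDER BY last_name LIMIT {limit}"
    )


for dialect in ("postgresql", "sqlite", "mysql", "tsql"):
    print(f"{dialect:<11} {full_name_query(dialect, 5)}")
```

Even for this trivial query, three distinct surface forms appear across four dialects, which is why a model that has mostly seen one form can emit SQL that fails on another system.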

### 3. Results: generalization failure is severe

Experiments on both open-source and closed-source LLMs show **substantial performance variation across database systems**. The central finding is that SQLite success does not transfer to other dialects, which undermines current leaderboard rankings as proxies for real-world competence.
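One simple way to quantify "performance variation across database systems" is to compute execution accuracy per dialect and then the gap between the best and worst dialect. The sketch below assumes evaluation outcomes arrive as `(dialect, is_correct)` pairs; that record format, the function names, and the toy data are assumptions for illustration. The numbers are placeholders and are not results from the paper.

```python
from collections import defaultdict


def per_dialect_accuracy(records):
    """Map each dialect to its fraction of correct predictions.

    records: iterable of (dialect, is_correct) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for dialect, is_correct in records:
        totals[dialect] += 1
        hits[dialect] += int(is_correct)
    return {dialect: hits[dialect] / totals[dialect] for dialect in totals}


def accuracy_spread(accuracy):
    """Gap between the best and the worst dialect."""
    return max(accuracy.values()) - min(accuracy.values())


# Placeholder outcomes invented for this example; NOT figures from UniQL.
toy_records = [
    ("dialect_a", True), ("dialect_a", True),
    ("dialect_a", True), ("dialect_a", False),
    ("dialect_b", True), ("dialect_b", False),
    ("dialect_b", False), ("dialect_b", False),
]

accuracy = per_dialect_accuracy(toy_records)
print(accuracy, accuracy_spread(accuracy))
```

Because UniQL aligns the same questions across dialects, a large spread computed this way reflects dialect sensitivity rather than a different mix of question difficulty.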

The proposed mechanism is straightforward: LLM training corpora (Stack Overflow, tutorials, GitHub) likely over-represent SQLite-style SQL. Enterprise dialects such as T-SQL or PL/pgSQL appear less frequently, and rarely in contexts featuring complex queries (recursive CTEs, dialect-specific analytic functions). As a result, models appear to learn SQLite grammar with superficial variations rather than the execution semantics of each dialect.

### 4. Who loses, who gains

**Direct losers:** Teams that optimized text-to-SQL models on Spider/BIRD and communicate high scores as evidence of production-readiness. UniQL exposes that those scores do not predict performance on PostgreSQL or MySQL, two of the most widely deployed dialects in production. NL-to-SQL vendors (Text2SQL.ai, Defog, Vanna) are directly challenged wherever their marketing benchmarks rely on SQLite.

**Potential winners:** Teams building dialect-aware pipelines — whether through dialect-specific fine-tuning, prompting with injected dialect documentation, or syntactic validation post-processing. UniQL finally provides a clean evaluation signal for these approaches.
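Two of the approaches listed above, prompting with injected dialect documentation and syntactic validation as post-processing, can be sketched in a few lines. Everything here is a hypothetical illustration: the rule texts, the `DIALECT_NOTES` and `FORBIDDEN_PATTERNS` tables, and the function names are assumptions, and the regex lint is deliberately naive (it would also flag these tokens inside string literals). A production pipeline would more plausibly validate against a real parser or the target database itself.

```python
import re

# Hypothetical, hand-written dialect notes injected into the prompt.
DIALECT_NOTES = {
    "tsql": [
        "Limit rows with SELECT TOP (n), not LIMIT.",
        "Concatenate strings with + or CONCAT(), not ||.",
    ],
    "postgresql": [
        "Limit rows with LIMIT n, not TOP.",
        "Concatenate strings with ||.",
    ],
}

# Naive patterns for constructs that do not belong in each dialect.
FORBIDDEN_PATTERNS = {
    "tsql": [r"\bLIMIT\b", r"\|\|"],
    "postgresql": [r"\bSELECT\s+TOP\b"],
}


def build_prompt(question, schema, dialect):
    """Assemble a prompt that states the target dialect and its rules."""
    notes = "\n".join(f"- {note}" for note in DIALECT_NOTES[dialect])
    return (
        f"Target dialect: {dialect}\n"
        f"Dialect rules:\n{notes}\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        "SQL:"
    )


def lint_sql(sql, dialect):
    """Return the forbidden patterns that occur in the generated SQL."""
    return [
        pattern
        for pattern in FORBIDDEN_PATTERNS[dialect]
        if re.search(pattern, sql, re.IGNORECASE)
    ]


prompt = build_prompt(
    "Who are the first five employees by last name?",
    "employees(first_name, last_name)",
    "tsql",
)
print(prompt)
print(lint_sql("SELECT first_name FROM employees LIMIT 5", "tsql"))
```

A non-empty lint result could trigger a retry with the violated rule repeated in the prompt. Whether such a loop actually closes the cross-dialect gap is exactly the kind of question a dialect-aligned benchmark can answer.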

**Impact on evaluation practice:** For any text-to-SQL deployment on a non-SQLite DBMS, UniQL immediately becomes the relevant reference benchmark. Using Spider as a proxy for PostgreSQL was already a questionable approximation; after UniQL, it is indefensible.

Code and data are publicly available on GitHub (JerryGao818/UniQL), enabling immediate integration into existing evaluation pipelines. The critical next question is whether models fine-tuned on UniQL generalize better across dialects, or whether dialectal variation is deep enough to require specialized models per dialect family.

Summary generated by Claude — human-verified
