An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB’s architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB’s processing speed makes it easier to get more work accomplished.
However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on our roadmap. If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on GitHub or Discord and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!
A traditional SQL SELECT query requires that requested columns be explicitly specified, with one notable exception: the * wildcard. SELECT * allows SQL to return all relevant columns. This adds tremendous flexibility, especially when building queries on top of one another. However, we are often interested in almost all columns. In DuckDB, simply specify which columns to EXCLUDE:
SELECT * EXCLUDE (jar_jar_binks, midichlorians) FROM star_wars
Now we can save time repeatedly typing all columns, improve code readability, and retain flexibility as additional columns are added to underlying tables.
DuckDB’s implementation of this concept can even handle exclusions from multiple tables within a single statement:
SELECT
sw.* EXCLUDE (jar_jar_binks, midichlorians),
ff.* EXCLUDE cancellation
FROM star_wars sw, firefly ff
Similarly, we often wish to use all of the columns in a table aside from a few small adjustments. Previously, this would also prevent the use of * and require a list of all columns, including those that remain unedited. In DuckDB, easily apply changes to a small number of columns with REPLACE:
SELECT
* REPLACE (movie_count+3 as movie_count, show_count*1000 as show_count)
FROM star_wars_owned_by_disney
This allows views, CTEs, or subqueries to be built on one another in a highly concise way, while remaining adaptable to new underlying columns.
A common cause of repetitive and verbose SQL code is the need to specify columns in both the SELECT clause and the GROUP BY clause. In theory this adds flexibility to SQL, but in practice it rarely adds value. DuckDB now offers the GROUP BY we all expected when we first learned SQL - just GROUP BY ALL columns in the SELECT clause that aren’t wrapped in an aggregate function!
SELECT
systems,
planets,
cities,
cantinas,
SUM(scum + villainy) as total_scum_and_villainy
FROM star_wars_locations
GROUP BY ALL
-- GROUP BY systems, planets, cities, cantinas
Now changes to a query can be made in only one place instead of two! Plus this prevents many mistakes where columns are removed from a SELECT list, but not from the GROUP BY, causing duplication.
Not only does this dramatically simplify many queries, it also makes the above EXCLUDE and REPLACE clauses useful in far more situations. Imagine if we wanted to adjust the above query by no longer considering the level of scum and villainy in each specific cantina:
SELECT
* EXCLUDE (cantinas, booths, scum, villainy),
SUM(scum + villainy) as total_scum_and_villainy
FROM star_wars_locations
GROUP BY ALL
-- GROUP BY systems, planets, cities
Now that is some concise and flexible SQL! How many of your GROUP BY clauses could be re-written this way?
Another common cause for repetition in SQL is the ORDER BY clause. DuckDB and other RDBMSs have previously tackled this issue by allowing queries to specify the ordinal positions of the columns to ORDER BY (for example, ORDER BY 1, 2, 3). However, frequently the goal is to order by all columns in the query from left to right, and maintaining that numeric list when adding or removing columns can be error prone. In DuckDB, simply ORDER BY ALL:
SELECT
age,
sum(civility) as total_civility
FROM star_wars_universe
ORDER BY ALL
-- ORDER BY age, total_civility
This is particularly useful when building summaries, as many other client tools automatically sort results in this manner. DuckDB also supports ORDER BY ALL DESC to sort each column in reverse order, and options to specify NULLS FIRST or NULLS LAST.
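As a quick sketch, reusing the query above, the descending variant looks like this (the commented line shows the equivalent explicit clause):
SELECT
age,
sum(civility) as total_civility
FROM star_wars_universe
ORDER BY ALL DESC
-- ORDER BY age DESC, total_civility DESC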
In many SQL dialects, it is not possible to use an alias defined in a SELECT clause anywhere but in the ORDER BY clause of that statement. This commonly leads to verbose CTEs or subqueries in order to utilize those aliases. In DuckDB, a non-aggregate alias in the SELECT clause can be immediately used in the WHERE and GROUP BY clauses, and aggregate aliases can be used in the HAVING clause, even at the same query depth. No subquery needed!
SELECT
only_imperial_storm_troopers_are_so_precise as nope,
turns_out_a_parsec_is_a_distance as very_speedy,
SUM(mistakes) as total_oops
FROM oops
WHERE
nope = 1
GROUP BY
nope,
very_speedy
HAVING
total_oops > 0
DuckDB allows queries to be case insensitive, while maintaining the specified case as data flows into and out of the system. This simplifies queries within DuckDB while ensuring compatibility with external libraries.
CREATE TABLE mandalorian as SELECT 1 as "THIS_IS_THE_WAY";
SELECT this_is_the_way FROM mandalorian;
THIS_IS_THE_WAY |
---|
1 |
Regardless of expertise, and despite DuckDB’s best efforts to understand our intentions, we all make mistakes in our SQL queries. Many RDBMSs leave you trying to use the Force to detect an error. In DuckDB, if you make a typo on a column or table name, you will receive a helpful suggestion about the most similar name. Not only that, you will receive an arrow that points directly to the offending location within your query.
select * from star_trek;
Error: Catalog Error: Table with name star_trek does not exist!
Did you mean "star_wars"?
LINE 1: select * from star_trek;
^
(Don’t worry, ducks and duck-themed databases still love some Trek as well).
DuckDB’s suggestions are even context specific. Here, we receive a suggestion to use the most similar column from the table we are querying.
select long_ago from star_wars;
Error: Binder Error: Referenced column "long_ago" not found in FROM clause!
Candidate bindings: "star_wars.long_long_ago"
LINE 1: select long_ago from star_wars;
^
Even as SQL fans, we know that SQL can learn a thing or two from newer languages. Instead of using bulky SUBSTRING functions, you can slice strings in DuckDB using bracket syntax. Note that SQL is 1-indexed, a slight difference from many other languages (although it keeps DuckDB internally consistent and similar to other DBs).
SELECT 'I love you! I know'[:-3] as nearly_soloed;
nearly_soloed |
---|
I love you! I k |
DuckDB provides nested types to allow more flexible data structures than the purely relational model would allow, while retaining high performance. To make them as easy as possible to use, creating a LIST (array) or a STRUCT (object) uses simpler syntax than in other SQL systems. Data types are automatically inferred.
SELECT
['A-Wing', 'B-Wing', 'X-Wing', 'Y-Wing'] as starfighter_list,
{name: 'Star Destroyer', common_misconceptions: 'Can''t in fact destroy a star'} as star_destroyer_facts
Bracket syntax may also be used to slice a LIST. Again, note that this is 1-indexed for SQL compatibility.
SELECT
starfighter_list[2:2] as dont_forget_the_b_wing
FROM (SELECT ['A-Wing', 'B-Wing', 'X-Wing', 'Y-Wing'] as starfighter_list);
dont_forget_the_b_wing |
---|
[B-Wing] |
Use convenient dot notation to access the value of a specific key in a DuckDB STRUCT column. If keys contain spaces, double quotes can be used.
SELECT
planet.name,
planet."Amount of sand"
FROM (SELECT {name: 'Tatooine', 'Amount of sand': 'High'} as planet)
Have you ever removed your final column from a SQL SELECT and been met with an error, only to find you needed to remove the trailing comma as well!? Never? Ok, Jedi… On a more serious note, this feature is an example of DuckDB’s responsiveness to the community. In under 2 days from seeing this issue in a tweet (not even about DuckDB!), this feature was already built, tested, and merged into the primary branch. You can include trailing commas in many places in your query, and we hope this saves you from the most boring but frustrating of errors!
SELECT
x_wing,
proton_torpedoes,
--targeting_computer
FROM luke_whats_wrong
GROUP BY
x_wing,
proton_torpedoes,
For many functions, DuckDB supports multiple names in order to align with other database systems. After all, ducks are pretty versatile - they can fly, swim, and walk! Most commonly, DuckDB supports PostgreSQL function names, but many SQLite names are supported, as well as some from other systems. If you are migrating your workloads to DuckDB and a different function name would be helpful, please reach out - they are very easy to add as long as the behavior is the same! See our functions documentation for details.
SELECT
'Use the Force, Luke'[:13] as sliced_quote_1,
substr('I am your father', 1, 4) as sliced_quote_2,
substring('Obi-Wan Kenobi, you''re my only hope',17,100) as sliced_quote_3
As you are building a query that joins similar tables, you’ll often encounter duplicate column names. If the query is the final result, DuckDB will simply return the duplicated column names without modifications. However, if the query is used to create a table, or nested in a subquery or Common Table Expression (where duplicate columns are forbidden by other databases!), DuckDB will automatically assign new names to the repeated columns to make query prototyping easier.
SELECT
*
FROM (
SELECT
s1.tie_fighter,
s2.tie_fighter
FROM squadron_one s1
JOIN squadron_two s2
ON 1=1
) theyre_coming_in_too_fast
tie_fighter | tie_fighter:1 |
---|---|
green_one | green_two |
DuckDB believes in using specific data types for performance, but attempts to automatically cast between types whenever necessary. For example, when joining between an integer and a varchar, DuckDB will automatically cast them to the same type and complete the join successfully. A LIST or IN expression may also be created with a mixture of types, and they will be automatically cast as well. Also, INT and BIGINT are interchangeable, and thanks to DuckDB’s new storage compression, a BIGINT usually doesn’t even take up any extra space! Now you can store your data as the optimal data type, but use it easily for the best of both worlds!
CREATE TABLE sith_count_int as SELECT 2::INT as sith_count;
CREATE TABLE sith_count_varchar as SELECT 2::VARCHAR as sith_count;
SELECT
*
FROM sith_count_int s_int
JOIN sith_count_varchar s_char
on s_int.sith_count = s_char.sith_count;
sith_count | sith_count |
---|---|
2 | 2 |
There are many other features of DuckDB that make it easier to analyze data with SQL!
DuckDB makes working with time easier in many ways, including by accepting multiple different syntaxes (from other databases) for the INTERVAL data type used to specify a length of time.
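For example, here is a small sketch of two equivalent ways to write the same interval (the date literal is purely illustrative):
SELECT
DATE '1977-05-25' + INTERVAL 5 DAY as a_few_days_later,
DATE '1977-05-25' + INTERVAL '5 days' as same_result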
DuckDB also implements multiple SQL clauses outside of the traditional core clauses, including the SAMPLE clause for quickly selecting a random subset of your data and the QUALIFY clause that allows filtering of the results of window functions (much like a HAVING clause does for aggregates).
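As a rough sketch, reusing the hypothetical star_wars_locations table from earlier:
-- SAMPLE: grab roughly 10% of the rows
SELECT * FROM star_wars_locations USING SAMPLE 10%;
-- QUALIFY: keep only the first city per planet by filtering on a window function
SELECT
planets,
cities,
row_number() OVER (PARTITION BY planets ORDER BY cities) as city_rank
FROM star_wars_locations
QUALIFY row_number() OVER (PARTITION BY planets ORDER BY cities) = 1;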
The DISTINCT ON clause allows DuckDB to select unique combinations of a subset of the columns in a SELECT clause, while returning the first row of data for columns not checked for uniqueness.
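A brief sketch, again using the hypothetical star_wars_locations table, returning one row per planet (whichever city sorts first):
SELECT DISTINCT ON (planets)
planets,
cities,
cantinas
FROM star_wars_locations
ORDER BY planets, cities;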
In addition to what has already been implemented, several other improvements have been suggested. Let us know if one would be particularly useful - we are flexible with our roadmap! If you would like to contribute, we are very open to PRs and you are welcome to reach out on GitHub or Discord ahead of time to talk through a new feature’s design.
- Choose columns via regex
  - Decide which columns to select with a pattern rather than specifying columns explicitly
  - Clickhouse supports this with the COLUMNS expression
- Incremental column aliases
  - Refer to previously defined aliases in subsequent calculated columns rather than re-specifying the calculations
- Dot operators for JSON types
  - The JSON extension is brand new (see our documentation!) and already implements friendly -> and ->> syntax (a brief sketch follows this list)
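As a rough illustration of those arrow operators (assuming the JSON extension is installed and loaded; the sample document is made up):
SELECT
j->'family' as json_result,
j->>'family' as text_result
FROM (SELECT '{"family": "anatidae", "species": ["duck", "goose"]}'::JSON as j);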
Thanks for checking out DuckDB! May the Force be with you…