The author mentions that it's more than just Parquet that DuckDB "just works" with. They're right. Here's a single query against Postgres, Parquet, JSON, CSV, and Google Sheets:
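-- note: the remote files need the httpfs extension and the ATTACH needs the postgres extension;
-- recent DuckDB versions can auto-install/auto-load these, otherwise INSTALL/LOAD them first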
ATTACH 'postgres://your_username:your_password@your_host:5432/postgres' AS pgdb (TYPE postgres, READ_ONLY);
with interests as (
    select *
    from read_parquet('https://storage.googleapis.com/duck-demo-data/user_interests.parquet')
), user_preferences as (
    select *
    from read_json_auto(
        'https://storage.googleapis.com/duck-demo-data/user_preferences.jsonl',
        format = 'newline_delimited',
        records = true
    )
), user_details as (
    select * from pgdb.user_details
), users as (
    select *
    from read_csv_auto('https://storage.googleapis.com/duck-demo-data/users.csv')
), one_more_thing as (
    select *
    from read_csv_auto(
        'https://docs.google.com/spreadsheets/export?format=csv&id=1O-sbeSxCpzhzZj5iTRnOplZX-dIAiQJAeIO0mlh2kSU',
        normalize_names = true
    )
)
select
    users.user_id,
    users.name,
    interests.interest,
    user_preferences.theme,
    user_preferences.language,
    user_details.hobby,
    one_more_thing.one_more_thing
from users
left join interests on users.user_id = interests.user_id
left join user_preferences on users.user_id = user_preferences.user_id
left join user_details on users.user_id = user_details.user_id
left join one_more_thing on users.user_id = one_more_thing.user_id;
I feel DuckDB is at serious risk of becoming the load-bearing glue that holds everything together in some cursed future software stack. A sort of curl-of-data-integration.
If you do this, it's really worth thinking about how the data is ordered and how big the partitions are. Bigger partitions make sense locally, but much smaller ones make sense when you're reading remotely.
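To make that concrete, here's a rough sketch using DuckDB's Python API; the paths and the partition column are made up for illustration:

import duckdb

con = duckdb.connect()

# Hypothetical layout step: write hive-partitioned Parquet.
# Locally, fewer and bigger files/row groups tend to win; over HTTP or object
# storage, smaller row groups mean a reader can fetch just the byte ranges it
# needs instead of dragging down huge files.
con.sql("""
    COPY (SELECT * FROM read_parquet('events.parquet'))
    TO 'events_partitioned'
    (FORMAT PARQUET, PARTITION_BY (event_date), ROW_GROUP_SIZE 100000)
""")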
Ordering your files to match how you query them makes a big difference; the goal is to read only the subset of data you actually care about.
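A sketch of what that buys you, again with made-up names: sorting on the column you usually filter by gives each Parquet row group tight min/max stats, so DuckDB can skip most of the file.

import duckdb

con = duckdb.connect()

# Rewrite the data sorted on the column queries usually filter on.
con.sql("""
    COPY (SELECT * FROM read_parquet('events.parquet') ORDER BY event_date)
    TO 'events_sorted.parquet' (FORMAT PARQUET)
""")

# With the filter below, only row groups whose event_date min/max range
# covers that day should actually be read.
con.sql("""
    SELECT count(*)
    FROM read_parquet('events_sorted.parquet')
    WHERE event_date = DATE '2024-01-01'
""").show()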
Lastly, this used to require other libraries and may be baked into DuckDB by now, but you can keep a single file with all of the Parquet metadata in it. That means you can load one small file and then know exactly where to go for the rest of the data.
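I'm not sure whether DuckDB handles a "_metadata"-style sidecar file out of the box these days, but it can at least read just the Parquet footers across a set of files, which gets you most of the same effect. A sketch (the glob path is made up):

import duckdb

con = duckdb.connect()

# Only the Parquet footers are read here, not the data pages, so this is a
# cheap way to see which files/row groups could contain what you're after.
con.sql("""
    SELECT file_name, row_group_id, row_group_num_rows,
           path_in_schema, stats_min_value, stats_max_value
    FROM parquet_metadata('events_partitioned/**/*.parquet')
""").show()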
“DuckDB in general keeps surprising me with these unobtrusive low friction integrations.”
I couldn’t agree more. DuckDB has been fantastic. I originally started playing around with it for poking at parquet files, but often use it for CSVs, JSON, whatever.
And another thing to note is that DuckDB plays really nicely in a Jupyter notebook. Very useful when playing around with different data.
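For example, something along these lines (the DataFrame is made up):

import duckdb
import pandas as pd

# A pandas DataFrame in scope can be referenced by name in the SQL
# (DuckDB's replacement scans), and .df() hands the result back to pandas.
users = pd.DataFrame({"user_id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

per_user = duckdb.sql("""
    SELECT user_id, count(*) AS n
    FROM users
    GROUP BY user_id
    ORDER BY n DESC
""").df()

print(per_user)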
On the Parquet topic, the Polars lazyframe workflow of operating on a Parquet file is so convenient for me when wrangling hundreds of millions of rows. You avoid loading the data into a SQL server if you don't need to.
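For reference, the shape of that workflow with recent Polars versions, with made-up file and column names; nothing is loaded until .collect(), and pushdown means only the needed columns and row groups get read:

import polars as pl

# Build a lazy plan against the Parquet file; no data is read yet.
lazy = (
    pl.scan_parquet("big_events.parquet")
    .filter(pl.col("event_date") == pl.date(2024, 1, 1))
    .group_by("user_id")
    .agg(pl.len().alias("events"))
)

# Execution happens here; Polars pushes the filter and column selection
# down into the Parquet scan.
result = lazy.collect()
print(result)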