The author mentions it's more than just Parquet that DuckDB "just works" with. They're right. Here's a single query against Postgres, Parquet, JSON, CSV, and Google Sheets:

    -- Attach a read-only Postgres database alongside the remote Parquet, JSON,
    -- CSV, and Google Sheets sources, then join across all of them.
    ATTACH 'postgres://your_username:your_password@your_host:5432/postgres' AS pgdb (TYPE postgres, READ_ONLY);

    WITH interests AS (
        SELECT *
        FROM read_parquet('https://storage.googleapis.com/duck-demo-data/user_interests.parquet')
    ), user_preferences AS (
        SELECT *
        FROM read_json_auto(
            'https://storage.googleapis.com/duck-demo-data/user_preferences.jsonl',
            format = 'newline_delimited',
            records = true
        )
    ), user_details AS (
        SELECT * FROM pgdb.user_details
    ), users AS (
        SELECT *
        FROM read_csv_auto('https://storage.googleapis.com/duck-demo-data/users.csv')
    ), one_more_thing AS (
        SELECT *
        FROM read_csv_auto(
            'https://docs.google.com/spreadsheets/export?format=csv&id=1O-sbeSxCpzhzZj5iTRnOplZX-dIAiQJAeIO0mlh2kSU',
            normalize_names = true
        )
    )
    SELECT
        users.user_id,
        users.name,
        interests.interest,
        user_preferences.theme,
        user_preferences.language,
        user_details.hobby,
        one_more_thing.one_more_thing
    FROM users
    LEFT JOIN interests        ON users.user_id = interests.user_id
    LEFT JOIN user_preferences ON users.user_id = user_preferences.user_id
    LEFT JOIN user_details     ON users.user_id = user_details.user_id
    LEFT JOIN one_more_thing   ON users.user_id = one_more_thing.user_id;

https://www.definite.app/blog/query-any-ducking-thing
Haha, that's ridiculous in the best sort of way.

I feel DuckDB is at serious risk of becoming the load-bearing glue that holds everything together in some cursed future software stack. A sort of curl-of-data-integration.

If you do this, it's really worth thinking about the ordering of your data and the size of the partitions. Bigger partitions make sense locally, but much smaller ones make sense remotely.

Ordering your files the way you query them makes a big difference; the goal is to read only the subset of the data you care about.
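
Roughly what that looks like with the DuckDB Python API (a sketch, with made-up table, column, and path names):

    import duckdb

    con = duckdb.connect()

    # Toy data standing in for a real table (all names here are made up).
    con.sql("""
        CREATE TABLE events AS
        SELECT
            (DATE '2024-01-01' + (range % 30) * INTERVAL 1 DAY)::DATE AS event_date,
            range % 10000 AS user_id,
            random() AS value
        FROM range(1000000)
    """)

    # Partition on the coarse key you filter by, sort within partitions on the
    # next filter column, and keep row groups small-ish for remote reads.
    con.sql("""
        COPY (SELECT * FROM events ORDER BY event_date, user_id)
        TO 'events_parquet'
        (FORMAT PARQUET, PARTITION_BY (event_date), ROW_GROUP_SIZE 100000)
    """)

    # Later, possibly over HTTP/S3: partition pruning plus row-group min/max
    # statistics mean only a small slice of the files is actually fetched.
    con.sql("""
        SELECT count(*)
        FROM read_parquet('events_parquet/**/*.parquet', hive_partitioning = true)
        WHERE event_date = DATE '2024-01-05'
          AND user_id BETWEEN 1000 AND 2000
    """).show()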

Lastly, this used to require other libraries, though it may be baked into DuckDB by now: you can write a single file holding the Parquet metadata for the whole dataset. That way you load one metadata file and immediately know which data files to go to for the rest.
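
I'm not sure what's baked in these days, but DuckDB does at least expose each file's footer statistics through parquet_metadata(), which is the same information a consolidated metadata file would carry. A sketch against the made-up files written above:

    import duckdb

    # Per-row-group min/max statistics from each file's footer: this is what
    # lets a reader decide which files and row groups to fetch without
    # scanning the data itself.
    duckdb.sql("""
        SELECT file_name, row_group_id, path_in_schema, stats_min, stats_max
        FROM parquet_metadata('events_parquet/**/*.parquet')
        WHERE path_in_schema = 'user_id'
        ORDER BY file_name, row_group_id
    """).show()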

“DuckDB in general keeps surprising me with these unobtrusive low friction integrations.”

I couldn’t agree more. DuckDB has been fantastic. I originally started playing around with it for poking at parquet files, but often use it for CSVs, JSON, whatever.

And another thing to note is that DuckDB plays really nicely in a Jupyter notebook. Very useful when playing around with different data.
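
For example (a minimal sketch with a made-up DataFrame), the Python API can query DataFrames already sitting in the notebook's scope by name and hand the result straight back:

    import duckdb
    import pandas as pd

    # Any DataFrame already in the notebook's scope can be queried by name.
    orders = pd.DataFrame({
        "user_id": [1, 1, 2, 3],
        "amount": [10.0, 5.0, 7.5, 3.0],
    })

    # duckdb.sql() returns a relation; .df() hands back a pandas DataFrame,
    # which drops straight into plotting or further wrangling.
    top_spenders = duckdb.sql("""
        SELECT user_id, sum(amount) AS total
        FROM orders
        GROUP BY user_id
        ORDER BY total DESC
    """).df()

    print(top_spenders)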

On the Parquet topic, the Polars LazyFrame workflow of operating on a Parquet file is so convenient for me when wrangling hundreds of millions of rows: you avoid loading the data into a SQL server if you don't need to.
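
Something like this sketch (file and column names are placeholders): the scan is lazy, so the filter and aggregation are pushed down and nothing is materialized until collect():

    from datetime import date

    import polars as pl

    # Lazily scan the file; nothing is read yet, and the filter/aggregation are
    # pushed down so only the needed columns and row groups get fetched.
    result = (
        pl.scan_parquet("big_table.parquet")
        .filter(pl.col("event_date") == date(2024, 1, 5))
        .group_by("user_id")
        .agg(pl.col("value").sum().alias("total"))
        .sort("total", descending=True)
        .collect()  # materialization happens only here
    )
    print(result)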
