The author mentions that it's more than just Parquet that DuckDB "just works" with. They're right. Here's a single query against Postgres, Parquet, JSON, CSV, and Google Sheets:
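-- note: the remote files need the httpfs extension and the ATTACH needs the postgres extension;
-- recent DuckDB versions can auto-install/auto-load these, otherwise INSTALL/LOAD them first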
ATTACH 'postgres://your_username:your_password@your_host:5432/postgres' AS pgdb (TYPE postgres, READ_ONLY);
with interests as (
    select *
    from read_parquet('https://storage.googleapis.com/duck-demo-data/user_interests.parquet')
), user_preferences as (
    select *
    from read_json_auto(
        'https://storage.googleapis.com/duck-demo-data/user_preferences.jsonl',
        format = 'newline_delimited',
        records = true
    )
), user_details as (
    select * from pgdb.user_details
), users as (
    select *
    from read_csv_auto('https://storage.googleapis.com/duck-demo-data/users.csv')
), one_more_thing as (
    select *
    from read_csv_auto(
        'https://docs.google.com/spreadsheets/export?format=csv&id=1O-sbeSxCpzhzZj5iTRnOplZX-dIAiQJAeIO0mlh2kSU',
        normalize_names = true
    )
)
select
    users.user_id,
    users.name,
    interests.interest,
    user_preferences.theme,
    user_preferences.language,
    user_details.hobby,
    one_more_thing.one_more_thing
from users
left join interests on users.user_id = interests.user_id
left join user_preferences on users.user_id = user_preferences.user_id
left join user_details on users.user_id = user_details.user_id
left join one_more_thing on users.user_id = one_more_thing.user_id;
I feel DuckDB is at serious risk of becoming the load-bearing glue that holds everything together in some cursed future software stack. A sort of curl-of-data-integration.
If you do this, it's really worth thinking about how the data is ordered and how big the partitions are. Bigger partitions make sense locally, but much smaller ones make sense when you're reading remotely.
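To make that concrete, here's a rough sketch using DuckDB's Python API; the paths and the partition column are made up for illustration:

import duckdb

con = duckdb.connect()

# Hypothetical layout step: write hive-partitioned Parquet.
# Locally, fewer and bigger files/row groups tend to win; over HTTP or object
# storage, smaller row groups mean a reader can fetch just the byte ranges it
# needs instead of dragging down huge files.
con.sql("""
    COPY (SELECT * FROM read_parquet('events.parquet'))
    TO 'events_partitioned'
    (FORMAT PARQUET, PARTITION_BY (event_date), ROW_GROUP_SIZE 100000)
""")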
Ordering your files to match how you query them makes a big difference; the goal is to read only the subset of data you actually care about.
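A sketch of what that buys you, again with made-up names: sorting on the column you usually filter by gives each Parquet row group tight min/max stats, so DuckDB can skip most of the file.

import duckdb

con = duckdb.connect()

# Rewrite the data sorted on the column queries usually filter on.
con.sql("""
    COPY (SELECT * FROM read_parquet('events.parquet') ORDER BY event_date)
    TO 'events_sorted.parquet' (FORMAT PARQUET)
""")

# With the filter below, only row groups whose event_date min/max range
# covers that day should actually be read.
con.sql("""
    SELECT count(*)
    FROM read_parquet('events_sorted.parquet')
    WHERE event_date = DATE '2024-01-01'
""").show()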
Lastly, this used to require other libraries and may be baked into DuckDB by now, but you can keep a single file with all of the Parquet metadata in it. That means you can load one small file and then know exactly where to go for the rest of the data.
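I'm not sure whether DuckDB handles a "_metadata"-style sidecar file out of the box these days, but it can at least read just the Parquet footers across a set of files, which gets you most of the same effect. A sketch (the glob path is made up):

import duckdb

con = duckdb.connect()

# Only the Parquet footers are read here, not the data pages, so this is a
# cheap way to see which files/row groups could contain what you're after.
con.sql("""
    SELECT file_name, row_group_id, row_group_num_rows,
           path_in_schema, stats_min_value, stats_max_value
    FROM parquet_metadata('events_partitioned/**/*.parquet')
""").show()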
“DuckDB in general keeps surprising me with these unobtrusive low friction integrations.”
I couldn’t agree more. DuckDB has been fantastic. I originally started playing around with it for poking at parquet files, but often use it for CSVs, JSON, whatever.
And another thing to note is that DuckDB plays really nicely in a Jupyter notebook. Very useful when playing around with different data.
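For example, something along these lines (the DataFrame is made up):

import duckdb
import pandas as pd

# A pandas DataFrame in scope can be referenced by name in the SQL
# (DuckDB's replacement scans), and .df() hands the result back to pandas.
users = pd.DataFrame({"user_id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

per_user = duckdb.sql("""
    SELECT user_id, count(*) AS n
    FROM users
    GROUP BY user_id
    ORDER BY n DESC
""").df()

print(per_user)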
On the Parquet topic, the Polars lazyframe workflow of operating on a Parquet file is so convenient for me when wrangling hundreds of millions of rows. You avoid loading the data into a SQL server if you don't need to.
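For reference, the shape of that workflow with recent Polars versions, with made-up file and column names; nothing is loaded until .collect(), and pushdown means only the needed columns and row groups get read:

import polars as pl

# Build a lazy plan against the Parquet file; no data is read yet.
lazy = (
    pl.scan_parquet("big_events.parquet")
    .filter(pl.col("event_date") == pl.date(2024, 1, 1))
    .group_by("user_id")
    .agg(pl.len().alias("events"))
)

# Execution happens here; Polars pushes the filter and column selection
# down into the Parquet scan.
result = lazy.collect()
print(result)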