- duckdb -cmd 'CREATE TABLE my_data AS FROM READ_PARQUET($$data.parquet$$)' -ui
`duckdb -ui sqlitedb.db` should work because duckdb can read sqlite files. If it doesn't autoload the extension, you can add the INSTALL/LOAD statements to your ~/.duckdbrc
    duckdb -ui pre_seeded_db.db

    duckdb -ui -init startup.sql

where startup.sql contains various statements/commands like `.open pre_seeded_db.db`. Alternatively, place the statements/commands in `~/.duckdbrc` and just run `duckdb -ui`.
- > In the worst-case scenario, they would be 2-3 years behind the cutting edge, which is not mission-critical.
It's also worth considering how such an event might affect the US and its allies. Would it slow down, or perhaps even halt, certain operations/efforts for the US? For instance, all those chips the US needs to build supercomputers for "weather research". ;)
- Haven't tried it out yet, but the release notes look very promising, especially relating to Arrow interactions:
- Push dynamically generated join filters through UNION, UNNEST and AGGREGATE
- Fix arrow table filters
- [Arrow] Fix scan of an object providing the PyCapsuleInterface when projection pushdown is possible.
- DuckDB Arrow Non Canonical Extensions to use arrow.opaque
- Arrow Extension Type to be registered in DuckDB Extensions
- [Python] Use an ArrowQueryResult in FetchArrowTable when possible.
- Perhaps I don't fully understand what you're saying
Or do you mean that you can't use `query_table` with a macro identifier in place of the table identifier as input (after all, where would you put the macro args)? Perhaps using `query(...)` instead would work?

    CREATE TABLE person (name VARCHAR, age BIGINT);
    INSERT INTO person VALUES ('Alice', 37), ('Ana', 35), ('Bob', 41), ('Bea', 25);
    CREATE OR REPLACE MACRO my_query_table(t) AS TABLE (SELECT * FROM query_table(t));
    SELECT * FROM my_query_table('person');
Or maybe through some other kind of gymnastics: https://duckdb.org/2024/03/01/sql-gymnastics.html

    CREATE OR REPLACE MACRO query_macro(m, i) AS TABLE (
        SELECT * FROM query(CONCAT('SELECT * FROM ', m::STRING, $$($$, i::STRING, $$)$$))
    );
    SELECT * FROM query_macro('my_query_table', 'person');

- I've been eagerly awaiting this for a couple of months now. And I've long wondered why there hasn't been such an implementation/extension for Flight, especially as there are extensions for mysql, postgres, sqlite...
It only seems natural that an extension could be developed for ATTACHing to a Flight server.
Looking forward to watching your talk.
- For read-oriented interactions...
    SELECT * FROM <relation> WHERE ...

where the "relation" is a collection of Arrow streams from a cluster of Flight Servers,

or

    ADBC --sql--> Flight SQL Server --flight--> Flight Servers (plural) --> datasources

Remember that Flight/Flight-SQL is not a query engine replacement, but rather a way to efficiently communicate and exchange data between components with different responsibilities.
- Thank you for all the work you guys do. The Arrow ecosystem is just absolutely incredible.
My few gripes about duckdb interop relate to Arrow scanning/pushdowns. And this extends to interop with other projects like pyiceberg too.
Registering an Arrow Dataset (or pyiceberg scan) as a "duckdb relation" (virtual view) is still a little problematic. Querying these "relations" does not always result in an optimal outcome.
For Arrow datasets, you can intercept the duckdb pushdown, but duckdb will have already "optimized" the plan to its liking, and any scanning restrictions that may have been more advantageous based on the nuances of the dataset might have been lost. Eg:
    WHERE A IN (3, 5, 7)

is presented to the Arrow scanner (pushdown) as "A is between 3 and 7 inclusive" (https://duckdb.org/docs/guides/performance/indexing.html#zon...).

Perhaps in a similar way, turning a pyiceberg scan into a relation for duckdb effectively takes the entire scan and creates an Arrow Table rather than some kind of pushdown/"scan plan" for duckdb to potentially make more efficient with its READ_PARQUET() functionality.
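To make the Arrow-dataset case concrete, here's a minimal sketch (assuming pyarrow and duckdb are installed; the dataset path and relation name are made up) of registering a dataset as a duckdb relation and running that IN-list query against it:

    import duckdb
    import pyarrow as pa
    import pyarrow.dataset as ds

    # write a tiny single-column dataset to parquet (path is arbitrary)
    ds.write_dataset(pa.table({"A": list(range(10))}), "zonemap_demo", format="parquet")

    con = duckdb.connect()
    con.register("my_ds", ds.dataset("zonemap_demo", format="parquet"))

    # the IN-list reaches the Arrow scanner as the min/max range described above
    print(con.execute("SELECT * FROM my_ds WHERE A IN (3, 5, 7)").fetch_arrow_table())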
Most of this is probably dependent on duckdb development, but all of the incredible interop work done across communities/ecosystems so far gives me a lot of confidence that these will soon be matters of the past.
- And all the Arrow parts work together quite nicely.
    ADBC client --> Flight SQL (duckdb/whatever) --> Flight --> ?

The result highlights your exact point: why take your data and transpose it twice?

It's quite an exciting space, with lots of projects popping up around Arrow Flight and duckdb.
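For example, something roughly like this stays in Arrow the whole way (a sketch only; the endpoint URI and table name are placeholders):

    # query a Flight SQL endpoint over ADBC and keep the result as Arrow
    import adbc_driver_flightsql.dbapi as flightsql

    with flightsql.connect("grpc://localhost:31337") as conn:   # placeholder endpoint
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM some_table LIMIT 10")    # placeholder table
            tbl = cur.fetch_arrow_table()                       # no row-wise detour
            print(tbl.schema)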
- Since you're using python, have you looked into sqlglot? I think it has some pretty-print options.
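Something along these lines, if I remember the API right (treat it as a sketch):

    import sqlglot

    raw = "select a, b, count(*) from t where a > 1 group by a, b"
    print(sqlglot.transpile(raw, read="duckdb", write="duckdb", pretty=True)[0])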
- Maybe something like this? https://duckdb.org/2024/10/02/pyodide.html
- Probably also worth mentioning that DuckDB can interact with SQLite dbs.
https://duckdb.org/docs/extensions/sqlite.html https://duckdb.org/docs/guides/database_integration/sqlite.h...
Thus potentially making duckdb an HTAP-like option.
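Roughly like this from Python, if I'm not mistaken (the file name is made up):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL sqlite")                             # usually autoloaded anyway
    con.execute("LOAD sqlite")
    con.execute("ATTACH 'app.sqlite' AS app (TYPE sqlite)")   # hypothetical sqlite file
    print(con.execute("SHOW ALL TABLES").fetchall())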
- Probably not exactly what you mean or asking for, but the work Motherduck is doing looks promising.
https://motherduck.com/blog/differential-storage-building-bl...
Hopefully it finds its way into duckdb's repo some day.
- Ibis looks very promising.
There are also other ways to use both.
https://duckdb.org/docs/guides/python/polars.html
All of this dataframe compatibility is awesome. (much thanks to Arrow and others)
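e.g. a small sketch of the round trip (assuming duckdb, polars and pyarrow are installed):

    import duckdb
    import polars as pl

    df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # duckdb can query the Polars frame in place (via Arrow)...
    rel = duckdb.sql("SELECT y, sum(x) AS total FROM df GROUP BY y")

    # ...and hand the result straight back as a Polars frame
    print(rel.pl())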
- Perhaps not exactly what you're talking about, but maybe? (unsure because WITH statements are sometimes called "temp tables")
https://duckdb.org/docs/sql/query_syntax/with#cte-materializ...
Obviously, the materialization is gone after the query has ended, but still a very powerful and useful directive to add to some queries.
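For reference, the directive looks something like this (a sketch; the CTE and column names are made up):

    import duckdb

    duckdb.sql("""
        WITH totals AS MATERIALIZED (     -- computed once, reused below
            SELECT range % 7 AS bucket, count(*) AS cnt
            FROM range(1000000)
            GROUP BY bucket
        )
        SELECT * FROM totals
        UNION ALL
        SELECT * FROM totals WHERE cnt > 0
    """).show()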
There are also a few DuckDB extensions for pipeline SQL languages.
https://duckdb.org/community_extensions/extensions/prql.html
https://duckdb.org/community_extensions/extensions/psql.html
And of course dbt-duckdb https://github.com/duckdb/dbt-duckdb
- Although only experimental and probably off topic to the discussion, it's worth mentioning DuckDB also provides a Spark API implementation.
https://duckdb.org/docs/api/python/spark_api
And while on the subject of syntax, duckdb also has function chaining
https://duckdb.org/docs/sql/functions/overview.html#function...
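A tiny sketch of both, going by the docs (the data is made up):

    import duckdb
    import pandas as pd
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col, lit

    # Spark-style API backed by duckdb
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]}))
    df = df.withColumn("id_plus_one", col("id") + lit(1))
    df.show()

    # and dot-style function chaining on the SQL side
    duckdb.sql("SELECT ('hello world').upper().string_split(' ') AS parts").show()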
- Not the original parent, so unsure of their use-case. But I've seen the approach where some/basic development can be done on duckdb, before making its way to dev/qa/prd.
Something like your project might enable grabbing data (subsets of data) from a dev environment (seed) for offline, cheap (no SF warehouse cost) development/unit-testing, etc.
- Really awesome project. This would be even better if some of the differential storage functionality makes its way into duckdb.
https://motherduck.com/blog/differential-storage-building-bl...
Unsure if it's even possible, but if there was a way to write out (simulate) the streams to WAL files, you might be able to accomplish the same without having to consolidate the duckdb file every time.
A couple of other ideas that may or may not diverge from your project's purpose or simplicity...
It's also too bad that duckdb isn't included in the out-of-the-box Snowpark packages, which would potentially allow you to run all of this in a Snowflake procedure/DAG and persist the duckdb file to object storage. You could of course achieve this with Snowflake containers. (But this would probably ruin any plans to support Redshift/BigQuery/etc. as sources.)
If the source tables were Iceberg, then you could have an even simpler duckdb file to produce/maintain that would just wrap views around the Iceberg tables using the duckdb iceberg extension.
    create view x as select * from iceberg_scan('metadata_file');

If you're going to enable change tracking on tables, then perhaps another solution is to use CHANGES to select the data from the table instead of reading streams. That way you could use the same timestamp across all the selects and get better consistency between your materializations to duckdb.
    set ts = {last_materialization_ts};
    select * from X CHANGES AT(timestamp=>$ts);
    select * from Y CHANGES AT(timestamp=>$ts);

    -- triggered task to kick off materialization to duckdb
    create task M
      when SYSTEM$HAS_DATA(X_STREAM) AND SYSTEM$HAS_DATA(Y_STREAM) AND ...
      AS CALL MY_MATERIALIZE(); -- send SNS, invoke ext-func, etc

Here's another similar (but different) project that I've been tracking: https://github.com/buremba/universql

- Perhaps similar to https://github.com/duckdb/dbt-duckdb , but with SQLMesh instead of dbt, obviously.
It should be extremely simple for databases that support ADBC (for example Snowflake, PostgreSQL).
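For the ADBC-capable ones, the result already comes back as Arrow; a rough sketch (the connection URI is a placeholder):

    # fetch Arrow directly from e.g. PostgreSQL via ADBC
    import adbc_driver_postgresql.dbapi as pg

    with pg.connect("postgresql://localhost:5432/mydb") as conn:   # placeholder DSN
        with conn.cursor() as cur:
            cur.execute("SELECT 1 AS x")
            print(cur.fetch_arrow_table())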
For others it might just be a matter of mapping DDL, DML, DQL, etc. to a supported database protocol driver (JDBC, ODBC, etc). Of course this is where things may get challenging, as it would become the responsibility of your server to convert results to Arrow (tables/streams/etc). But this could potentially be delegated to "worker" Flight servers (not a Flight SQL server), and then the server could return/forward their Arrow results (Flight results).
Of course some of this is to some degree already possible through DuckDB's MySQL/Postgres Extensions.
I imagine this could also be useful for developing/testing locally?
It might also provide a way to interchange databases while potentially easing database migrations (vendor to vendor) if ADBC isn't supported by the vendor.
Another potential value-add could be SQL dialect management, by providing Substrait conversions (or sqlglot, but it looks like the server is Java, so unsure if that's possible; maybe Graal?).
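For the sqlglot side of that idea, dialect conversion is roughly (a sketch; the statement is arbitrary):

    import sqlglot

    sql = "SELECT strftime(d, '%Y-%m-%d') AS day, count(*) FROM events GROUP BY 1"
    print(sqlglot.transpile(sql, read="duckdb", write="snowflake", pretty=True)[0])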