We've been hard at work for a few weeks and thought it's time for another update.
In case you missed our first post, PostgresML is an end-to-end machine learning solution, running alongside your favorite database.
This time we have more of a suite offering: project management, visibility into the datasets, and the decision making behind the deployment pipeline.
Let us know what you think!
Demo link is on the page, and also here: https://demo.postgresml.org
For anything substantive it seems like a bad idea to run this on your primary store since the last thing you want to do is eat up precious CPU and RAM needed by your OLTP database. But in a data warehouse or similar replicated setup, it seems like a really neat idea.
There’s nothing stopping you from reading database table structures directly into memory in Python or R now. You don’t need an intermediate data store.
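E.g. something like this — a minimal sketch, where the connection string and the "customers" table are made up:

```python
# Minimal sketch: pull a Postgres table straight into memory for modeling.
# The connection string and "customers" table are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Any SELECT goes directly into a DataFrame -- no intermediate export step.
df = pd.read_sql("SELECT * FROM customers", engine)
print(df.head())
```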
I agree that running training on production instances would be a bad idea. First, you need to denormalise data for ML, and secondly you typically don’t want your training data to be constantly changing.
Chances are a DBA wouldn’t consider letting you do data engineering in a live production database anyway, so this really is all academic.
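Concretely, what I'd usually do for both reasons is flatten the normalized tables with a join and freeze a snapshot, rather than training on live rows. A rough sketch, with table and column names made up:

```python
# Sketch: denormalize OLTP tables into a flat training set and freeze it as
# a snapshot, so the training data stops changing underneath you.
# Host, table, and column names are made up for illustration.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@replica-host:5432/mydb")

query = """
SELECT o.customer_id,
       c.signup_channel,
       COUNT(*)     AS order_count,
       SUM(o.total) AS lifetime_value
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY o.customer_id, c.signup_channel
"""

training_df = pd.read_sql(query, engine)

# Write an immutable snapshot; train against the file, not the live tables.
training_df.to_parquet("training_snapshot_2022_05.parquet", index=False)
```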
In a way, having to pull things out of the DB and into Python is something that requires change attribution on its own, since it's a conversion between abstractions, which I tend to think of as a lossy process (even if it isn't). This sort of tooling keeps the abstractions localized, so it's much easier to maintain a mental model of model changes.
I get the concern but sometimes I really just do want a black box regressor or classifier. Model performance monitoring is important, but I don't care about attribution.
> Chances are a DBA wouldn’t consider letting you do data engineering in a live production database anyway, so this really is all academic.
Maybe it isn't data engineering, but I'm curious what you'd call using Google's BigQuery ML? "BigQuery ML enables users to create and execute machine learning models in BigQuery by using standard SQL queries."
I haven't used it in production, but I'd use it in a heartbeat if I was on BigQuery.
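If it helps, this is roughly what that looks like from the Python client; the dataset, table, and column names here are made up:

```python
# Rough sketch of training and using a BigQuery ML model from Python.
# Dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Training happens inside the warehouse, expressed as SQL.
client.query("""
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `mydataset.customers`
""").result()

# Predictions are also just a query.
rows = client.query("""
SELECT * FROM ML.PREDICT(MODEL `mydataset.churn_model`,
                         (SELECT tenure_months, monthly_spend, support_tickets
                          FROM `mydataset.customers`))
""").result()
for row in rows:
    print(dict(row))
```

The nice part is that both training and prediction stay inside the warehouse as plain SQL.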
https://demo.postgresml.org/models/1 https://demo.postgresml.org/models/15
The short term goal would be to expose more metrics from the toolkit.
Can you explain the differences with https://madlib.apache.org/? Wouldn't an OLAP DB be better suited than PG for this kind of workload?
Does being a PostgreSQL module make it compatible with Citus, Greenplum, or Timescale?
OLAP use cases often involve a lot of extra complexity out of the gate, and something we're targeting is to help startups maintain the simplest possible tech stack early on while they are still growing and exploring PMF. At a high enough level, it should just work with any database that supports Postgres extensions, since it's all just tables going into algos, but the devil in big data is always in evaluating the performance tradeoffs for the different workloads. Maybe we'll eventually need an "enterprise" edition.
I've used Madlib in the past and although it was 'successful', the constraint was unfamiliarity with the library from our data scientists, who preferred the classic Python libraries.
This is the most exciting ML-related project I've seen in a while, mainly because the barrier to entry seems low: anyone with a PG database could apply a model to their data using PostgresML, if I understood the premise correctly.
Most of the comments here seem to be about separating the compute from the database machine, which it seems isn't possible right now with PostgresML. But the GitHub README says at the start:
> The system runs Postgres with the pgml-extension installed on port 5433 by default, *just in case you happen to be running Postgres already*:
I think the second part needs to be clarified better. Is it installing the PGML extension on a machine running an existing PG database and connecting to it, or does it just mean starting the Postgres session of the PGML Docker package?

In the end though, it'll be important to have benchmarks for all the key steps in the process, both in terms of memory and compute. Off a hunch, I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.
> I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.
Indeed. Not only memory but also inefficiency related to Python itself. It would be great if feature engineering pipelines could be pushed down to lower layers as well. But for now, the usability of Python is still unparalleled.
The animated GIF on your homepage moves a little bit too fast for me to follow.
There is deeper explanation in the README: https://github.com/postgresml/postgresml
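For a quick feel of the interface, here's a rough sketch of the workflow based on the README. I'm paraphrasing the pgml.train/pgml.predict calls from memory, so check the docs for the exact signatures:

```python
# Rough sketch of the PostgresML workflow from Python, paraphrasing the
# README's pgml.train / pgml.predict calls; exact signatures may differ.
import psycopg2

conn = psycopg2.connect("postgresql://postgres@localhost:5433/pgml_development")
cur = conn.cursor()

# Train: PostgresML snapshots the table, fits the algorithm, and records
# metrics in its own schema. Project, table, and column names are made up.
cur.execute(
    "SELECT * FROM pgml.train(%s, %s, %s, %s)",
    ("My Project", "regression", "my_training_table", "target_column"),
)
print(cur.fetchall())
conn.commit()

# Predict with whatever model is currently deployed for that project.
cur.execute("SELECT pgml.predict(%s, ARRAY[1.0, 2.0, 3.0])", ("My Project",))
print(cur.fetchone())
```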
So, I really like this idea, and about 3 years ago I seriously thought about starting this exact thing as well. I went ahead and built a specific data company (so not a tooling one), and now I don't like this idea anymore.
To me this is a lot like proposing: "let's get rid of REST APIs and GraphQL and connect the frontend directly to the DB" (ignoring security issues for a bit).
In frontend work, the view in which you want to display your data is different from how it should be stored. It's exactly the same in ML: the view your data can be trained / predicted on is very different from how it should be stored.
They are connected, but IMO there always has to be a transformation layer. (And Python is just a much better way to do that transformation, but that's another story.)
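Even a trivial case shows what I mean by the transformation layer; all the names in this sketch are made up:

```python
# Tiny sketch of a "transformation layer": stored rows are categorical and
# unscaled, but the model wants a flat numeric matrix. Names are made up.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

stored = pd.DataFrame({
    "plan":        ["free", "pro", "pro", "enterprise"],
    "seats":       [1, 5, 12, 200],
    "days_active": [3, 90, 400, 720],
})

transform = ColumnTransformer([
    ("plan",    OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ("numeric", StandardScaler(), ["seats", "days_active"]),
])

# The trainable view: one-hot columns plus scaled numerics.
X = transform.fit_transform(stored)
print(X)
```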
Maybe enhance this with an FDW to an external inference process, to allow triggering inference from PostgreSQL itself.
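The external inference process itself could be as small as an HTTP endpoint that Postgres reaches through the FDW or a trigger. A toy sketch, where everything (route, payload shape, "model") is hypothetical:

```python
# Hypothetical sketch of the external inference process: a tiny HTTP endpoint
# returning predictions, which Postgres could call via an FDW or trigger.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in for a real model, e.g. a loaded scikit-learn estimator.
    return sum(features) / max(len(features), 1)

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = {"prediction": predict(payload.get("features", []))}
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), InferenceHandler).serve_forever()
```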
In the opposite direction of bespoke fully custom models being the norm, I'd like to build more "rails" for ML. Hopefully we can expose enough hyperparams, even for deep learning architectures, and automatically adapt the inputs with configurable transformers so that we cover 90% of the custom use cases out of the box.
Which is part of what is so awesome about PostgreSQL, everyone can build extensions that look and feel native and can do almost anything.
E.g. Postico uses an elephant but not the _same_ elephant - https://eggerapps.at/postico/