Comment by kblissett - Hacker Neue

kblissett 3 days ago parent

One of the big advantages people enjoy is the elimination of the network latency between the application server and the DB. With SQLite your DB is right there often directly attached over NVME. This improves all access latencies and even enables patterns like N+1 queries which would typically be considered anti-patterns in other DBs.

pbowyer 3 days ago

> One of the big advantages people enjoy is the elimination of the network latency between the application server and the DB. With SQLite your DB is right there often directly attached over NVME.

You can install MySQL/PostgreSQL on the application server, connect over a unix socket and get the same benefits as if you'd used SQLite on the application server (no network latency, fast queries). Plus the other benefits that come from using these database servers (Postgres extensions, remote connections, standard tooling etc). I'm guessing more RAM is required on the application server than if you used SQLite but I haven't benchmarked it.

bob1029 3 days ago

Unix sockets don't actually give you the same benefit. You're still doing IPC which can incur substantial memory subsystem utilization. SQLite is on the same thread/core as whatever is using it.

pbowyer 1 day ago

TIL. Thanks!

turnsout 3 days ago

Real talk, how do you actually avoid N+1? I realize you can do complicated JOINs, but isn't that almost as bad from a performance perspective? What are you really supposed to do if you need to, e.g. fetch a list of posts along with the number of comments on each post?

simonw 3 days ago

Often you can use joins to get the data in a single complex SQL query. Number of comments for a post is relatively straight-forward, but you can also do increasingly complex associated data fetches with modern databases.

In particular, JSON aggregations mean you can have a single query that does things like fetch a blog entry and the earliest 10 comments in a single go. I wrote up some patterns for doing that in SQLite and PostgreSQL here: https://github.com/simonw/til/blob/main/sqlite/related-rows-...

Here's an example PostgreSQL query that does this to fetch tags for posts: https://simonwillison.net/dashboard/json-agg-example

  select
    blog_entry.id,
    title,
    slug,
    created,
    coalesce(json_agg(json_build_object(blog_tag.id, blog_tag.tag)) filter (
      where
        blog_tag.tag is not null
    ), json_build_array()) as tags
  from
    blog_entry
    left join blog_entry_tags on blog_entry.id = blog_entry_tags.entry_id
    left join blog_tag on blog_entry_tags.tag_id = blog_tag.id
  group by
    blog_entry.id
  order by
    blog_entry.id desc

The alternative, more common path is the pattern that Django calls "prefetch_related". Effectively looks like this:

  select id, title, created from posts order by created desc limit 20

  -- Now extract the id values from that and run:

  select
    blog_entry.id,
    blog_tag.tag
  from
    blog_entry
    join blog_entry_tags on blog_entry.id = blog_entry_tags.entry_id
    join blog_tag on blog_entry_tags.tag_id = blog_tag.id
  where
    blog_entry.id in (?, ?, ?, ...)

  -- Now you can re-assemble the list of tags for
  -- each entry in your application logic

Once you have a list of e.g. 20 IDs you can run a bunch of cheap additional queries to fetch extra data about all 20 of those items.

crazygringo 3 days ago

> I realize you can do complicated JOINs, but isn't that almost as bad from a performance perspective?

No, JOINs should be orders of magnitude faster.

> What are you really supposed to do if you need to, e.g. fetch a list of posts along with the number of comments on each post?

You're really supposed to do a JOIN, together with a GROUP BY and a COUNT(). This is elementary SQL.

joevandyk 3 days ago

it gets more complicated when you need to also display something like "last comment: <author> <3 days ago>" for each post, or if the comment counts need to be filtered by various flags/states/etc.

of course, it's all possible with custom SQL but it gets complicated quick.

crazygringo 3 days ago

That's like saying it gets more complicated when you have to use loops with break statements in programming. It's just what programming is.

The filtering you describe is trivial with COUNT(flag IN (...) AND state=...) etc.

If you want to retrieve data on the last comment, as opposed to an aggregate function of all comments, you can do that with window functions (or with JOIN LATERAL for something idiomatic specifically to Postgres).

Learning how to do JOIN's in SQL is like learning pointers in C -- in that it's a basic building block of the language. Learning window functions is like learning loops. These are just programming basics.

joevandyk 3 days ago

Yes, all that’s possible. But it’s not straightforward in the majority of ORMs.

2 More Comments →

simjnd 3 days ago

AFAIK the problem of N+1 isn't necessarily one more DB query, but one more network roundtrip. So if for each page of your app you have an API endpoint that provides exactly all of the data required for that page, it doesn't matter how many DB queries your API server makes to fulfill that request (provided that the API server and the DB are on the same machine).

This is essentially what GraphQL does instead of crafting each of these super tailored API endpoints for each of your screens, you use their query language to ask for the data you want, it queries the DB for you and get you the data back in a single network roundtrip from the user perspective.

(Not an expert, so I trust comments to correct what I got wrong)

jerrygenser 3 days ago

You still have to write the resolver for graphql. I've seen. N+1 with graphql if you don't actually use data loader+batch pattern OR if you use it incorrectly.

toast0 3 days ago

A proper join is the right answer. But, it's not always possible to make those run well. [1] A "client side join" in the right situation can be much better, but then you probably want to do a 1+1 rathet than N+1. Do the first query to get the ids for the second query, and then construct the second query with IN or UNION depending on what works best for you database. UNION likely bloats your query string, but I've seen plenty of situations where UNION is gobs faster than IN.

Alternately, if you can separate query issuance from result parsing, you can make N+1 palletable. Ex, do your query to get the ids, wait for the results, loop and issue the N queries, then loop and wait for results in order. That will be two-ish round trips rather than N+1 round trips. But you have to search to find database apis that allow that kind of separation.

[1] You can almost always express the query you want in SQL, but that doesn't mean it will have a reasonable runtime. Sometimes server side join and client side join have about the same runtime... if it's significant and you have the usual case where clients are easier to scale than database servers, it might be worthwhile to have the join run on the client to reduce server load.

immibis 3 days ago

The performance problem in N+1 is (mostly) not in fetching the N rows from disk, but rather from multiplying the network latency by a factor of N. Joins solve this; so do stored procedures.

In general, you want to ask the remote server once for all the data you need, then read all the results. It applies to databases as well as APIs.

Pipelined requests also solve the problem and can be more flexible.

Also, joins can be optimised in different ways. Sometimes the optimal way to do a join isn't to query each row one-by-one, but to do something like (when the rows you want are a large fraction of the rows that exist) making a bloom filter of the rows you want and then sequentially reading all the rows in the table.

t0mas88 3 days ago

A JOIN is fast, fetching the whole list in one extra query with "WHERE id IN (...)" is also pretty fast and results in less complex queries if you have serval of these. Doing all queries separate is slow because of the network round-trip for each query.

vhcr 3 days ago

No, JOINs are pretty much always faster than performing N+1 queries.

jamie_ca 3 days ago

Either joins for a fat query, or aggregate the subqueries.

For the latter, it's along the lines of `select * from posts where ...` and `select * from authors where id in {posts.map(author_id)}`. And then once it's in memory you manually work out the associations (or rely on your ORM to do it).

erpellan 3 days ago

You do indeed use JOINS. The goal is to retrieve exactly the data you require in a single query. Then you get the DB to `EXPLAIN VERBOSE` or similar and ensure that full table scans aren't happening and that you have indexed the columns the query is being filtered on.

asalahli 3 days ago

Avoiding N+1 doesn't have to mean limiting yourself to 1 query. You can still fetch the posts in one query and the comments of _all_ posts in a separate query, just don't issue a query for _each_ post.

More formally, the number of queries should be constant and not linearly scaling with the number of rows you're processing.

bccdee 3 days ago

The actual thing that we're getting N+1 of is network round-trips. An additional network round-trip is way, way slower than an extra JOIN clause. That's why N+1 query patterns aren't a problem when you're using a local database: There's no round-trip.

ComputerGuru 3 days ago

A well-written JOIN against a well-designed database (regardless if we're talking postgres, SQLite, MySQL/MariaDB, or MS SQL) should not be slow. If it's slow, you're using it wrong.

This item has no comments currently.