
Since the limit you ran into was the number of open files, could you just raise that limit? I get blocking the spammy traffic, but theoretically could you have handled more if that limit had been raised?
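For reference, the nginx side of raising that limit is a single directive; a sketch with an arbitrary number, and the OS-level limits still have to allow it:

    # Sketch: let each nginx worker hold up to 100k open file descriptors
    # (the OS hard limit must permit this; 100000 is an arbitrary example value).
    worker_rlimit_nofile 100000;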

hyperknot
I've just written my question to the nginx community forum, after a lengthy debugging session with multiple LLMs. Right now, I believe it was the combination of multi_accept + open_file_cache > worker_rlimit_nofile.

https://community.nginx.org/t/too-many-open-files-at-1000-re...
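Roughly, the combination being described looks like this in nginx.conf; the directive names are real, but the numbers below are made up purely to illustrate the mismatch and are not taken from the actual config:

    # Illustrative only: an FD cache sized far beyond the per-worker FD limit.
    worker_rlimit_nofile 10000;              # ceiling on open FDs per worker

    events {
        multi_accept on;                     # each worker accepts every pending connection at once
    }

    http {
        # up to 200k cached descriptors against a 10k limit -> "too many open files" under load
        open_file_cache max=200000 inactive=60m;
    }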

Also, the servers were doing 200 Mbps, so I couldn't have kept up _much_ longer, no matter the limits.

toast0
I'm pretty sure your open file cache is way too large. If you're doing 1k req/sec and you cache file descriptors for 60 minutes, assuming those are all unique files, that's asking for roughly 3.6 million FDs to be cached when you've only got 1 million available. I've never used nginx or open_file_cache [1], but I would tune it way down and see if you even notice a difference in performance in normal operation. Maybe 10k files, 60s timeout.
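In config terms, that suggestion is roughly the following (real directives; the numbers are just the starting point suggested above, not benchmarked values):

    # Much smaller FD cache: 10k entries, dropped after 60s without a hit.
    open_file_cache        max=10000 inactive=60s;
    open_file_cache_valid  60s;   # how often to revalidate a cached entry (nginx default)
    open_file_cache_errors off;   # don't cache open() errors (nginx default)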

> Also, the servers were doing 200 Mbps, so I couldn't have kept up _much_ longer, no matter the limits.

For cost reasons or system overload?

If system overload... What kind of storage? Are you monitoring disk I/O? What kind of CPU do you have in the system? I used to push almost 10 Gbps of HTTPS on dual E5-2690s [2], but that was a larger file. The 2690s were high end, but anything more modern will have much better AES acceleration and should do better than 200 Mbps almost regardless of what it is.

[1] To be honest, I'm not sure I understand the intent of open_file_cache... Opening files is usually not that expensive; maybe at hundreds of thousands of rps, or if you have a very complex filesystem. PS: don't put tens of thousands of files in one directory. Everything works better if you take your ten thousand files and put one hundred files into each of one hundred directories. You can experiment to see what works best with your load, but a tree with N layers of M directories where the last layer holds M files is a good plan, with 64 <= M <= 256. The goal is keeping the directories compact so searching and updating them stays efficient.
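If the URLs need to stay flat while the files get sharded on disk, here is one hedged sketch of doing the mapping in nginx itself, assuming names of at least five characters whose leading characters are well distributed (the /files prefix and /srv/files root are made-up examples):

    # Sketch: serve /files/abcdef.png from /srv/files/ab/cd/abcdef.png,
    # i.e. two directory layers of up to 256 entries each for hex-ish names.
    location ~ ^/files/((..)(..).+)$ {
        root /srv/files;
        try_files /$2/$3/$1 =404;
    }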

[2] https://www.intel.com/content/www/us/en/products/sku/64596/i...

CoolCold
> [1] To be honest, I'm not sure I understand the intent of open_file_cache... Opening files is usually not that expensive

I may have a hint here. Remember that nginx was created back when dial-up was still a thing and a single Pentium 3 server was the norm (I believe I saw those wwwXXX machines in the Rambler DCs around that time).

So my somewhat educated guess is that saving every syscall was more or less the ultimate goal, and it was more efficient back then, at least in terms of latency. You can take a look at how nginx parses HTTP methods (GET/POST) to save operations.

Personally, I don't remember seeing large benefits from open_file_cache, but I likely never did a proper perf test. Making sure sendfile, buffers, and TLS termination were set up properly made much more of a difference for me on modern (10-15 year old) hardware.
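For reference, the settings being alluded to are along these lines (all real directives, shown with typical values as an illustration rather than a recommendation for this particular server):

    # Common static-file serving knobs in the http block.
    sendfile          on;   # copy file -> socket in the kernel, no userspace buffering
    tcp_nopush        on;   # with sendfile: coalesce headers and file start into full packets
    keepalive_timeout 65;   # reuse client connections instead of re-handshaking TLS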

Aeolun
If you do 200 Mbps on a Hetzner server after Cloudflare caching, you are going to run out of traffic pretty quickly. The limit is 20 TB/month (which you'd reach in roughly 9 days).
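(For the arithmetic: 200 Mbps ≈ 25 MB/s ≈ 2.16 TB/day, so a 20 TB allowance lasts a little over 9 days at that rate.)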
CoolCold
You are probably talking about VMs; those do have traffic limits. Dedicated servers, on the other hand, with the default 1 Gbit NICs, don't (at least not until you consume 80%+ of the bandwidth for months on end).

Quoting:

> Traffic

> All root servers have a dedicated 1 GBit uplink by default and with it unlimited traffic. Inclusive monthly traffic for servers with 10G uplink is 20TB. There is no bandwidth limitation. We will charge € 1 ($1.20)/TB for overusage.

Aeolun
Huh, this must have changed after I concluded my contract with them (several years ago).

Huh, archive.org tells me it's been unlimited since at least 5 years ago, so I guess I must've just seen someone mention 20TB and felt it was a reasonable limit :)

johnisgood
One would think services like these wouldn't have to rely on online services and would have their own rack of servers. Or is that so alien these days?
Small addition: that limit applies to Hetzner Cloud servers; their dedicated servers have unlimited traffic.
Aeolun
Depends on your connection, I think. Mine do 1 Gbit/s but have a 20 TB limit. The 100 Mbit ones are unlimited (last I checked).
Yes, it does. 1 Gbit/s dedicated servers have unlimited traffic; 10 Gbit/s servers have 20 TB of traffic included. I'm not aware of a 100 Mbit/s offer. Not sure which offer you are using, but it sounds like a Hetzner Cloud server, not a dedicated server.
ndriscoll
One thing that might work for you is to actually make the empty tile file, and hard link it everywhere it needs to be. Then you don't need to special case it at runtime, but instead at generation time.

NVMe disks are incredibly fast and 1k rps is not a lot (IIRC my N100 seems to be capable of ~40k if not for the 1 Gbit NIC bottlenecking). I'd try benchmarking without the tuning options you've got. Do you actually get 40k concurrent connections from Cloudflare? If connections to your upstream are kept alive (so no constant slow starts), ideally you have numCores workers that each do one thing at a time, and that's enough to max out your NIC. You only add concurrency if latency prevents you from maxing out bandwidth.
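As a baseline for that kind of benchmark, the mostly-default setup being described is roughly (real directives, illustrative values):

    # Baseline: one worker per core, default accept behaviour, modest connection count.
    worker_processes auto;           # one worker per CPU core

    events {
        worker_connections 1024;     # plenty when the CDN keeps connections alive
        # multi_accept defaults to off: accept one connection per event-loop wake-up
    }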

hyperknot
Yes, that's a good idea. But we are talking about 90+% of the tiles being empty (I might be wrong on that), so that's a lot of hard links. I think the nginx config just needs to be fixed; I hope I'll receive some help on their forum.
ndriscoll
You could also try turning off the file descriptor cache. Keep in mind that NVMe SSDs can do ~30-50k random reads/second with no concurrency, and at least hundreds of thousands with concurrency, so even if every request hit the disk 10 times it should be fine. There's also kernel caching, which I think covers some of what you'd get from nginx's metadata cache?
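That would just be the following (open_file_cache is off by default, so removing the directive has the same effect):

    # Disable nginx's FD/metadata cache and lean on the kernel page cache instead.
    open_file_cache off;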
justinclift
> so I couldn't have kept up _much_ longer, no matter the limits.

Why would that kind of rate cause a problem over time?
