Are you not using a caching registry mirror, and instead pulling the same image from Hub for each runner? If so, that seems like an easy win to add, unless you specifically do mostly hot/unique pulls.
The more efficient answer to those rate limits is almost always to pull fewer times for the same work, rather than scaling in a way that circumvents them.
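For anyone wondering what that looks like in practice, a minimal sketch (the mirror hostname is a placeholder): each runner just needs "registry-mirrors" set in /etc/docker/daemon.json, e.g. via a provisioning snippet like this, and the mirror side can be a stock registry:2 container started with REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io (the documented pull-through cache mode).

    # Sketch: point each runner's dockerd at an internal pull-through cache
    # so repeated pulls of the same image never reach Docker Hub.
    # "https://mirror.ci.internal" is an assumed/placeholder hostname.
    import json, pathlib

    DAEMON_JSON = pathlib.Path("/etc/docker/daemon.json")

    def add_registry_mirror(mirror="https://mirror.ci.internal"):
        cfg = json.loads(DAEMON_JSON.read_text()) if DAEMON_JSON.exists() else {}
        mirrors = cfg.setdefault("registry-mirrors", [])
        if mirror not in mirrors:
            mirrors.append(mirror)
        DAEMON_JSON.write_text(json.dumps(cfg, indent=2))
        # restart dockerd afterwards (systemctl restart docker) to pick it up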
From a performance / efficiency perspective, we generally recommend using ECR Public images[0], since AWS hosts mirrors of all the "Docker official" images, and throughput to ECR Public is great from inside AWS.
Any pulls that go this route also cost nothing against Docker Hub's limits.
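A rough sketch of what the switch amounts to; the mapping below only covers bare official-library references (anything with its own registry or namespace is left alone), since those are the images AWS mirrors under public.ecr.aws/docker/library/:

    # "alpine:3.20" -> "public.ecr.aws/docker/library/alpine:3.20"
    def to_ecr_public(image: str) -> str:
        # Only rewrite bare official-library references (no registry, no namespace).
        if "/" in image.split(":", 1)[0]:
            return image
        return f"public.ecr.aws/docker/library/{image}"

    assert to_ecr_public("alpine:3.20") == "public.ecr.aws/docker/library/alpine:3.20"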
Any cache you put between Docker Hub and your own infra would probably be S3-backed anyway, so adding another cache in between could be mostly a waste.
I'm slightly old; is that the same thing as a ramdisk? https://en.wikipedia.org/wiki/RAM_drive
The last bit (emphasis added) sounds novel to me; I don't think I've heard of anybody doing that before. It sounds like an almost-"free" way to get a ton of performance ("almost" because somebody has to figure out the sizing, though I bet you could automate that by having your tool export a "desired size" metric equal to the high watermark of tmpfs-like storage used during the CI run).
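A hedged sketch of that automation, assuming the RAM-backed storage is mounted at /mnt/ramdisk and that you can attach a metric or artifact to the job (the mount point and metric name are made up):

    import os, threading

    def watch_high_watermark(mount, stop_event, interval=1.0):
        """Poll a tmpfs-like mount and remember the peak bytes used."""
        peak = 0
        while not stop_event.is_set():
            st = os.statvfs(mount)
            peak = max(peak, (st.f_blocks - st.f_bfree) * st.f_frsize)
            stop_event.wait(interval)
        # Emit the watermark so the tooling can size the next run's ramdisk.
        print(f"ci_ramdisk_desired_size_bytes {peak}")

    stop = threading.Event()
    t = threading.Thread(target=watch_high_watermark, args=("/mnt/ramdisk", stop), daemon=True)
    t.start()
    # ... run the CI job ...
    stop.set(); t.join()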
The Linux page cache exists to speed up access to the durable store, i.e. the underlying block device (NVMe, SSD, HDD, etc.).
The RAM-backed block device in question here is more like tmpfs, but with an ability to use the disk if, and only if, it overflows. There's no intention or need to store its whole contents on the durable "disk" device.
Hence you can do things entirely in RAM as long as your CI/CD job can fit all the data there, but if it can't fit, the job just gets slower instead of failing.
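For a concrete feel of that behaviour, one off-the-shelf approximation (not the block-level device being described, but the same overflow semantics) is a tmpfs sized past physical RAM with swap on the local NVMe as the spill path. Paths and sizes below are made up:

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def mount_overflowing_ramdisk(mount_point="/mnt/build", size="48g",
                                  swapfile="/var/swapfile-ci", swap_mib=32768):
        # Swap on the local NVMe gives the kernel somewhere to push cold tmpfs pages.
        run("dd", "if=/dev/zero", f"of={swapfile}", "bs=1M", f"count={swap_mib}")
        run("chmod", "600", swapfile)
        run("mkswap", swapfile)
        run("swapon", swapfile)
        # tmpfs itself lives in RAM; pages only spill to swap under memory pressure,
        # so the job slows down instead of failing when it outgrows RAM.
        run("mkdir", "-p", mount_point)
        run("mount", "-t", "tmpfs", "-o", f"size={size}", "tmpfs", mount_point)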
Consider a scenario where your VM has 4GB of RAM and your build touches 16GB worth of files over its lifetime, but its active working set at any moment is only around 2GB. If you preload all of your Docker images at the start of the build, they'll initially sit in the page cache. As the build progresses, though, the kernel will start evicting those cached images to make room for whatever was touched most recently, even files that are read infrequently or only once. And that's the key bit: you want to force caching of the files you know are accessed more than once.
By implementing your own caching layer, you gain explicit control, allowing critical data to remain persistently cached in memory. In contrast, the kernel-managed page cache treats cached pages as opportunistic, evicting the least recently used pages whenever new data must be accommodated, even if this new data isn't frequently accessed.
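A minimal sketch of that kind of explicit control, assuming you know up front which paths get re-read: stage them onto tmpfs (/dev/shm here) so they stay resident no matter what else the build streams through. The path list is hypothetical, and tools like vmtouch can alternatively lock files in place with mlock.

    import pathlib, shutil

    HOT_PATHS = ["/opt/ci/toolchain", "/opt/ci/base-images"]  # hypothetical reused inputs

    def stage_on_tmpfs(paths=HOT_PATHS, dest="/dev/shm/ci-hot"):
        # /dev/shm is a tmpfs; copies here live in RAM (they can still be
        # swapped if swap exists, but they are never just dropped like
        # ordinary page-cache pages).
        dest = pathlib.Path(dest)
        dest.mkdir(parents=True, exist_ok=True)
        for p in map(pathlib.Path, paths):
            target = dest / p.name
            if p.is_dir():
                shutil.copytree(p, target, dirs_exist_ok=True)
            else:
                shutil.copy2(p, target)
        return dest  # point the build at this copy instead of the originals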
I believe various RDBMSs bypass the page cache and use their own strategies for managing caching if you give them access to raw block devices, right?
Every Linux kernel does that already. I currently have 20 GB of disk cached in RAM on this laptop.
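(That figure comes straight from the "Cached" line in /proc/meminfo; a quick way to read it:)

    def page_cache_bytes():
        # "Cached" is the page cache (file data, incl. tmpfs/shmem), reported in kB.
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Cached:"):
                    return int(line.split()[1]) * 1024
        raise RuntimeError('"Cached:" not found in /proc/meminfo')

    print(f"{page_cache_bytes() / 2**30:.1f} GiB of file data cached in RAM")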
- Configuring a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
- Benchmarking EC2 instance types (m7a is the best x86 today, m8g is the best arm64)
- "Warming" the root EBS volume by accessing a set of priority blocks before the job starts to give the job full disk performance [0]
- Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and per-IP rate limits (e.g. Docker Hub's) rarely apply
- Configuring Docker with containerd/estargz support
- Just generally turning off kernel options and unit files that aren't needed
[0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initial...
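Re the warming point: a hedged sketch of the idea, with a made-up device path and block list. The read itself is the point; it forces EBS to hydrate those blocks from S3 before the job needs them.

    import os

    DEVICE = "/dev/nvme0n1"  # root EBS volume on a Nitro instance (assumption)
    PRIORITY_RANGES = [(0, 64 << 20), (10 << 30, 1 << 30)]  # (offset, length) in bytes, made up

    def warm_blocks(device=DEVICE, ranges=PRIORITY_RANGES, chunk=1 << 20):
        fd = os.open(device, os.O_RDONLY)
        try:
            for offset, length in ranges:
                end = offset + length
                while offset < end:
                    # Data is discarded; reading is enough to pull the block in.
                    os.pread(fd, min(chunk, end - offset), offset)
                    offset += chunk
        finally:
            os.close(fd)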