
Just a note - you can very much limit CPU usage on Docker containers by setting `--cpus="0.5"` (or `cpus: 0.5` in Docker Compose) if you expect it to be a very lightweight container. This isolation can help prevent one rowdy container from dragging down the rest of the system, regardless of whether it's crypto-mining malware, a DDoS attempt, or a misbehaving service.
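For reference, it's roughly this in a compose file (service name and image are just placeholders):

```yaml
services:
  worker:                        # placeholder service name
    image: example/worker:1.0    # placeholder image
    cpus: "0.5"                  # cap the container at half a CPU core
```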

Another is running containers in read-only mode; assuming they support that configuration, it removes a lot of potential attack surface.
Never looked into this. I would expect the majority of images would fail in this configuration. Or am I unduly pessimistic?
Many fail if you do it without any additional configuration. In Kubernetes you can mostly get around it by mounting `emptyDir` volumes to the specific directories that need to be writable, `/tmp` being a common culprit. If they need to be writable and have content that exists in the base image, you'd usually mount an emptyDir to `/tmp` and copy the content into it in an `initContainer`, then mount the same `emptyDir` volume to the original location in the runtime container.
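To make that concrete, roughly what the pattern looks like (pod name, image and paths here are made up for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readonly-example            # hypothetical pod name
spec:
  initContainers:
    - name: seed-config
      image: example/app:1.0        # placeholder image
      # copy the content baked into the image into the writable volume
      command: ["sh", "-c", "cp -a /etc/app/. /seed/"]
      volumeMounts:
        - name: app-config
          mountPath: /seed
  containers:
    - name: app
      image: example/app:1.0
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp           # the common culprit
        - name: app-config
          mountPath: /etc/app       # same emptyDir, back at the original path
  volumes:
    - name: tmp
      emptyDir: {}
    - name: app-config
      emptyDir: {}
```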

Unfortunately, there is no way to specify those `emptyDir` volumes as `noexec` [1].

I think the docker equivalent is `--tmpfs` for the `emptyDir` volumes.

1: https://github.com/kubernetes/kubernetes/issues/48912
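For example, with plain `docker run` the tmpfs mount can even carry `noexec`, which is exactly what the issue above is asking for (image name is a placeholder):

```
docker run --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  example/app:1.0
```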

Readonly and rootless are my two requirements for Docker containers. Most images can't run readonly because they try to create a user in some startup script. Since I want my UIDs unique to isolate mounted directories, this is meaningless. I end up having to wrap or copy Dockerfiles to make them behave reasonably.

Having such a nice layered build system with mountpoints, I'm amazed Docker made read-only an afterthought.
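To give an idea, the per-service overrides end up looking something like this (UID, image and paths here are arbitrary):

```yaml
services:
  app:
    image: example/app:1.0   # often a wrapped or rebuilt image
    read_only: true
    user: "4001:4001"        # a UID/GID unique to this service
    tmpfs:
      - /tmp
    volumes:
      - ./data:/data         # the only writable, persistent path
```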

I like steering docker runs with docker-compose, especially with .env files - easy to store in repositories, easy to customise and have sane defaults.
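e.g. something along these lines, with defaults baked into the compose file (variable names are just examples):

```
# .env - committed alongside the repo, easy to override per deployment
APP_TAG=1.0
APP_PORT=8080
```

```yaml
# docker-compose.yml
services:
  app:
    image: example/app:${APP_TAG:-latest}
    ports:
      - "${APP_PORT:-8080}:8080"
```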
Yeah agreed. I use docker-compose. But it doesn't help if the Docker images try to update /etc/passwd, or force a hardcoded UID, or run some install.sh at runtime instead of buildtime.
It's hit or miss... you sometimes have to make /tmp or another data directory writable, and some images just don't operate right because of initialization steps that happen on first run. But a lot of your own apps can definitely be made to work with a limited write surface, or none at all.
Depends on the specific app. Nginx doesn't work with it out of the box, but Valkey will.
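(For context, nginx trips up because it wants to write its cache and pid file at runtime; the usual workaround is tmpfs mounts for just those paths, roughly:)

```yaml
services:
  nginx:
    image: nginx:alpine
    read_only: true
    tmpfs:
      - /var/cache/nginx   # proxy/fastcgi temp files
      - /var/run           # nginx.pid
```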
This is true, but it's also easy to set at one point and then later introduce a bursty endpoint that ends up throttled unnecessarily. Always a good idea to be familiar with your app's performance profile but it can be easy to let that get away from you.
While this is a good idea I wonder if doing this could allow the intrusion to go undetected for longer - how many people/monitoring systems would notice a small increase in CPU usage compared to all CPUs being maxed out.
Soft and hard memory limits are worth considering too, regardless of container method.
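In compose terms that's roughly (numbers arbitrary):

```yaml
services:
  app:
    image: example/app:1.0
    mem_limit: 512m        # hard limit - the container is OOM-killed above this
    mem_reservation: 256m  # soft limit - only enforced under host memory pressure
```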
This is a great shout actually. Thanks for pointing it out!
The other thing to note is that Docker is, for the most part, stateless. So if you're running something that has to deal with questionable user input (images and video, or more importantly PDFs), a reasonable approach is to stick it on its own VM, cycle the Docker container every hour and the VM every 12, and then still be worried about it getting hacked and leaking secrets.
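A crude sketch of the hourly-recycle half of that (paths and names are hypothetical; the VM rotation would live in whatever manages the VMs):

```
# /etc/cron.d/recycle-renderer (hypothetical)
# recreate the container at the top of every hour
0 * * * * root docker compose -f /srv/renderer/docker-compose.yml up -d --force-recreate
```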
If I can get in once, I can do it again an hour later. I'd be inclined to believe that dumb recycling is not very effective against a persistent attacker.
I wonder if a crypto miner like this was a person doing the work, or just an automated thing someone wrote to scan IPs for known vulnerabilities and exploit them automatically.
Most of this is mitigated by running Docker in an LXC container (like Proxmox does), which grants a lot more isolation than Docker on its own - closer in nature to running separate VMs.
Too bad it straight up doesn't work without heavy modifications in PVE 9.
Illumos had a really nice stack for running containers inside jails and zones... I wonder if any of that ever made it into the linux world. If you broke out of the container you'd just be inside a jail which is even more hardened.
SmartOS constructed a container-like environment using LX-branded zones; it didn't create an in-kernel equivalent to Linux's namespaces and then nest that inside a zone. You're probably thinking of the KVM port to Solaris/illumos, which does run inside a zone to provide additional protection.

While LX-branded zones were a really cool tech demo, maintaining compatibility with Linux long-term would be incredibly painful and you're bound to find all sorts of horrific bugs in production. I believe that Oxide uses KVM to run their Linux guests.

Linux has always supported nested namespaces and you can run Docker containers inside LXC (or Incus) fairly easily. Note that while it does add some additional protection (in particular, it transparently adds user namespaces which is a critical security feature most people still do not enable in Docker) it is still the same technology as containers and so kernel bugs still pose a similar risk.
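(Enabling user namespaces on the Docker side is the `userns-remap` daemon option, e.g. in `/etc/docker/daemon.json`:)

```json
{
  "userns-remap": "default"
}
```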

Yes it was SmartOS - bcantrill worked on it post-Oracle. I remembered Illumos since it was the precursor.
