mgerdts
1,416 karma
Software engineer specializing in operating systems and the backend layers closest to the OS. Previously a sysadmin.

  1. This article is a great read explaining how this trap happens.

    https://www.yesigiveafig.com/p/part-1-my-life-is-a-lie

  2. Datacenter storage will generally not be using M.2 client drives. Client drives employ optimizations that win many benchmarks but sacrifice consistency on multiple dimensions (no power loss protection, write performance that degrades as they fill, and perhaps others).

    With SSDs, the write pattern is very important to read performance.

    Datacenter- and enterprise-class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.

    Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.

    If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.

    If all writes are done such that they are 128k aligned, sequential reads will be optimal, and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read-ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.

    At these speeds, buffered reads by the OS or app can reduce throughput because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
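
    For example, a quick fio invocation of the sort I mean (device path, queue depth, and runtime are placeholders; even a read-only test against a raw device deserves care):

        fio --name=rand128k --filename=/dev/nvme0n1 --direct=1 \
            --ioengine=libaio --rw=randread --bs=128k --iodepth=32 \
            --time_based --runtime=30

    Compare against --rw=read at the same block size to see how much the drive’s read-ahead is buying you.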

  3. I just stumbled across this:

    > Native NVMe is now generally available (GA) with an opt-in model (disabled by default as of October’s latest cumulative update for WS2025).

    https://www.elevenforum.com/t/announcing-native-nvme-in-wind...

  4. This article is talking about SATA SSDs, not HDDs. While the NVMe spec does allow for NVMe HDDs, it seems silly to waste even one PCIe lane on an HDD. SATA HDDs continue to make sense.
  5. In addition to my other comments about parallel IO and unbuffered IO, be aware that WS2022 has (had?) a rather slow NVMe driver. It has been improved in WS2025.
  6. Robocopy has options for unbuffered IO (/J) and parallel operations (/MT:N), which could make it go much faster.

    Performing parallel copies is probably the big win with less than 10 Gb/s of network bandwidth. This will allow SMB multichannel to use multiple connections, hiding some of the slowness you can get with a single TCP connection.

    When doing more than 1-2 GB/s of IO the page cache can start to slow IO down. That’s when unbuffered (direct) IO starts to show a lot of benefit.
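
    Something like the following, with placeholder paths and a guessed thread count (/J and /MT are the flags mentioned above; /E recurses into subdirectories):

        robocopy C:\data \\server\share\data /E /MT:16 /J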

  7. A workload that uses only a fraction of such a system can be corralled onto a single socket (or a portion thereof) and kept on local memory through the use of cgroups.

    Most likely other workloads will also run on this machine. They can be similarly bound to meet their needs.

    With Kubernetes, the CPU Manager can be a big help.
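
    As a rough sketch of the cgroup approach (cgroup v2; the CPU list and memory node below are invented and depend entirely on your topology):

        mkdir /sys/fs/cgroup/app
        echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
        echo 0-31 > /sys/fs/cgroup/app/cpuset.cpus  # socket 0's CPUs (example)
        echo 0 > /sys/fs/cgroup/app/cpuset.mems     # socket 0's NUMA node
        echo $$ > /sys/fs/cgroup/app/cgroup.procs   # move this shell into it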

  8. > You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.

    I think it is safe to say that all drives have this. Refer to the available spare field in the SMART log page (likely via smartctl -a) to see the percentage of factory overprovisioned blocks that are still available.
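
    For example (hypothetical device path; both commands read the same NVMe SMART data):

        smartctl -a /dev/nvme0 | grep -i 'available spare'
        nvme smart-log /dev/nvme0 | grep -i spare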

    I hypothesize that as this OP space dwindles writes get slower because they are more likely to get bogged down behind garbage collection.

    > I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations

    I agree. I agree so much that I question the assertion that drive slowness is a major factor in machines feeling slow. My slow laptop is about 5 years old. Firefox spikes to 100+% CPU for several seconds on most page loads. The drive is idle during that time. I place the vast majority of the blame on software bloat.

    That said, I am aware of credible assertions that drive wear has contributed to measurable regression in VM boot time for a certain class of servers I’ve worked on.

  9. This article misses several important points.

    - Consumer drives like the Samsung 980 Pro and WD Black SN850 use TLC as SLC when roughly 30% or more of the drive is erased. In that state you can burst-write a bit less than 10% of the drive’s capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.

    - Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.

    - A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs.

    - Overprovisioning can be used to increase a drive’s TBW. If, before you write to your 0.3 DWPD 1024 GB drive, you partition it so you use only 960 GB, you now have a 1 DWPD drive. The gain comes mostly from the extra spare area lowering the write amplification factor, not from the modest reduction in usable capacity.

    - Per the NVMe spec, there are indicators of drive health in the SMART log page.

    - Almost all current datacenter or enterprise drives support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc.
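
    Example commands for the TRIM and OCP points above (device and mountpoint are placeholders; the OCP log page needs an OCP-capable drive and a reasonably recent nvme-cli):

        fstrim -v /mnt/data                  # trim free space on one fs
        systemctl enable --now fstrim.timer  # or do it weekly, everywhere
        nvme ocp smart-add-log /dev/nvme0    # WAF, ECC rereads, and friends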

  10. Is there a market for ridiculously loud generators?
  11. I’ve had many very kind people help me throughout my life. In most cases there was no clear immediate reward for them. Maybe there was an immediate return: joy in sharing one’s craft, or the satisfaction of passing on goodwill they received sometime long ago.

    Every meaningful project I’ve worked on has benefited more from inclusion than exclusion. The person I help may or may not become a significant contributor to my project, but many times they become the person that can help me with something I’m learning. And so what if I never run across that person again? Maybe they will remember the kindness they received and pass it along.

  12. I don’t think that having worked at Sun gives you much of a leg up on Triton (cloud platform). Running Triton does require specialized knowledge, but there are decent docs, IRC, and commercial support available.

    Triton uses SmartOS as the operating system on compute nodes. Familiarity with Solaris/illumos is helpful at that layer. If you are using it to run Linux VMs, the amount of Solaris wizardry needed should be minimal.

  13. I have severe hearing loss in my right ear and no-to-mild hearing loss in the left. AirPods Pro 2 make it so that I feel like I can hear in stereo while streaming, without resorting to setting the balance 90% right and jacking up the volume. In that respect I love them. However, they are designed only for moderate loss, so they will not amplify the right ear sufficiently to hear well in that ear unless the left ear is uncomfortably loud.

    For me, I need a real hearing aid to hear a person that is at my right shoulder.

    If both ears are about the same, I think the hearing aid volume (separate slider from general volume) could be adjusted to get past the “designed for moderate loss” limitation.

  14. At a minimum, these folks do.

    https://mnx.io/

    They like it enough that they bought this business from Samsung, which previously developed and supported it through its subsidiary, Joyent. I worked for Joyent for a few years but left before the transition to MNX.

  15. A drive that supports Instant Secure Erase should be encrypting all data. When the ISE function is invoked (“nvme format -s 2”, “hdparm --security-erase”) the key is thrown away and replaced with a new one. Similar implementations exist for NVMe, SATA, and SAS drives, regardless of whether they are HDDs or SSDs.

    This puts a fair amount of trust in the drive’s ability to really delete the old key.
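
    For example (placeholder device paths; both of these destroy all data on the drive):

        nvme format /dev/nvme0n1 -s 2  # NVMe cryptographic erase
        hdparm --user-master u --security-set-pass p /dev/sdX
        hdparm --user-master u --security-erase p /dev/sdX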

  16. I think it is Solidigm that has started to argue that with a 128 TB QLC drive, constant writes at the maximum write rate will hit the drive’s endurance limit only after about 4.6 years. The perf/TB of these drives is better than that of HDDs. The cost per TB, once you factor in server count, switches, power, etc., is argued to favor huge QLC drives too.
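
    The back-of-envelope math looks something like this; the endurance and write-rate numbers here are invented for illustration, not Solidigm’s:

        # years ~= endurance (TBW) / (GB/s converted to TB written per year)
        echo 'scale=1; 200000 / (1.4 * 86400 * 365 / 1000)' | bc  # ~4.5 years
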
  17. In ye olde days, drivers wrestling with maps were also criticized as being distracted and dangerous. Having been a driver in ye olde days in situations where a map was needed, I can confidently say that GPS initiated while stopped and used throughout a trip is far safer than the driver using a paper map while driving.
  18. For those who have no idea what Plus1bus is, it is probably Plur1bus. The title in this submission needs to be fixed.

    https://en.wikipedia.org/wiki/Pluribus_(TV_series)

  19. The other way to look at it is that it’s a waste of RAM to put even a minimal boot disk in RAM. It really depends on which you value more.

    The biggest value of this approach is that you are sure that every boot runs the latest OS image. That could be accomplished in other ways.

    I say this as the personal opinion of the former engineering lead for the team that maintained SmartOS at Joyent.

  20. Soapy water (dish soap) in a spray bottle works wonders. Once they are wet and bubbly they can’t fly, making it safe to knock them to the ground and squish them.

    If you happen to have the spray bottle in hand while they are flying at you, a quick mist in the air in their flight path will turn them away.
