mgerdts
1,416 karma
Software engineer specializing in operating systems and the backend layers closest to the OS. Previously a sysadmin.

  1. This article is a great read explaining how this trap happens.

    https://www.yesigiveafig.com/p/part-1-my-life-is-a-lie

  2. Datacenter storage will generally not be using M.2 client drives. Client drives employ optimizations that win many benchmarks but sacrifice consistency on multiple dimensions (no power loss protection, write performance that degrades as they fill, and perhaps others).

    With SSDs, the write pattern is very important to read performance.

    Datacenter- and enterprise-class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.

    Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.

    If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.

    If all writes are done such that they are 128k aligned, sequential reads will be optimal, and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read-ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.

    At these speeds, buffered reads by the OS or app can reduce throughput because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
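
    For example, a quick fio invocation of the sort I mean (device path, queue depth, and runtime are placeholders; even a read-only test against a raw device deserves care):

        fio --name=rand128k --filename=/dev/nvme0n1 --direct=1 \
            --ioengine=libaio --rw=randread --bs=128k --iodepth=32 \
            --time_based --runtime=30

    Compare against --rw=read at the same block size to see how much the drive’s read-ahead is buying you.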

  3. I just stumbled across this:

    > Native NVMe is now generally available (GA) with an opt-in model (disabled by default as of October’s latest cumulative update for WS2025).

    https://www.elevenforum.com/t/announcing-native-nvme-in-wind...

  4. This article is talking about SATA SSDs, not HDDs. While the NVMe spec does allow for NVMe HDDs, it seems silly to waste even one PCIe lane on an HDD. SATA HDDs continue to make sense.
  5. In addition to my other comments about parallel IO and unbuffered IO, be aware that WS2022 has (had?) a rather slow NVMe driver. It has been improved in WS2025.
  6. Robocopy has options for unbuffered IO (/J) and parallel operations (/MT:N), which could make it go much faster.

    Performing parallel copies is probably the big win with less than 10 Gb/s of network bandwidth. This will allow SMB multichannel to use multiple connections, hiding some of the slowness you can get with a single TCP connection.

    When doing more than 1-2 GB/s of IO the page cache can start to slow IO down. That’s when unbuffered (direct) IO starts to show a lot of benefit.
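
    Something like the following, with placeholder paths and a guessed thread count (/J and /MT are the flags mentioned above; /E recurses into subdirectories):

        robocopy C:\data \\server\share\data /E /MT:16 /J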

  7. A workload that uses only a fraction of such a system can be corralled onto a single socket (or a portion thereof) and kept on local memory through the use of cgroups.

    Most likely other workloads will also run on this machine. They can be similarly bound to meet their needs.

    With Kubernetes, the CPU Manager can be a big help.
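
    As a rough sketch of the cgroup approach (cgroup v2; the CPU list and memory node below are invented and depend entirely on your topology):

        mkdir /sys/fs/cgroup/app
        echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
        echo 0-31 > /sys/fs/cgroup/app/cpuset.cpus  # socket 0's CPUs (example)
        echo 0 > /sys/fs/cgroup/app/cpuset.mems     # socket 0's NUMA node
        echo $$ > /sys/fs/cgroup/app/cgroup.procs   # move this shell into it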

  8. > You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.

    I think it is safe to say that all drives have this. Refer to the available spare field in the SMART log page (likely via smartctl -a) to see the percentage of factory overprovisioned blocks that are still available.
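
    For example (hypothetical device path; both commands read the same NVMe SMART data):

        smartctl -a /dev/nvme0 | grep -i 'available spare'
        nvme smart-log /dev/nvme0 | grep -i spare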

    I hypothesize that as this OP space dwindles writes get slower because they are more likely to get bogged down behind garbage collection.

    > I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations

    I agree. I agree so much that I question the assertion that drive slowness is a major factor in machines feeling slow. My slow laptop is about 5 years old. Firefox spikes to 100+% CPU for several seconds on most page loads. The drive is idle during that time. I place the vast majority of the blame on software bloat.

    That said, I am aware of credible assertions that drive wear has contributed to measurable regression in VM boot time for a certain class of servers I’ve worked on.

  9. This article misses several important points.

    - Consumer drives like the Samsung 980 Pro and WD Black SN850 use TLC as SLC when roughly 30% or more of the drive is erased. In that state you can burst-write a bit less than 10% of the drive’s capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.

    - Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.

    - A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs.

    - Overprovisioning can be used to increase a drive’s TBW. If, before you write to your 0.3 DWPD 1024 GB drive, you partition it so you use only 960 GB, you now have a 1 DWPD drive. The gain comes mostly from the extra spare area lowering the write amplification factor, not from the modest reduction in usable capacity.

    - Per the NVMe spec, there are indicators of drive health in the SMART log page.

    - Almost all current datacenter or enterprise drives support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc.
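
    Example commands for the TRIM and OCP points above (device and mountpoint are placeholders; the OCP log page needs an OCP-capable drive and a reasonably recent nvme-cli):

        fstrim -v /mnt/data                  # trim free space on one fs
        systemctl enable --now fstrim.timer  # or do it weekly, everywhere
        nvme ocp smart-add-log /dev/nvme0    # WAF, ECC rereads, and friends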

  10. Is there a market for ridiculously loud generators?
  11. I’ve had many very kind people help me throughout my life. In most cases there was no clear immediate reward for them. Maybe there was an immediate return: joy in sharing one’s craft, or the satisfaction of passing on goodwill they received sometime long ago.

    Every meaningful project I’ve worked on has benefited more from inclusion than exclusion. The person I help may or may not become a significant contributor to my project, but many times they become the person that can help me with something I’m learning. And so what if I never run across that person again? Maybe they will remember the kindness they received and pass it along.

  12. I don’t think that having worked at Sun gives you much of a leg up on Triton (cloud platform). Running Triton does require specialized knowledge, but there are decent docs, IRC, and commercial support available.

    Triton uses SmartOS as the operating system on compute nodes. Familiarity with Solaris/illumos is helpful at that layer. If you are using it to run Linux VMs, the amount of Solaris wizardry needed should be minimal.

  13. I have severe hearing loss in my right ear and no-to-mild hearing loss in the left. AirPods Pro 2 make it so that I feel like I can hear in stereo while streaming, without resorting to setting the balance 90% right and jacking up the volume. In that respect I love them. However, they are designed only for moderate loss, so they will not amplify the right ear sufficiently to hear well in that ear unless the left ear is uncomfortably loud.

    For me, I need a real hearing aid to hear a person that is at my right shoulder.

    If both ears are about the same, I think the hearing aid volume (separate slider from general volume) could be adjusted to get past the “designed for moderate loss” limitation.

  14. At a minimum, these folks do.

    https://mnx.io/

    They like it enough that they bought this business from Samsung, which previously developed and supported it through its subsidiary, Joyent. I worked for Joyent for a few years but left before the transition to MNX.

  15. A drive that supports Instant Secure Erase should be encrypting all data. When the ISE function is invoked (“nvme format -s 2”, “hdparm --security-erase”) the key is thrown away and replaced with a new one. Similar implementations exist for NVMe, SATA, and SAS drives, regardless of whether they are HDDs or SSDs.

    This puts a fair amount of trust in the drive’s ability to really delete the old key.
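
    For example (placeholder device paths; both of these destroy all data on the drive):

        nvme format /dev/nvme0n1 -s 2  # NVMe cryptographic erase
        hdparm --user-master u --security-set-pass p /dev/sdX
        hdparm --user-master u --security-erase p /dev/sdX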

  16. I think it is Solidigm that has started to argue that with a 128 TB QLC drive, constant writes at the maximum write rate will hit the drive’s endurance limit only after about 4.6 years. The perf/TB of these drives is better than that of HDDs. The cost per TB, once you factor in server count, switches, power, etc., is argued to favor huge QLC drives too.
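
    The back-of-envelope math looks something like this; the endurance and write-rate numbers here are invented for illustration, not Solidigm’s:

        # years ~= endurance (TBW) / (GB/s converted to TB written per year)
        echo 'scale=1; 200000 / (1.4 * 86400 * 365 / 1000)' | bc  # ~4.5 years
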
  17. In ye olde days, drivers wrestling with maps were also criticized as being distracted and dangerous. Having been a driver in ye olde days in situations where a map was needed, I can confidently say that GPS initiated while stopped and used throughout a trip is far safer than the driver using a paper map while driving.
  18. For those who have no idea what Plus1bus is, it is probably Plur1bus. The title in this submission needs to be fixed.

    https://en.wikipedia.org/wiki/Pluribus_(TV_series)

  19. The other way to look at it is that it’s a waste of RAM to put even a minimal boot disk in RAM. It really depends on which you value more.

    The biggest value of this approach is that you are sure that every boot runs the latest OS image. That could be accomplished in other ways.

    I say this as the personal opinion of the former engineering lead for the team that maintained SmartOS at Joyent.

  20. Soapy water (dish soap) in a spray bottle works wonders. Once they are wet and bubbly they can’t fly, making it safe to knock them to the ground and squish them.

    If you happen to have the spray bottle in hand while they are flying at you, a quick mist in the air in their flight path will turn them away.
