
> Data integrity is the natural expectation humans have from computers

I've said it once, and I'll say it again: the only reason ZFS isn't the norm is because we all once lived through a primordial era when it didn't exist. No serious person designing a filesystem today would say it's okay to misplace your data.

Not long ago, on this forum, someone told me that ZFS is only good because it had no competitors in its space. Which is kind of like saying the heavyweight champ is only good because no one else could compete.


To paraphrase, "ZFS is the worst filesystem, except for all those other filesystems that have been tried from time to time."

It's far from perfect, but it has no peers.

I spent many years stubbornly using btrfs and lost data multiple times. Never once did the redundancy I had supposedly configured actually do anything to help me. ZFS has identified corruption caused by bad memory and a bad CPU and let me know immediately which files were damaged.

> No serious person designing a filesystem today would say it's okay to misplace your data.

Former LimeWire developer here... the LimeWire splash screen at startup was due to experiences with silent data corruption. We got some impossible bug reports, so we created a stub executable that would show a splash screen while computing the SHA-1 checksums of the actual application DLLs and JARs. Once everything checked out, that stub would use Java reflection to start the actual application. After moving to that, those impossible bug reports stopped happening. With 60 million simultaneous users, there were always some of them with silent disk corruption that they would blame on LimeWire.
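The integrity-check idea, very roughly, looks like this. This is a Python sketch rather than the Java stub we actually shipped (which used reflection in-process rather than launching a subprocess), and the paths and digests are invented for illustration:

    import hashlib
    import subprocess
    import sys

    # Hypothetical manifest baked into the stub at build time: file -> expected SHA-1.
    EXPECTED = {
        "app/core.jar": "3f786850e387550fdab836ed7e6dc881de23001b",
        "app/native.dll": "89e6c98d92887913cadf06b2adb97f26cde4849b",
    }

    def file_sha1(path: str) -> str:
        """Stream the file through SHA-1 so large files don't need to fit in RAM."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def main() -> None:
        for path, expected in EXPECTED.items():
            if file_sha1(path) != expected:
                print(f"Corrupted file detected: {path}", file=sys.stderr)
                sys.exit(1)
        # Everything checked out; hand off to the real application.
        subprocess.run(["java", "-jar", "app/core.jar"], check=False)

    if __name__ == "__main__":
        main()

Bug reports from a binary that has already proven its own bits are intact are a lot easier to trust.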

When Microsoft was offering free Win7 pre-release install ISOs for download, I was having install issues. I didn't want to get my ISO illegally, so I found a torrent of the ISO and wrote a Python script to download the ISO from Microsoft, but use the torrent file to verify chunks and re-download any corrupted chunks. Something was very wrong on some device between my desktop and Microsoft's servers, but the script eventually produced a non-corrupted ISO.
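A minimal sketch of that approach, assuming the per-piece SHA-1 digests have already been parsed out of the .torrent and that the server honors HTTP Range requests (names and structure are illustrative, not the original script):

    import hashlib
    import requests  # assumes the third-party 'requests' package is installed

    def download_verified(url: str, piece_hashes: list[str], piece_size: int, out_path: str) -> None:
        """piece_hashes: SHA-1 hex digests from the .torrent's 'pieces' field, in order."""
        with open(out_path, "wb") as out:
            for index, expected in enumerate(piece_hashes):
                start = index * piece_size
                end = start + piece_size - 1  # last piece may be short; the server just returns less
                while True:
                    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
                    resp.raise_for_status()
                    if hashlib.sha1(resp.content).hexdigest() == expected:
                        out.write(resp.content)
                        break  # piece verified, move on to the next one
                    # Corrupted somewhere in transit (or by a broken middlebox): fetch it again.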

It annoys me to no end that ECC isn't the norm for all devices with more than 1 GB of RAM. Silent bit flips are just not okay.

Edit: side note: it's interesting how many complaints I still see from people who blame hard drive failures on LimeWire stressing their drives. From very early on, LimeWire allowed bandwidth limiting, which I used to keep heat down on machines that didn't cool their drives properly. Beyond heat issues that I would blame on machine vendors, failures from write volume I would lay at the feet of drive manufacturers.

Though, I'm biased. Any blame for drive wear that didn't fall on either the drive manufacturers or the filesystem implementers not dealing well with random writes would probably fall at my feet. I'm the one who implemented randomized chunk order downloading in order to rapidly increase availability of rare content, which would increase the number of hard drive head seeks on non-log-based filesystems. I always intended to go back and (1) use sequential downloads if tens of copies of the file were in the swarm, to reduce hard drive seeks and (2) implement randomized downloading of rarest chunks first, rather than the naive randomization in the initial implementation. I say naive, but the initial implementation did have some logic to randomize chunk download order in a way to reduce the size of the messages that swarms used to advertise which peers had which chunks. As it turns out, there were always more pressing things to implement and the initial implementation was good enough.

(Though, really, all read-write filesystems should be copy-on-write and log-based, at least for recent writes, maybe with a background process that uses a count-min sketch to estimate read frequency and then optimizes on-disk locality for data that is frequently read but rarely changed.)
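A count-min sketch is only a few lines, for reference. A minimal sketch of the idea; the width/depth parameters and the hashing choice here are arbitrary:

    import hashlib

    class CountMinSketch:
        """Approximate per-key counts in fixed memory; estimates can only over-count."""

        def __init__(self, width: int = 2048, depth: int = 4) -> None:
            self.width = width
            self.depth = depth
            self.rows = [[0] * width for _ in range(depth)]

        def _indexes(self, key: bytes):
            # One independent-ish hash per row, derived by salting blake2b with the row number.
            for row in range(self.depth):
                digest = hashlib.blake2b(key, salt=row.to_bytes(16, "little")).digest()
                yield row, int.from_bytes(digest[:8], "little") % self.width

        def add(self, key: bytes, count: int = 1) -> None:
            for row, col in self._indexes(key):
                self.rows[row][col] += count

        def estimate(self, key: bytes) -> int:
            return min(self.rows[row][col] for row, col in self._indexes(key))

    # e.g. count reads per block address, then relocate the hottest rarely-written blocks
    reads = CountMinSketch()
    reads.add(b"block:123456")
    print(reads.estimate(b"block:123456"))  # >= true count; here 1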

Edit: Also, it's really a shame that TCP over IPv6 doesn't use CRC-32C (intentionally a different CRC polynomial than Ethernet's, to catch more error patterns) to end-to-end checksum the data in each packet. Yes, it's a layering abstraction violation, but IPv6 was a convenient point to introduce a needed change. On the gripping hand, it's probably best in the big picture to handle flow control, corruption/loss detection, and retransmission (and add forward error correction) in libraries at the application layer (a la QUIC, etc.) and move everything to UDP. I was working on Google's indexing system infra when they switched transatlantic search index distribution from multiple parallel transatlantic TCP streams to reserving dedicated bandwidth from the routers and blasting UDP with rateless forward error correction codes. Provided that everyone implements responsible (read: TCP-compatible) flow control, it's really good to have the rapid evolution possible by just using UDP and raising other concerns to libraries at the application layer. (N parallel TCP streams are useful because they typically don't simultaneously hit exponential backoff, so for long-fat networks you get both higher utilization and lower variance than a single TCP stream at N times the bandwidth.)
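For the curious, CRC-32C is the same shift-and-xor construction as the Ethernet CRC, just with the Castagnoli polynomial. A slow table-driven Python sketch (real implementations use the SSE4.2 instruction or a library):

    # CRC-32C (Castagnoli) -- reflected polynomial 0x82F63B78, not Ethernet/zlib's
    # 0xEDB88320, so the two checks catch different error patterns.
    _POLY = 0x82F63B78
    _TABLE = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        _TABLE.append(crc)

    def crc32c(data: bytes, crc: int = 0) -> int:
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
        return crc ^ 0xFFFFFFFF

    # Well-known check value for the nine ASCII digits "123456789".
    assert crc32c(b"123456789") == 0xE3069283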

It sounds like a fun comp sci exercise to optimise the algo for randomised block download to reduce disk operations but maintain resilience. Presumably it would vary significantly by disk cache sizes.

It's not my field, but my impression is that it would be equally resilient to just randomise the start block (adjusting the spacing of start blocks according to user bandwidth?) and then let users run through the download serially, maybe stopping when they hit blocks that have multiple sources and skipping to a new start block?

It's kinda mindboggling to me to think of all the processes that go into a 'simple' torrent download at the logical level.

If AIs get good enough before I die, then asking them to create simulations of silly things like this will probably keep me happy for all my spare time!

For the completely randomized algorithm, my initial prototype was to always download the first block if available. After that, if fewer than 4 extents (contiguous ranges of available bytes) were downloaded locally, randomly choose any available block. (So, we first get the initial block and 3 random blocks.) If 4 or more extents were available locally, then always try the block after the last downloaded block, if available. (This is to minimize disk seeks.) If the next block isn't available, then the first fallback was to check the list of available blocks against the list of blocks immediately following each extent available locally, and randomly choose one of those. (This is to choose a block that hopefully can be the start of a run of sequential downloads, again minimizing disk seeks.) If the first fallback wasn't available, then the second fallback was to compute the same thing, except for the blocks before the locally available extents rather than the blocks after. (This is to avoid increasing the number of locally available extents if possible.) If the second fallback wasn't available, then the final fallback was to uniformly randomly pick one of the available blocks.

Trying to extend locally available extents if possible was desirable because peers advertised block availability as pairs of <offset, length>, so minimizing the number of extents minimized network message sizes.

This initial prototype algorithm (1) minimized disk seeks (after the initial phase of getting the first block and 3 other random blocks) by always downloading the block after the previous download, if possible, and (2) minimized the size of the network messages advertising available extents by extending existing extents when possible.
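A rough Python sketch of that prototype picker, with invented data structures (a set of block indexes peers are offering, our local extents as inclusive ranges, and the last block we downloaded), not the actual LimeWire code:

    import random

    def pick_block(available, local_extents, last_downloaded):
        """available: set of block indexes peers offer; local_extents: list of
        (start, end) inclusive block ranges we already have; last_downloaded: index or None."""
        if 0 in available and not any(s <= 0 <= e for s, e in local_extents):
            return 0                                   # always grab the first block first
        if len(local_extents) < 4:
            return random.choice(sorted(available))    # seed a few random extents
        if last_downloaded is not None and last_downloaded + 1 in available:
            return last_downloaded + 1                 # keep going sequentially: no seek
        after = {e + 1 for _, e in local_extents} & available
        if after:
            return random.choice(sorted(after))        # start of a hopefully long sequential run
        before = {s - 1 for s, _ in local_extents} & available
        if before:
            return random.choice(sorted(before))       # still avoids creating a new extent
        return random.choice(sorted(available))        # last resort: anything on offer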

Unfortunately, in simulation this initial prototype algorithm skewed the availability of blocks in rare files toward the end of the file. Any bias is bad for rapidly spreading rare content, and bias toward the end of the file is particularly bad for audio and video file types, where people like to start listening/watching while the file is still being downloaded.

Instead, the algorithm in the initial production implementation was to first check the file extension against a list of extensions likely to be accessed by the user while still downloading (mp3, ogg, mpeg, avi, wma, asf, etc.).

For the case where the file extension indicates the user is unlikely to access the content until the download is finished (the general-case algorithm), look at the number of extents (contiguous ranges of bytes the user already has). If the number of extents is less than 4, pick any block randomly from the list of blocks that peers were offering for download. If there are 4 or more extents available locally, then for each end of each extent available locally, check the block before it and the block after it to see if they're available for download from peers. If this list of available adjacent blocks is non-empty, then randomly choose one of those adjacent blocks for download. If the list of available adjacent blocks is empty, then uniformly randomly choose one of the blocks available from peers.
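A rough sketch of that general-case picker (again with invented data structures, not the shipped code):

    import random

    def pick_block_general(available, local_extents):
        """General case: the file is unlikely to be used before the download finishes."""
        if len(local_extents) < 4:
            return random.choice(sorted(available))
        adjacent = set()
        for start, end in local_extents:
            adjacent.add(start - 1)   # block just before the extent
            adjacent.add(end + 1)     # block just after the extent
        adjacent &= available
        if adjacent:
            return random.choice(sorted(adjacent))   # grow an existing extent
        return random.choice(sorted(available))      # otherwise anything on offer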

In the case of file types likely to be viewed while being downloaded, it would download from the front of the file until the download was 50% complete, and then randomly either download the first needed block, or else use the previously described algorithm, with the probability of using the previous (randomized) algorithm increasing as the percentage of the download completed increased. There was also some logic to get the last few chunks of files very early in the download for file formats that required information from a file footer in order to start using them (IIRC, ASF and/or WMA relied on footer information to start playing).
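And a sketch of the streaming-friendly wrapper around pick_block_general from above; the linear probability ramp past 50% here is a guess at the shape, not the real curve, and the footer-first special case is omitted:

    import random

    def pick_block_streaming(available, local_extents, fraction_complete, first_needed):
        """Streaming-friendly case (mp3/avi/etc.): fill from the front, then blend in randomness."""
        if fraction_complete < 0.5 and first_needed in available:
            return first_needed                      # fill in from the front for playback
        if fraction_complete < 0.5:
            p_randomized = 0.0
        else:
            p_randomized = min(1.0, 2.0 * (fraction_complete - 0.5))  # guess: linear ramp 0 -> 1
        if random.random() < p_randomized or first_needed not in available:
            return pick_block_general(available, local_extents)
        return first_needed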

Internally, there was also logic to check if a chunk was corrupted (using a Merkle tree built with the Tiger hash algorithm). We would ignore the corrupted chunks when calculating the percentage completed, but would remove corrupted chunks from the list of blocks we needed to download, unless such removal resulted in an empty list of blocks needed for download. In this way, we would avoid re-downloading corrupted blocks unless we had nothing else to do. This avoided the case where one peer had a corrupted block and we just kept re-requesting the same corrupted block from that peer as soon as we detected the corruption.

There was some logic to alert the user if too many corrupted blocks were detected and give the user options to stop the download early and delete it, or else to keep downloading it and just live with a corrupted file. I felt there should have been a third option to keep downloading until a full-but-corrupt download was obtained, retry downloading every corrupt block once, and then re-prompt the user if the file was still corrupt. However, this option would have resulted in more wasted bandwidth and likely more user frustration, with some users hitting "keep trying" repeatedly instead of giving up as soon as it was statistically unlikely they were going to get a non-corrupted download. Indefinite retries without prompting the user were a non-starter due to the amount of bandwidth they would waste.
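The bookkeeping part of that, sketched with invented structures; the actual Tiger-tree verification is assumed to happen elsewhere and just reports a valid/corrupt verdict here:

    def update_after_verification(chunk, is_valid, needed, corrupted, completed):
        """needed/corrupted/completed are sets of block indexes."""
        if is_valid:
            completed.add(chunk)
            needed.discard(chunk)
            return
        corrupted.add(chunk)          # never counted toward percentage complete
        if needed - {chunk}:
            needed.discard(chunk)     # defer re-requesting it while other work remains
        # else: it's the only block left, so keep it queued and retry

    def percent_complete(completed, total_chunks):
        # Corrupted chunks simply never enter 'completed', so they don't inflate this.
        return 100.0 * len(completed) / total_chunks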

The reason ZFS isn't the norm is because it historically was difficult to set up. Outside of NAS solutions, it's only since Ubuntu 20.04 that it has been supported out of the box on any high-profile customer-facing OS. The reliability of the early versions was also questionable, with high zsys CPU usage and sometimes arcane commands needed to rebuild pools. Anecdotally, I've had to support lots of friends with ZFS issues, never with other filesystems. The data always comes back; it's just that it needs petting.

Earlier, there used to be a lot of fear around the license, with Torvalds advising against its use, both for that reason and for lack of maintainers. Now I believe that has been mostly ironed out and it should be less of an issue.

> The reason ZFS isn't the norm is because it historically was difficult to set up. Outside of NAS solutions, it's only since Ubuntu 20.04 that it has been supported out of the box on any high-profile customer-facing OS.

In this one very narrow sense, we are agreed, if we are talking about ZFS on root on Linux. IMHO it should also have been virtually everywhere else. It should have been in macOS, etc.

However, I think your particular comment may miss the forest for the trees. Yes, ZFS was difficult to set up for Linux, because Linux people disfavored its use (which you do touch upon later).

People sometimes imagine that purely technical considerations govern the technical choices of remote groups. However, I think when people say "all tech is political" in the culture-warring American politics sense, they may be right, but they are absolutely right in the small-ball open source politics sense.

Linux communities were convinced not to include or build ZFS support. Because licensing was a problem. Because btrfs was coming and would be better. Because Linus said ZFS was mostly marketing. So they didn't care to build support. Of course, this was all BS or FUD or NIH, but that is what happened; it wasn't that ZFS had new and different recovery tools, or was less reliable at some arbitrary point in the past. It was because the Linux community engaged in its own (successful) FUD campaign against another FOSS project.

> The reason ZFS isn't the norm is because it historically was difficult to set up.

Has this changed? ZFS comes with a BSD view of the world (i.e., slices). It also needed a sick amount of RAM to function properly.

Was there any change in the license that made you believe it should be less of an issue?

Or do you think people simply stopped paying attention?

Canonical had a team of lawyers deeply review the license in 2016. It's beyond my legal skills to say whether their conclusion made it more or less of an issue, but at least the boundaries should now be clearer, for those who understand these matters better.

https://canonical.com/blog/zfs-licensing-and-linux

https://softwarefreedom.org/resources/2016/linux-kernel-cddl...

> the only reason ZFS isn't the norm is because we all once lived through a primordial era when it didn't exist.

There were good filesystems before ZFS. I would love to have a versioning filesystem like Apollo had.

How are the memory overheads of ZFS these days? In the old days, I remember balking at the extra memory required to run ZFS on the little ARM board I was using for a NAS.

That was always FUD more or less. ZFS uses RAM as its primary cache…like every other filesystem, so if you have very little RAM for caching, the performance will degrade…like every other filesystem.

But if you have a single-board computer with 1 GB of RAM and several TB of ZFS, will it just be slow, or actually not run? Granted, my use case was abnormal, and I was evaluating in the early days when there were both license and quality concerns with ZFS on Linux. However, my understanding at the time was that it wouldn't actually work to have several TB in a ZFS pool with 1 GB of RAM.

My understanding is that ZFS has its own cache apart from the page cache, and the minimum cache size scales with the storage size. Did I misunderstand/is my information outdated?

> will it just be slow

This. I use it on a tiny backup server with only 1 GB of RAM and a 4 TB HDD pool, and it's fine. Only one machine backs up to that server at a time, and they do so at network speed (which is admittedly only 100 Mb/s, but it should go somewhat higher with a faster network). Restore also runs OK.

Thanks for this. I initially went with XFS back when there were license and quality concerns with ZFS on Linux, before btrfs was a thing, and moved to btrfs after it was created and had matured a bit.

These days, I think I would be happier with ZFS and one RAID-Z pool across all of the disks instead of individual btrfs partitions or btrfs on RAID 5.

I would think ZFS would suck on a 1 GB machine because it's likely a 32-bit machine. If you had 1 GB in a 64-bit rig, it should be fine.

ZFS does have its own cache (influenced by being Solaris native) but it’s very fast to evict pages.

> That was always FUD more or less.

To give some context: ZFS supports de-duplication, and until fairly recently, the de-duplication data structures had to be resident in memory.

So if you used de-duplication earlier, then yes, you absolutely did need a certain amount of memory per byte stored.

However, there is absolutely no requirement to use de-duplication, and without it the memory requirements are just a small, fairly fixed amount.

It'll store writes in memory until it commits them in a so-called transaction group, so you need to have room for that. But the limits on a transaction group are configurable, so you can lower the defaults.

I don't think I came across anyone suggesting ZFS dedupe without insisting that it was effectively broken except for very specific workloads.

> That was always FUD more or less

Thank you, thank you, exactly this! And additionally, that cache is compressed. In the days of 4 GB machines ZFS was overkill, but today... no problem.
