It's far from perfect, but it has no peers.
I spent many years stubbornly using btrfs and lost data multiple times. Never once did the redundancy I had supposedly configured actually do anything to help me. ZFS has identified corruption caused by bad memory and a bad CPU and let me know immediately which files were damaged.
Former LimeWire developer here... the LimeWire splash screen at startup was due to experiences with silent data corruption. We got some impossible bug reports, so we created a stub executable that would show a splash screen while computing the SHA-1 checksums of the actual application DLLs and JARs. Once everything checked out, that stub would use Java reflection to start the actual application. After moving to that, those impossible bug reports stopped happening. With 60 million simultaneous users, there were always some of them with silent disk corruption that they would blame on LimeWire.
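In spirit, the stub did something like the following (a Python sketch rather than the actual Java code; the file name and digest here are placeholders):

    import hashlib, subprocess, sys

    # Placeholder manifest of application files and their known-good SHA-1 digests.
    EXPECTED = {"LimeWire.jar": "0000000000000000000000000000000000000000"}

    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    bad = [p for p, digest in EXPECTED.items() if sha1_of(p) != digest]
    if bad:
        sys.exit(f"Application files failed verification: {bad}")
    # Only once everything checks out do we hand off to the real application.
    subprocess.run(["java", "-jar", "LimeWire.jar"])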
When Microsoft was offering free Win7 pre-release install ISOs for download, I was having install issues. I didn't want to obtain the ISO illegally, so I found a torrent of the ISO and wrote a Python script that downloaded the ISO from Microsoft but used the torrent file to verify chunks and re-download any corrupted ones. Something was very wrong on some device between my desktop and Microsoft's servers, but the script eventually produced a non-corrupted ISO.
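The core loop was roughly like this (a simplified sketch, not the original script; it assumes the piece length and SHA-1 piece hashes have already been pulled out of the .torrent file, and the URL is a placeholder):

    import hashlib
    import urllib.request

    url = "https://example.com/win7.iso"   # placeholder download URL
    piece_length = 262144                   # "piece length" from the .torrent
    piece_hashes = []                       # 20-byte SHA-1 digests from the .torrent "pieces" field

    def fetch_range(start, end):
        # HTTP Range request for one piece-sized chunk.
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    with open("win7.iso", "wb") as out:
        for i, expected in enumerate(piece_hashes):
            start = i * piece_length
            for attempt in range(10):
                data = fetch_range(start, start + piece_length - 1)
                if hashlib.sha1(data).digest() == expected:
                    break            # chunk verified; keep it
            else:
                raise RuntimeError(f"piece {i} kept failing verification")
            out.write(data)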
It annoys me to no end that ECC isn't the norm for all devices with more than 1 GB of RAM. Silent bit flips are just not okay.
Edit: side note: it's interesting how many complaints I still see from people who blame hard drive failures on LimeWire stressing their drives. From very early on, LimeWire allowed bandwidth limiting, which I used to keep heat down on machines that didn't cool their drives properly. Beyond heat issues that I would blame on machine vendors, failures from write volume I would lay at the feet of drive manufacturers.
Though, I'm biased. Any blame for drive wear that didn't fall on either the drive manufacturers or the filesystem implementers not dealing well with random writes would probably fall at my feet. I'm the one who implemented randomized chunk order downloading in order to rapidly increase availability of rare content, which would increase the number of hard drive head seeks on non-log-based filesystems. I always intended to go back and (1) use sequential downloads if tens of copies of the file were in the swarm, to reduce hard drive seeks and (2) implement randomized downloading of rarest chunks first, rather than the naive randomization in the initial implementation. I say naive, but the initial implementation did have some logic to randomize chunk download order in a way to reduce the size of the messages that swarms used to advertise which peers had which chunks. As it turns out, there were always more pressing things to implement and the initial implementation was good enough.
(Though, really, all read-write filesystems should be copy-on-write and log-based, at least for recent writes, maybe with some background process using a count-min sketch to estimate read frequency and optimize on-disk locality for data that is frequently read but rarely changed.)
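(A count-min sketch is tiny to implement, too; a toy version for estimating per-block read frequency might look like the following, with the width/depth and hashing scheme being arbitrary choices:)

    import hashlib

    class CountMinSketch:
        """Toy count-min sketch for estimating how often a block gets read,
        so a background process could pick out hot-but-rarely-written data."""

        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.rows = [[0] * width for _ in range(depth)]

        def _buckets(self, key: bytes):
            for i in range(self.depth):
                h = hashlib.blake2b(key, digest_size=8, salt=i.to_bytes(8, "little"))
                yield i, int.from_bytes(h.digest(), "little") % self.width

        def add(self, key: bytes, count: int = 1):
            for i, b in self._buckets(key):
                self.rows[i][b] += count

        def estimate(self, key: bytes) -> int:
            # Never undercounts; may overcount due to hash collisions.
            return min(self.rows[i][b] for i, b in self._buckets(key))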
Edit: Also, it's really a shame that TCP over IPv6 doesn't use CRC-32C (intentionally using a different CRC polynomial than Ethernet, to catch more error patterns) to end-to-end checksum the data in each packet. Yes, it's a layering abstraction violation, but IPv6 was a convenient point to introduce a needed change. On the gripping hand, it's probably best in the big picture to move flow control, corruption/loss detection, and retransmission (plus forward error correction) into libraries at the application layer (a la QUIC, etc.) and move everything to UDP. I was working on Google's indexing system infra when they switched transatlantic search index distribution from multiple parallel transatlantic TCP streams to reserving dedicated bandwidth from the routers and blasting UDP with rateless forward error correction codes. Provided that everyone implements responsible (read: TCP-compatible) flow control, it's really good to have the rapid evolution made possible by just using UDP and handling the other concerns in libraries at the application layer. (N parallel TCP streams are useful because they typically don't simultaneously hit exponential backoff, so on long-fat networks you get both higher utilization and lower variance than a single TCP stream at N times the bandwidth.)
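(For reference, CRC-32C is just a CRC with the Castagnoli polynomial swapped in; here's a slow bit-at-a-time reference version, as opposed to the table-driven or SSE4.2/ARMv8-accelerated versions real stacks would use:)

    def crc32c(data: bytes, crc: int = 0) -> int:
        # Reflected Castagnoli polynomial 0x1EDC6F41 -> 0x82F63B78.
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32c(b"123456789") == 0xE3069283   # standard CRC-32C check value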
It's not my field, but my impression is that it would be equally resilient to just randomise the start block (adjusting the spacing of start blocks according to user bandwidth?) and then let users run through the download serially, maybe stopping when they hit blocks that have multiple sources and then skipping to a new start block?
It's kinda mind-boggling to me to think of all the processes that go into a 'simple' torrent download at the logical level.
If AIs get good enough before I die, then asking them to create simulations of silly things like this will probably keep me happy for all my spare time!
Extending locally available extents when possible was desirable because peers advertised block availability as <offset, length> pairs, so minimizing the number of extents minimized network message sizes.
This initial prototype algorithm (1) minimized disk seeks (after an initial phase of getting the first block plus 3 other random blocks) by always downloading the block immediately after the previous one, if possible, and (2) minimized the size of the network messages advertising available extents by extending existing extents when possible.
Unfortunately, in simulation this initial prototype algorithm biased availability of blocks in rare files, biasing in favor of blocks toward the end of the file. Any bias is bad for rapidly spreading rare content, and bias in favor of the end of the file is particularly bad for audio and video file types where people like to start listening/watching while the file is still being downloaded.
Instead, the algorithm in the initial production implementation was to first check the file extension against a list of extensions for file types likely to be accessed by the user while still downloading (mp3, ogg, mpeg, avi, wma, asf, etc.).
For the case where the file extension indicates the user is unlikely to access the content until the download is finished (the general case algorithm), look at the number of extents (contiguous ranges of bytes the user already has). If the number of extents is less than 4, pick any block randomly from the list of blocks that peers were offering for download. If there are 4 or more extents available locally, then for each end of each extent available locally, check the block before it and the block after it to see if they're available for download from peers. If this list of available adjacent blocks is non-empty, randomly choose one of those adjacent blocks for download. If the list of available adjacent blocks is empty, uniformly randomly choose from the blocks available from peers.
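In rough Python, the general case looked something like this (a reconstruction of the description above, not actual LimeWire code; names are illustrative):

    import random

    def choose_block(local_extents, available_blocks):
        """local_extents: list of (first_block, last_block) ranges already on disk.
        available_blocks: set of still-needed block indices some peer is offering."""
        if len(local_extents) < 4:
            return random.choice(sorted(available_blocks))
        # Prefer blocks adjacent to an end of an extent we already have, which
        # both reduces seeks and keeps the advertised extent list small.
        adjacent = {candidate
                    for first, last in local_extents
                    for candidate in (first - 1, last + 1)
                    if candidate in available_blocks}
        pool = adjacent if adjacent else available_blocks
        return random.choice(sorted(pool))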
In the case of file types likely to be viewed while being downloaded, it would download from the front of the file until the download was 50% complete, and then randomly either download the first needed block or use the previously described algorithm, with the probability of using the previous (randomized) algorithm increasing as the download progressed. There was also some logic to get the last few chunks of a file very early in the download for file formats that required information from a file footer in order to start using them (IIRC, ASF and/or WMA relied on footer information to start playing).
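The streaming-friendly mode then just biases toward the front of the file, something like this (again illustrative, reusing choose_block from the sketch above):

    import random

    def choose_block_streaming(fraction_done, local_extents, available_blocks):
        first_needed = min(available_blocks)
        if fraction_done < 0.5:
            return first_needed
        # Past 50%, fall back to the randomized general-case algorithm with a
        # probability that grows from 0 to 1 as the download completes.
        p_randomized = (fraction_done - 0.5) / 0.5
        if random.random() < p_randomized:
            return choose_block(local_extents, available_blocks)
        return first_needed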
Internally, there was also logic to check whether a chunk was corrupted (using a Merkle tree built with the Tiger hash algorithm). We would ignore the corrupted chunks when calculating the percentage completed, but would remove them from the list of blocks we needed to download, unless that removal resulted in an empty list of blocks needed for download. In this way, we would avoid re-downloading corrupted blocks unless we had nothing else to do. This avoided the case where one peer had a corrupted block and we just kept re-requesting the same corrupted block from that peer as soon as we detected the corruption. There was some logic to alert the user if too many corrupted blocks were detected and give the user the options to stop the download early and delete it, or to keep downloading and just live with a corrupted file. I felt there should have been a third option: keep downloading until a full-but-corrupt copy of the file had been downloaded, retry each corrupt block once, and then re-prompt the user if the file was still corrupt. However, this option would have wasted more bandwidth and likely caused more user frustration, with some users hitting "keep trying" repeatedly instead of giving up as soon as it became statistically unlikely they were ever going to get a non-corrupted download. Indefinite retries without prompting the user were a non-starter due to the amount of bandwidth they would waste.
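The bookkeeping for corrupted chunks amounted to something like this (a sketch; the actual verification against the Tiger-hash Merkle tree is elided):

    def update_needed(needed, corrupted, chunk, passed_verification):
        """needed: set of chunk indices still to download.
        corrupted: chunks that failed verification at least once."""
        if passed_verification:
            needed.discard(chunk)
            corrupted.discard(chunk)
            return
        corrupted.add(chunk)
        needed.discard(chunk)   # don't immediately re-request the bad chunk...
        if not needed:          # ...unless it's the only thing left to do
            needed |= corrupted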
Earlier, there used to be a lot of fear around the license, with Torvalds advising against its use, both for that reason and for lack of maintainers. Now I believe that has mostly been ironed out and should be less of an issue.
In this one very narrow sense, we are agreed, if we are talking about ZFS on root on Linux. IMHO it should also have been virtually everywhere else. It should have been in macOS, etc.
However, I think your particular comment may miss the forest for the trees. Yes, ZFS was difficult to set up on Linux, because Linux people disfavored its use (which you do touch on later).
People sometimes imagine that purely technical considerations govern the technical choices of remote groups. However, when people say "all tech is political" in the culture-warring, American-politics sense, they may be right; they are absolutely right in the small-ball, open-source-politics sense.
Linux communities were convinced not to include or build ZFS support. Because licensing was a problem. Because btrfs was coming and would be better. Because Linus said ZFS was mostly marketing. So they didn't care to build support. Of course, this was all BS or FUD or NIH, but it's what happened; it wasn't that ZFS had new and different recovery tools, or was less reliable at some arbitrary point in the past. It was because the Linux community engaged in its own (successful) FUD campaign against another FOSS project.
Has this changed? ZFS comes with a BSD view of the world (i.e., slices). It also needed a sick amount of RAM to function properly.
Or do you think people simply stopped paying attention?
https://canonical.com/blog/zfs-licensing-and-linux
https://softwarefreedom.org/resources/2016/linux-kernel-cddl...
There were good filesystems before ZFS. I would love to have a versioning filesystem like Apollo had.
My understanding is that ZFS has its own cache apart from the page cache, and the minimum cache size scales with the storage size. Did I misunderstand/is my information outdated?
This. I use it on a tiny backup server with only 1 GB of RAM and a 4 TB HDD pool, and it's fine. Only one machine backs up to that server at a time, and it does that at network speed (which is admittedly only 100 Mb/s, but it would likely go somewhat higher with a faster network). Restores also run fine.
These days, I think I would be happier with zfs and one RAID-Z pool across all of the disks instead of individual btrfs partitions or btrfs on RAID 5.
ZFS does have its own cache (influenced by its Solaris-native origins), but it's very fast to evict pages.
To give some context: ZFS supports de-duplication, and until fairly recently, the de-duplication data structures had to be resident in memory.
So if you used de-duplication earlier, then yes, you absolutely did need a certain amount of memory per byte stored.
However, there is absolutely no requirement to use de-duplication, and without it the memory requirements are just a small, fairly fixed amount.
It'll store writes in memory until it commits them in a so-called transaction group, so you need to have room for that. But the limits on a transaction group are configurable, so you can lower the defaults.
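(On Linux with OpenZFS, the relevant knobs show up as module parameters; here's a quick way to peek at them, assuming the zfs module is loaded, and noting that exact parameter availability varies by version:)

    from pathlib import Path

    PARAMS = Path("/sys/module/zfs/parameters")

    def read_tunable(name):
        p = PARAMS / name
        return int(p.read_text()) if p.exists() else None

    for name in ("zfs_dirty_data_max",   # cap on dirty (uncommitted) data per txg
                 "zfs_txg_timeout",      # seconds before a txg is forced to commit
                 "zfs_arc_max"):         # upper bound on ARC size (0 = auto)
        print(name, read_tunable(name))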
Thank you, thank you, exactly this! And additionally, that cache is compressed. In the days of 4 GB machines ZFS was overkill, but today... no problem.
I've said it once, and I'll say it again: the only reason ZFS isn't the norm is because we all once lived through a primordial era when it didn't exist. No serious person designing a filesystem today would say it's okay to misplace your data.
Not long ago, on this forum, someone told me that ZFS is only good because it had no competitors in its space. Which is kind of like saying the heavyweight champ is only good because no one else could compete.