On 6/26/20 12:43 PM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 12:30:35PM -0400, Josef Bacik wrote:
Obviously the Facebook scale, recoverability, and workload are going to be drastically different from a random Fedora user's. But hardware-wise we are pretty close, at least on the disk side. Thanks,
Thanks. I guess it's really recoverability I'm most concerned with. I expect that if one of these nodes has a metadata corruption that results in an unbootable system, that's really no big deal in the big scheme of things. It's a bigger deal to home users. :)
Sure, I've answered this a few different times with various members of the working group committee (or whatever they're called nowadays). I'll copy and paste what I said to them. The context is "what do we do with bad drives that blow up at the wrong time".
Now as for what does the average Fedora user do? I've also addressed that a bunch over the last few weeks, but instead of pasting like 9 emails I'll just summarize.
The UX of a completely fucked fs sucks, regardless of the file system. Systemd currently does not handle booting with a read-only file system (though apparently it will soon), which is essentially what you get when you have critical metadata corrupted. You are dumped to an emergency shell, and then you have to know what to do from there.
With ext4/xfs, you mount read-only or you run fsck. With btrfs you can do that too, but then there's a whole other set of options depending on how bad the disk is. I've written a lot of tools over the years (which are in btrfs-progs) to recover various levels of broken file systems, to the point that you can pretty drastically mess up a fs and I'll still be able to pull data from the disk.
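To make that concrete, here is roughly what each path looks like from the emergency shell. /dev/sdb1 is just a stand-in for whatever the damaged device actually is, and the exact btrfs rescue subcommands available depend on your btrfs-progs version:

    # ext4: read-only mount, then repair
    mount -o ro /dev/sdb1 /mnt
    e2fsck -f /dev/sdb1

    # xfs: read-only mount, then repair
    mount -o ro /dev/sdb1 /mnt
    xfs_repair /dev/sdb1

    # btrfs: read-only check, then pull data off even if it won't mount
    btrfs check --readonly /dev/sdb1
    btrfs restore -D /dev/sdb1 /tmp/out      # dry run, list what's recoverable
    btrfs restore /dev/sdb1 /mnt/recovery    # copy files out to a good disk

    # deeper btrfs-progs recovery tools for progressively worse damage
    btrfs rescue super-recover -v /dev/sdb1
    btrfs rescue zero-log /dev/sdb1
    btrfs rescue chunk-recover /dev/sdb1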
But, again, the UX for this _sucks_. You have to know first of all that you should try mounting read-only, and then you have to get something plugged into the box and copy the data over. And then assume the worst: you can't even mount read-only. With ext4/xfs that's it, you are done. With btrfs you are just getting started. You have several built-in mount options for recovering from different failures, all read-only. But you have to know that they are there and how to use them.
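For reference, these are the read-only rescue mount options I'm talking about. The exact spelling depends on your kernel; newer kernels group them under rescue= (the device name is again a placeholder):

    # older kernels
    mount -o ro,usebackuproot /dev/sdb1 /mnt    # try an older copy of the tree roots
    mount -o ro,nologreplay   /dev/sdb1 /mnt    # skip replaying the log tree

    # newer kernels
    mount -o ro,rescue=usebackuproot /dev/sdb1 /mnt
    mount -o ro,rescue=nologreplay   /dev/sdb1 /mnt
    mount -o ro,rescue=all           /dev/sdb1 /mnt   # throw everything at it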
These things are easily addressed with documentation, but documentation only goes so far. This sort of scenario needs to be baked into Fedora itself, because it's the same problem no matter which file system you use. Thanks,
Josef
Email elaborating my comments about btrfs's sensitivity to bad hardware and how we test.
---------------
The fact is I can make any file system unmountable with the right corruption. The only difference with btrfs is that our metadata is completely dynamic, while xfs and ext4 are less so. So they're overwriting the same blocks over and over again, and there is simply less "important" metadata that the file system needs in order to function.
The "problem" that btrfs has is it's main strength, it does COW. That means our important metadata is constantly being re-written to different segments of the disk. So if you have a bad disk, you are much more likely to get unlucky and end up with some core piece of metadata getting corrupted, and thus resulting in a file system that cannot be mounted read/write.
Now you are much more likely to hit this in a data segment, because generally speaking there are more data writes than metadata writes. The thing I brought up in the meeting last week is a potential downside for sure, but not something that will be a common occurrence. I just checked the fleet for this week: we've had to reprovision 20 machines out of the 138 that threw crc errors, out of N total machines with btrfs file systems, which is in the millions. In the same time period I have 15 xfs boxes that needed to be reprovisioned because of metadata corruption, out of <100k machines that have xfs. I don't have data on ext4 because it doesn't exist in our fleet anymore.
As for testing, there are 8 tests in xfstests that utilize my dm-log-writes target. These tests mount the file system, run a random workload, and then replay the recorded workload one write at a time to validate that the file system isn't left in some intermediate broken state. This simulates crashing at arbitrary points, but in a much more concrete and repeatable manner.
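A rough sketch of what that looks like under the covers. Device names are placeholders, and replay-log is the helper that ships with xfstests under src/log-writes/:

    # stack dm-log-writes on the scratch device, logging every write to a second device
    dmsetup create logdev --table \
        "0 $(blockdev --getsz /dev/sdb1) log-writes /dev/sdb1 /dev/sdc1"
    mkfs.btrfs -f /dev/mapper/logdev
    mount /dev/mapper/logdev /mnt
    # ... run the random workload here ...
    umount /mnt
    dmsetup remove logdev

    # replay the log one write at a time, running a check at each checkpoint
    replay-log --log /dev/sdc1 --replay /dev/sdb1 \
        --check fua --fsck "btrfs check /dev/sdb1"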
There are 65 tests that utilize dm-flakey, which drops or corrupts writes during configurable intervals, and again these are there to test different scenarios that have given us issues in the past. There are more of these because up until a few years ago this was our only mechanism for testing this class of failures. I wrote dm-log-writes to bring some determinism to our testing.
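For comparison, this is roughly how a dm-flakey device gets set up; the table format is from the kernel's dm-flakey documentation and the device name is a placeholder:

    DEV=/dev/sdb1
    SIZE=$(blockdev --getsz "$DEV")
    # behave normally for 60 seconds, then silently drop all writes for 5 seconds, repeating
    dmsetup create flaky --table "0 $SIZE flakey $DEV 0 60 5 1 drop_writes"
    mkfs.xfs -f /dev/mapper/flaky
    mount /dev/mapper/flaky /mnt
    # ... run the workload, unmount, then see whether the fs survives recovery/fsck ...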
All of our file systems in Linux are extremely thoroughly tested for a variety of power-fail cases. The only area where btrfs is more likely to screw up is the case of bad hardware, and again we're not talking about a huge difference in percentage terms. It's a trade-off: you are trading a slightly increased chance that bad hardware will result in a file system that cannot be mounted read/write for the ability to detect silent corruption from your memory, CPU, or storage device. Thanks,
Josef