On 6/26/20 12:43 PM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 12:30:35PM -0400, Josef Bacik wrote:
> Obviously the Facebook scale, recoverability, and workload is going
> to be drastically different from a random Fedora user. But hardware
> wise we are pretty close, at least on the disk side. Thanks,
Thanks. I guess it's really recoverability I'm most concerned with. I expect
that if one of these nodes has a metadata corruption that results in an
unbootable system, that's really no big deal in the big scheme of things.
It's a bigger deal to home users. :)
Sure, I've answered this a few different times with various members of the
working group committee (or whatever they're called nowadays). I'll copy and
paste what I said to them. The context is "what do we do with bad drives that
blow up at the wrong time".
Now as for what does the average Fedora user do? I've also addressed that a
bunch over the last few weeks, but instead of pasting like 9 emails I'll just
summarize here.
The UX of a completely fucked fs sucks, regardless of the file system.
Systemd currently does not handle booting with a read-only file system (though
apparently it will soon), which is essentially what you get when you have
critical metadata corruption. You are dumped to an emergency shell, and then
you have to know what to do from there.
With ext4/xfs, you mount read only or you run fsck. With Btrfs you can do that
too, but then there's a whole other level of options depending on how bad
the disk is. I've written a lot of tools over the years (which are in
btrfs-progs) to recover various levels of broken file systems. To the point
that you can pretty drastically mess up a FS and I'll still be able to pull data
from the disk.
But, again, the UX for this _sucks_. You have to know first of all that you
should try mounting read only, and then you have to get something plugged into
the box and copy it over. Then assume the worst: you can't mount read only.
Now with ext4/xfs that's it, you are done. With btrfs you are just getting
started. You have several built in mount options for recovering different
failures, all read only. But you have to know that they are there and how to
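To make that concrete, here's a sketch of the kind of escalation path I mean.
Device names and the recovery directory are placeholders, and the rescue=
mount options need a reasonably recent kernel:

```shell
# 1. Try a plain read-only mount and copy your data off.
mount -o ro /dev/sda3 /mnt

# 2. If that fails, try the built-in recovery mount options
#    (all read only; rescue=all needs a recent kernel).
mount -o ro,rescue=usebackuproot /dev/sda3 /mnt
mount -o ro,rescue=all /dev/sda3 /mnt

# 3. If the fs won't mount at all, pull files straight off the
#    block device with btrfs restore (no mount required).
btrfs restore /dev/sda3 /path/to/recovery/dir

# 4. Only as a last resort, attempt an in-place repair.
btrfs check --repair /dev/sda3
```

The point being: each step is a different tool with different trade-offs, and
nothing in the boot path tells you any of this exists.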
These things are easily addressed with documentation, but that's only so good.
This sort of scenario needs to be baked into Fedora itself, because it's the
same problem no matter which file system you use. Thanks,
Email elaborating my comments about btrfs's sensitivity to bad hardware and how
we test for it:
The fact is I can make any file system unmountable with the right corruption.
The only difference with btrfs is that our metadata is completely dynamic, while
xfs and ext4 are less so. So they're overwriting the same blocks over and over
again, and there is simply less "important" metadata for the file system to
lose.
The "problem" that btrfs has is its main strength: it does COW. That means
important metadata is constantly being re-written to different segments of the
disk. So if you have a bad disk, you are much more likely to get unlucky and
end up with some core piece of metadata getting corrupted, and thus resulting in
a file system that cannot be mounted read/write.
Now you are much more likely to hit this in a data segment, because generally
speaking there are more data writes than metadata writes. The thing I brought up
in the meeting last week was a potential downside for sure, but not something
that will be a common occurrence. I just checked the fleet for this week and
we've had to reprovision 20 machines out of 138 machines that threw crc errors,
out of N total machines with btrfs fs'es, which is in the millions. In the same
time period I have 15 xfs boxes that needed to be reprovisioned because of
metadata corruption, out of <100k machines that have xfs. I don't have data on
ext4 because it doesn't exist in our fleet anymore.
As for testing, there are 8 tests in xfstests that utilize my dm-log-writes
target. These tests mount the file system, do a random workload, and then
replay the workload one write at a time to validate the file system isn't left
in some intermediate broken state. This simulates the case of weird things
happening but in a much more concrete and repeatable manner.
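As a rough sketch of how dm-log-writes gets used (device names are
placeholders; replay-log is built from xfstests' src/log-writes/):

```shell
# Put the log-writes target on top of the device under test.
# Writes hit /dev/vdb as normal and are also recorded on /dev/vdc.
dmsetup create log --table \
    "0 $(blockdev --getsz /dev/vdb) log-writes /dev/vdb /dev/vdc"

mkfs.btrfs -f /dev/mapper/log
mount /dev/mapper/log /mnt
# ... run the random workload ...
umount /mnt
dmsetup remove log

# Replay the recorded writes one entry at a time onto the device,
# so the fs can be checked in every intermediate state.
replay-log --log /dev/vdc --replay /dev/vdb --limit 1
```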
There are 65 tests that utilize dm-flakey, which randomly corrupts or drops
writes, and again these are to test different scenarios that have given us
issues in the past. There's more of these because up until a few years ago this
was our only mechanism for testing this class of failures. I wrote
dm-log-writes to bring some determinism to our testing.
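For comparison, a dm-flakey setup looks something like this (again, device
names are placeholders; the target behaves normally for up_interval seconds,
then misbehaves for down_interval seconds, here silently dropping writes):

```shell
# 60s of normal I/O, then 5s where writes are silently dropped.
dmsetup create flaky --table \
    "0 $(blockdev --getsz /dev/vdb) flakey /dev/vdb 0 60 5 1 drop_writes"

mkfs.xfs -f /dev/mapper/flaky
mount /dev/mapper/flaky /mnt
# ... run the workload; anything written during a down interval is lost ...
umount /mnt
dmsetup remove flaky
```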
All of our file systems in Linux are extremely thoroughly tested for a variety
of power-fail cases. The only area where btrfs is more likely to screw up is in
the case of bad hardware, and again we're not talking a huge difference in
percentage points. It's a trade-off: you are trading a slightly increased
chance that bad hardware will result in a file system that cannot be mounted
read/write for the ability to detect silent corruption from your memory, CPU,
or storage device. Thanks,