On 6/26/20 12:43 PM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 12:30:35PM -0400, Josef Bacik wrote:
Obviously the Facebook scale, recoverability, and workload are going to be drastically different from a random Fedora user's. But hardware-wise we are pretty close, at least on the disk side. Thanks,
Thanks. I guess it's really recoverability I'm most concerned with. I expect that if one of these nodes has a metadata corruption that results in an unbootable system, that's really no big deal in the big scheme of things. It's a bigger deal to home users. :)
Sure, I've answered this a few different times with various members of the working group committee (or whatever they're called nowadays). I'll copy and paste what I said to them. The context is "what do we do with bad drives that blow up at the wrong time".
Now as for what does the average Fedora user do? I've also addressed that a bunch over the last few weeks, but instead of pasting like 9 emails I'll just summarize.
The UX of a completely fucked fs sucks, regardless of the file system. Systemd currently does not handle booting with a read-only file system (though apparently it will soon), which is essentially what you get when you have critical metadata corrupted. You are dumped to an emergency shell, and then you have to know what to do from there.
With ext4/xfs, you mount read-only or you run fsck. With btrfs you can do that too, but then there's a whole other set of options depending on how bad the disk is. I've written a lot of tools over the years (which are in btrfs-progs) to recover various levels of broken file systems, to the point that you can pretty drastically mess up a fs and I'll still be able to pull data from the disk.
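To make that concrete, here is roughly what each path looks like from the emergency shell. /dev/sdb1 is just a stand-in for whatever the damaged device actually is, and the exact btrfs rescue subcommands available depend on your btrfs-progs version:

    # ext4: read-only mount, then repair
    mount -o ro /dev/sdb1 /mnt
    e2fsck -f /dev/sdb1

    # xfs: read-only mount, then repair
    mount -o ro /dev/sdb1 /mnt
    xfs_repair /dev/sdb1

    # btrfs: read-only check, then pull data off even if it won't mount
    btrfs check --readonly /dev/sdb1
    btrfs restore -D /dev/sdb1 /tmp/out      # dry run, list what's recoverable
    btrfs restore /dev/sdb1 /mnt/recovery    # copy files out to a good disk

    # deeper btrfs-progs recovery tools for progressively worse damage
    btrfs rescue super-recover -v /dev/sdb1
    btrfs rescue zero-log /dev/sdb1
    btrfs rescue chunk-recover /dev/sdb1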
But, again, the UX for this _sucks_. You have to know first of all that you should try mounting read-only, and then you have to get something plugged into the box and copy the data over. And then assume the worst: you can't even mount read-only. With ext4/xfs that's it, you are done. With btrfs you are just getting started. You have several built-in mount options for recovering from different failures, all read-only. But you have to know that they are there and how to use them.
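For reference, these are the read-only rescue mount options I'm talking about. The exact spelling depends on your kernel; newer kernels group them under rescue= (the device name is again a placeholder):

    # older kernels
    mount -o ro,usebackuproot /dev/sdb1 /mnt    # try an older copy of the tree roots
    mount -o ro,nologreplay   /dev/sdb1 /mnt    # skip replaying the log tree

    # newer kernels
    mount -o ro,rescue=usebackuproot /dev/sdb1 /mnt
    mount -o ro,rescue=nologreplay   /dev/sdb1 /mnt
    mount -o ro,rescue=all           /dev/sdb1 /mnt   # throw everything at it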
These things are easily addressed with documentation, but documentation only goes so far. This sort of scenario needs to be baked into Fedora itself, because it's the same problem no matter which file system you use. Thanks,
Josef
Email elaborating my comments about btrfs's sensitivity to bad hardware and how we test.
---------------
The fact is I can make any file system unmountable with the right corruption. The only difference with btrfs is that our metadata is completely dynamic, while xfs and ext4 are less so. So they're overwriting the same blocks over and over again, and there is simply less "important" metadata that the file system needs in order to function.
The "problem" that btrfs has is it's main strength, it does COW. That means our important metadata is constantly being re-written to different segments of the disk. So if you have a bad disk, you are much more likely to get unlucky and end up with some core piece of metadata getting corrupted, and thus resulting in a file system that cannot be mounted read/write.
Now you are much more likely to hit this in a data segment, because generally speaking there are more data writes than metadata writes. The thing I brought up in the meeting last week is a potential downside for sure, but not something that will be a common occurrence. I just checked the fleet for this week: we've had to reprovision 20 machines out of the 138 that threw crc errors, out of N total machines with btrfs file systems, which is in the millions. In the same time period I have 15 xfs boxes that needed to be reprovisioned because of metadata corruption, out of <100k machines that have xfs. I don't have data on ext4 because it doesn't exist in our fleet anymore.
As for testing, there are 8 tests in xfstests that utilize my dm-log-writes target. These tests mount the file system, run a random workload, and then replay the recorded workload one write at a time to validate that the file system isn't left in some intermediate broken state. This simulates crashing at arbitrary points, but in a much more concrete and repeatable manner.
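A rough sketch of what that looks like under the covers. Device names are placeholders, and replay-log is the helper that ships with xfstests under src/log-writes/:

    # stack dm-log-writes on the scratch device, logging every write to a second device
    dmsetup create logdev --table \
        "0 $(blockdev --getsz /dev/sdb1) log-writes /dev/sdb1 /dev/sdc1"
    mkfs.btrfs -f /dev/mapper/logdev
    mount /dev/mapper/logdev /mnt
    # ... run the random workload here ...
    umount /mnt
    dmsetup remove logdev

    # replay the log one write at a time, running a check at each checkpoint
    replay-log --log /dev/sdc1 --replay /dev/sdb1 \
        --check fua --fsck "btrfs check /dev/sdb1"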
There are 65 tests that utilize dm-flakey, which drops or corrupts writes during configurable intervals, and again these are there to test different scenarios that have given us issues in the past. There are more of these because up until a few years ago this was our only mechanism for testing this class of failures. I wrote dm-log-writes to bring some determinism to our testing.
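For comparison, this is roughly how a dm-flakey device gets set up; the table format is from the kernel's dm-flakey documentation and the device name is a placeholder:

    DEV=/dev/sdb1
    SIZE=$(blockdev --getsz "$DEV")
    # behave normally for 60 seconds, then silently drop all writes for 5 seconds, repeating
    dmsetup create flaky --table "0 $SIZE flakey $DEV 0 60 5 1 drop_writes"
    mkfs.xfs -f /dev/mapper/flaky
    mount /dev/mapper/flaky /mnt
    # ... run the workload, unmount, then see whether the fs survives recovery/fsck ...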
All of our file systems in Linux are extremely thoroughly tested for a variety of power-fail cases. The only area where btrfs is more likely to screw up is the case of bad hardware, and again we're not talking about a huge difference in percentage terms. It's a trade-off: you are trading a slightly increased chance that bad hardware will result in a file system that cannot be mounted read/write for the ability to detect silent corruption from your memory, CPU, or storage device. Thanks,
Josef