On Mon, Jun 29, 2020 at 10:26:37AM -0600, Chris Murphy wrote:
> You've got an example where 'btrfs restore' saw no files at all? And
> you think it's the file system rather than the hardware, why?
Because the system failed to boot up, and even after offline repair
attempts it was still missing a sufficiently large chunk of the root
filesystem to necessitate re-installation.
Because the same hardware provided literally years of problem-free
stability with ext4 (before) and xfs (after).
> I think this is the wrong metaphor because it suggests btrfs caused
> the crapping. The sequence is: btrfs does the right thing, drive
> firmware craps itself and there's a power failure or a crash. Btrfs in
> the ordinary case doesn't care and boots without complaint. In the far
The first time, I needed to physically move the system, so the machine
was shut down via 'shutdown -h now' on a console, and didn't come back
up.
The second time was a routine post-dnf-update 'reboot', without power
cycling anything.
At no point was there ever any unclean shutdown, and at the time of
those reboots, no errors were reported in the kernel logs.
Once is a fluke, twice is a trend... and I didn't have the patience for
a third try because I needed to be able to rely on the system to not eat
itself.
I can't get the complete details at the moment, but it was an AMD E-350
system with a 32GB ADATA SATA SSD, configured using anaconda's btrfs
defaults and only about 30% of disk space used. Pretty minimal I/O.
I will concede that it's possible there was/is some sort of
hardware/firmware bug, but if so, only btrfs seemed to trigger it.
(more on this later)
> Come on. It's cleanly unmounted and doesn't mount?
Yes. (See above)
(Granted, I'm using "mount" to mean "successfully mounted a writable
filesystem with data largely intact" -- I'm a bit fuzzy on the exact
details, but I believe it did mount read-only before the boot
crapped out due to missing/inaccessible system libraries. I had to
resort to a USB stick to attempt repairs that were only partially
successful.)
> All file systems have write ordering expectations. If the hardware
> doesn't honor that, it's trouble if there's a crash. What you're
> describing is 100% a hardware crapped itself case. You said it cleanly
> unmounted i.e. the exact correct write ordering did happen. And yet
> the file system can't be mounted again. That's a hardware failure.
That may be the case, but when there were no crashes, and neither ext4
nor xfs crapped themselves under day-to-day operation with the same
hardware, it's reasonable to infer that the problem has _something_ to
do with the variable that changed, i.e. btrfs.
> There is no way for one person to determine if Btrfs is ready. That's
> done by combination of synthetic tests (xfstests) and volume
> regression testing on actual workloads. And by the way the Red Hat CKI
> project is going to help run btrfs xfstests for Fedora kernels.
Of course not, but the Fedora community is made up of innumerable "one
persons" each responsible for several special snowflake systems.
Let's say for sake of argument that my bad btrfs experiences were due to
bugs in device firmware with btrfs's completely-legal usage patterns
rather than bugs in btrfs-from-five-years-ago. That's great... except
my system still got trashed to the point of needing to be reinstalled,
and finger-pointing can't bring back lost data.
How many more special snowflake drives are out there? Think about how
long it took Fedora to enable TRIM out of concern for potential data
loss. Why should this be any different?
(We're always going to be stuck with buggy firmware. FFS, the Samsung
860 EVO SATA SSD that I have in my main workstation will hiccup to the
point of trashing data when used with AMD SATA controllers... even
under Windows! Their official support answer is "Use an Intel
controller". And that's a tier-one manufacturer who presumably has
among the best QA and support in the industry...)
If there is device/firmware known to be problematic, we need to keep
track of these buggy devices and either automatically provide
workarounds or some way to tell the user that proceeding with btrfs may
be perilous to their data.
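(To make that concrete, here's a rough sketch of what such a check could
look like -- a hypothetical tool, not anything that exists in Fedora today,
and the denylist entries are placeholders -- it just illustrates matching
the device model reported in sysfs against a curated list:)

  #!/usr/bin/env python3
  # Hypothetical sketch: warn if an attached disk matches a (placeholder)
  # list of device models with known-problematic firmware.
  from pathlib import Path

  # Placeholder entries; a real list would be curated from bug reports.
  SUSPECT_MODELS = {"Example Model 123"}

  def device_models():
      """Yield (device name, model string) for block devices in sysfs."""
      for dev in Path("/sys/class/block").iterdir():
          model_file = dev / "device" / "model"
          if model_file.is_file():
              yield dev.name, model_file.read_text().strip()

  for name, model in device_models():
      if model in SUSPECT_MODELS:
          print(f"WARNING: /dev/{name} ({model}) has known firmware issues; "
                "using btrfs on it may risk data loss.")

Whether something like that belongs in anaconda, a kernel-style quirk
table, or just the release notes is a separate question.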
(Or perhaps the issues I had were due to bugs in btrfs-of-five-years-ago
that have long since been fixed. Either way, given my twice-burned
experiences, I would want to verify that for myself before I entrust it
with any data I care about...)
> The questions are whether the Fedora community wants and is ready for
> Btrfs by default.
There are obviously some folks here (myself included) that have had very
negative btrfs experiences. Similarly, there are folks that have
successfully overseen large-scale deployments of btrfs in their managed
environments (not on Fedora though, IIUC).
So yes, I think an explicit "let's all test btrfs (as anaconda
configures it) before we make it default" period is warranted.
Perhaps one can argue that Fedora has already been doing that for the
past two years (since 2018-or-later-btrfs is what everyone with positive
results appears to be talking about), but it's still not clear that
those deployments utilize the same feature set as Fedora's defaults, and
how broad the hardware sample is.
- Solomon
--
Solomon Peachy pizza at shaftnet dot org (email&xmpp)
@pizza:shaftnet dot org (matrix)
High Springs, FL speachy (freenode)