https://fedoraproject.org/wiki/Changes/BtrfsByDefault
== Summary ==
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
== Owners ==
* Names: [[User:Chrismurphy|Chris Murphy]], [[User:Ngompa|Neal Gompa]], [[User:Josef|Josef Bacik]], [[User:Salimma|Michel Alexandre Salim]], [[User:Dcavalca|Davide Cavalca]], [[User:eeickmeyer|Erich Eickmeyer]], [[User:ignatenkobrain|Igor Raits]], [[User:Raveit65|Wolfgang Ulbrich]], [[User:Zsun|Zamir SUN]], [[User:rdieter|Rex Dieter]], [[User:grinnz|Dan Book]], [[User:nonamedotc|Mukundan Ragavan]]
* Emails: chrismurphy@fedoraproject.org, ngompa13@gmail.com, josef@toxicpanda.com, michel@michel-slm.name, dcavalca@fb.com, erich@ericheickmeyer.com, ignatenkobrain@fedoraproject.org, fedora@raveit.de, zsun@fedoraproject.org, rdieter@gmail.com, grinnz@gmail.com, nonamedotc@gmail.com
* Products: All desktop editions, spins, and labs
* Responsible WGs: Workstation Working Group, KDE Special Interest Group
== Detailed Description ==
Fedora desktop edition/spin variants will switch to using Btrfs as the filesystem by default for new installs. Labs derived from these variants inherit this change, and other editions may opt into this change.
The change is based on the installer's custom partitioning Btrfs preset. It's been well tested for 7 years.
'''''Current partitioning'''''<br /> <span style="color: tomato">vg/root</span> LV mounted at <span style="color: tomato">/</span> and a <span style="color: tomato">vg/home</span> LV mounted at <span style="color: tomato">/home</span>. These are separate file system volumes, with separate free/used space.
'''''Proposed partitioning'''''<br /> <span style="color: tomato">root</span> subvolume mounted at <span style="color: tomato">/</span> and <span style="color: tomato">home</span> subvolume mounted at <span style="color: tomato">/home</span>. Subvolumes don't have a size; they act mostly like directories, and space is shared.
'''''Unchanged'''''<br /> <span style="color: tomato">/boot</span> will be a small ext4 volume. A separate boot is needed to boot dm-crypt sysroot installations; it's less complicated to keep the layout the same, regardless of whether sysroot is encrypted. There will be no automatic snapshots/rollbacks.
If you choose to encrypt your data, LUKS (dm-crypt) will still be used as it is today (with the small difference that Btrfs is used instead of LVM+Ext4). There is upstream work on native encryption for Btrfs; it will be considered once ready and is the subject of a separate change proposal for a future Fedora release.
=== Optimizations (Optional) ===
The detailed description above is the proposal. It's intended to be a minimalist and transparent switch. It's also the same as was [[Features/F16BtrfsDefaultFs|proposed]] (and [https://lwn.net/Articles/446925/ accepted]) for Fedora 16. The following optimizations improve on the proposal, but are not critical. They are also transparent to most users. The general idea is to agree on the base proposal first, and then consider these as enhancements.
==== Boot on Btrfs ====
* Instead of a 1G ext4 boot, create a 1G Btrfs boot.
* Advantage: Makes it possible to include boot in a snapshot and rollback regime. GRUB has had stable support for Btrfs for 10+ years.
* Scope: Contingent on bootloader and installer team review and approval. blivet should use <code>mkfs.btrfs --mixed</code>.
==== Compression ====
* Enable transparent compression using zstd on select directories: <span style="color: tomato">/usr</span> <span style="color: tomato">/var/lib/flatpak</span> <span style="color: tomato">~/.local/share/flatpak</span>
* Advantage: Saves space and significantly increases the lifespan of flash-based media by reducing write amplification. It may improve performance in some instances.
* Scope: Contingent on installer team review and approval to enhance anaconda to perform the installation using <code>mount -o compress=zstd</code>, then set the proper XATTR for each directory. The XATTR can't be set until after the directories are created by an rsync, rpm, or unsquashfs based installation.
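A minimal sketch of the per-directory step in the Scope item above. It only builds the commands and does not run them. It assumes <code>btrfs property set PATH compression zstd</code> as the way to set the attribute; the proposal does not name the tool, and the helper name <code>compression_commands</code> is hypothetical.

```python
from typing import List

# Directories named in the proposal. The "~" entry is per user and is
# left unexpanded here on purpose.
COMPRESS_DIRS = ["/usr", "/var/lib/flatpak", "~/.local/share/flatpak"]


def compression_commands(dirs: List[str], algo: str = "zstd") -> List[List[str]]:
    """Build (but do not execute) one argv list per directory.

    Assumption: `btrfs property set <path> compression <algo>` is used to
    set the per-directory compression attribute after the install copy.
    """
    return [["btrfs", "property", "set", d, "compression", algo] for d in dirs]


if __name__ == "__main__":
    for argv in compression_commands(COMPRESS_DIRS):
        print(" ".join(argv))
```

Building the argument lists separately from running them keeps the sketch safe to try on any system, whether or not Btrfs is in use.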
==== Additional subvolumes ====
* <span style="color: tomato">/var/log/</span> <span style="color: tomato">/var/lib/libvirt/images</span> and <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> will use separate subvolumes.
* Advantage: Makes it easier to exclude them from snapshots, rollbacks, and send/receive. (Btrfs snapshotting is not recursive; it stops at a nested subvolume.)
* Scope: Anaconda knows how to do this already; just change the kickstart to add additional subvolumes (minus the subvolume in <span style="color: tomato">~/</span>). GNOME Boxes will need enhancement to detect that the user home is on Btrfs and create <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> as a subvolume.
== Feedback ==
==== Red Hat doesn't support Btrfs? Can Fedora do this? ====
Red Hat supports Fedora well, in many ways. But Fedora already works closely with, and depends on, upstreams. And this will be one of them. That's an important consideration for this proposal. The community has a stake in ensuring it is supported. Red Hat will never support Btrfs if Fedora rejects it. Fedora necessarily needs to be first, and make the persuasive case that it solves more problems than alternatives. Feature owners believe it does, hands down.
The Btrfs community has users that have been using it for most of the past decade at scale. It's been the default on openSUSE (and SUSE Linux Enterprise) since 2014, and Facebook has been using it for all their OS and data volumes, in their data centers, for almost as long. Btrfs is a mature, well-understood, and battle-tested file system, used on both desktop/container and server/cloud use-cases. We do have developers of the Btrfs filesystem maintaining and supporting the code in Fedora, one is a Change owner, so issues that are pinned to Btrfs can be addressed quickly.
==== What about device-mapper alternatives? ====
dm-thin (thin provisioning): [https://pagure.io/fedora-workstation/issue/152 Issue #152] still happens, because the installer won't overprovision by default. It still requires manual intervention by the user to identify and resolve the problem. Upon growing a file system on dm-thin, the pool is overcommitted, and file system sizes become a fantasy: they don't add up to the total physical storage available. The truth of used and free space is only known by the thin pool, and CLI and GUI programs are unprepared for this. Integration points like rpm free space checks or GNOME disk-space warnings would have to be adapted as well.
dm-vdo: not yet merged, and not as straightforward to selectively enable per directory and per file as is the case on Btrfs using <code>chattr +c</code> on <span style="color: tomato">/var/lib/flatpak/</span>.
Btrfs solves the problems that need solving, with few side effects or pitfalls for users. It has more features we can take advantage of immediately and transparently: compression, integrity, and IO isolation. Many Btrfs features and optimizations can be opted into selectively per directory or file, such as compression and nodatacow, rather than as a layer that's either on or off.
==== What about UI/UX and integration in the desktop? ====
If Btrfs isn't the default file system, there's no commitment, nor reason to work on any UI/UX integration. There are ideas to make certain features discoverable: selective compression; systemd-homed may take advantage of either Btrfs online resize, or near-term planned native encryption, which could make it possible to live convert non-encrypted homes to encrypted; and system snapshot and rollbacks.
Anaconda already has sophisticated Btrfs integration.
==== What Btrfs features are recommended and supported? ====
The primary goal of this feature is to be largely transparent to the user. It does not require or expect users to learn new commands, or to engage in peculiar maintenance rituals.
The full set of Btrfs features that is considered stable and enabled by default upstream will be enabled in Fedora. Fedora is a community project. What is supported within Fedora depends on what the community decides to put forward in terms of resources.
See the upstream [https://btrfs.wiki.kernel.org/index.php/Status Btrfs feature status page] for details.
==== Are subvolumes really mostly like directories? ====
Subvolumes behave like directories in terms of navigation in both the GUI and CLI, e.g. <code>cp</code>, <code>mv</code>, <code>du</code>, owner/permissions, and SELinux labels. They also share space, just like a directory.
But it is an incomplete answer.
A subvolume is an independent file tree, with its own POSIX namespace, and has its own pool of inodes. This means inode numbers repeat themselves on a Btrfs volume. Inodes are only unique within a given subvolume. A subvolume has its own st_dev, so if you use <code>stat FILE</code> it reports a device value referring to the subvolume the file is in. And it also means hard links can't be created between subvolumes. From this perspective, subvolumes start looking more like a separate file system. But subvolumes share most of the other trees, so they're not truly independent file systems. They're also not block devices.
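The consequences of a per-subvolume <code>st_dev</code> can be sketched in a few lines of Python. This is an illustration only: the helper names <code>file_identity</code> and <code>can_hard_link</code> are hypothetical, and the check is a necessary condition for hard linking rather than a complete one.

```python
import os
import tempfile
from typing import Tuple


def file_identity(path: str) -> Tuple[int, int]:
    """Return (st_dev, st_ino).

    Inode numbers can repeat across subvolumes, so st_ino alone does not
    identify a file; the pair with st_dev is needed.
    """
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def can_hard_link(src: str, dst_dir: str) -> bool:
    """Hard links need the source and the target directory to report the
    same st_dev. Two subvolumes report different values, so this is False
    across a subvolume boundary."""
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "file")
        open(src, "w").close()
        print(file_identity(src))
        print(can_hard_link(src, d))
```

Run against a file in one subvolume and a directory in another, <code>can_hard_link</code> is expected to report False, which matches the behaviour described above.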
== Benefit to Fedora ==
Problems Btrfs helps solve:
* Users running out of free space on either <span style="color: tomato">/</span> or <span style="color: tomato">/home</span> [https://pagure.io/fedora-workstation/issue/152 Workstation issue #152]
** "one big file system": no hard barriers like partitions or logical volumes
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
** reflinks and snapshots are more efficient for use cases like containers (Podman supports both)
* Storage devices can be flaky, resulting in data corruption
** Everything is checksummed and verified on every read
** Corrupt data results in EIO (input/output error), instead of resulting in application confusion, and isn't replicated into backups and archives
* Poor desktop responsiveness when under pressure [https://pagure.io/fedora-workstation/issue/154 Workstation issue #154]
** Currently only Btrfs has proper IO isolation capability via cgroups2
** Completes the resource control picture: memory, cpu, IO isolation
* File system resize
** Online shrink and grow are fundamental to the design
* Complex storage setups are... complicated
** Simple and comprehensive command interface. One master command
** Simpler to boot, all code is in the kernel, no initramfs complexities
** Simple and efficient file system replication, including incremental backups, with <code>btrfs send</code> and <code>btrfs receive</code>
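The "checksummed and verified on every read" point can be illustrated with a toy model. This is a sketch, not how Btrfs is implemented: the class <code>ChecksummedStore</code> is hypothetical, and it uses Python's <code>zlib.crc32</code> as a stand-in checksum, which is not the same algorithm as the Btrfs default.

```python
import errno
import zlib


class ChecksummedStore:
    """Toy block store: keeps a checksum next to each block and verifies
    it on every read."""

    def __init__(self) -> None:
        self._blocks = {}  # block number -> (data, checksum)

    def write(self, blockno: int, data: bytes) -> None:
        self._blocks[blockno] = (bytes(data), zlib.crc32(data))

    def corrupt(self, blockno: int) -> None:
        """Simulate a flaky device: flip one bit of a non-empty block
        without updating the stored checksum."""
        data, csum = self._blocks[blockno]
        damaged = bytes([data[0] ^ 0x01]) + data[1:]
        self._blocks[blockno] = (damaged, csum)

    def read(self, blockno: int) -> bytes:
        data, csum = self._blocks[blockno]
        if zlib.crc32(data) != csum:
            # Refuse to hand back bad data; the caller sees an I/O error.
            raise OSError(errno.EIO, "checksum mismatch")
        return data


if __name__ == "__main__":
    store = ChecksummedStore()
    store.write(0, b"important data")
    store.corrupt(0)
    try:
        store.read(0)
    except OSError as exc:
        print("read failed:", exc)
```

Because the read fails instead of returning damaged bytes, a backup tool reading through such a layer would stop with an error rather than copy the corruption into the archive.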
== Scope ==
* Proposal owners:
** Submit PRs for Anaconda to set <code>default_scheme = BTRFS</code> in the proper product files.
** Multiple test days: build community support network
** Aid with documentation
* Other developers:
** Anaconda, review PRs and merge
** Bootloader team, review PRs and merge
** Recommended optimization: set <code>chattr +C</code> on the containing directory for virt-manager and GNOME Boxes.
* Release engineering: [https://pagure.io/releng/issue/9545 #9545]
* Policies and guidelines: N/A
* Trademark approval: N/A
== Upgrade/compatibility impact ==
Change will not affect upgrades.
Documentation will be provided for existing Btrfs users to "retrofit" their setups to that of a default Btrfs installation (base plus any approved options).
== How To Test ==
'''''Today'''''<br /> Do a custom partitioning installation; change the scheme drop-down menu to Btrfs; click the blue "automatically create partitions"; and install.<br /> Fedora 31, 32, Rawhide, on x86_64 and ARM.
'''''Once change lands'''''<br /> It should be simple enough to test: just do a normal install.
== User Experience ==
==== Pros ====
* Mostly transparent
* Space savings from compression
* Longer lifespan of hardware, also from compression.
* Utilities for used and free space, CLI and GUI, are expected to behave the same. No special commands are required.
* More detailed information can be revealed by <code>btrfs</code> specific commands.
==== Enhancement opportunities ====
[https://bugzilla.redhat.com/show_bug.cgi?id=906591 updatedb does not index /home when /home is a bind mount] This can also affect rpm-ostree installations, including Silverblue.
[https://gitlab.gnome.org/GNOME/gnome-usage/-/issues/49 GNOME Usage: Incorrect numbers when using multiple btrfs subvolumes] This isn't Btrfs specific, happens with "one big ext4" volume as well.
[https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/88 GNOME Boxes, RFE: create qcow2 with 'nocow' option when on btrfs /home] This is Btrfs specific, and is a recommended optimization for both GNOME Boxes and virt-manager.
[https://github.com/containers/libpod/issues/6563 containers/libpod: automatically use btrfs driver if on btrfs]
== Dependencies ==
None.
== Contingency Plan ==
* Contingency mechanism: Owner will revert changes back to LVM+ext4
* Contingency deadline: Beta freeze
* Blocks release? Yes
* Blocks product? Workstation and KDE
== Documentation ==
Strictly speaking no documentation is required reading for users. But there will be some Fedora documentation to help get the ball rolling.
For those who want to know more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page btrfs wiki main page and full feature list.]
<code>man 5 btrfs</code> contains: mount options, features, swapfile support, checksum algorithms, and more<br /> <code>man btrfs</code> contains an overview of the btrfs subcommands<br /> <code>man btrfs <nowiki><subcommand></nowiki></code> will show the man page for that subcommand
NOTE: The btrfs command will accept partial subcommands, as long as it's not ambiguous. These are equivalent commands:<br /> <code>btrfs subvolume snapshot</code><br /> <code>btrfs sub snap</code><br /> <code>btrfs su sn</code>
You'll discover your own convention. It might be preferable to write out the full command on forums and lists, but then maybe some folks don't learn about this useful shortcut?
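The rule "accept a partial subcommand as long as it's not ambiguous" amounts to unique-prefix matching. The sketch below shows that idea only; it is not taken from the btrfs source, the function name <code>resolve</code> is hypothetical, and the two name lists are partial and illustrative.

```python
from typing import Iterable

# Partial, illustrative lists of command names.
TOP_LEVEL = ["balance", "check", "device", "filesystem", "property",
             "receive", "scrub", "send", "subvolume"]
SUBVOLUME = ["create", "delete", "list", "snapshot", "show"]


def resolve(token: str, candidates: Iterable[str]) -> str:
    """Expand an abbreviated command to its full name.

    An exact match wins; otherwise the token must be a prefix of exactly
    one candidate.
    """
    names = list(candidates)
    if token in names:
        return token
    matches = [n for n in names if n.startswith(token)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {token!r}")
    raise ValueError(f"ambiguous command {token!r}: {', '.join(sorted(matches))}")


if __name__ == "__main__":
    # "btrfs su sn" expands to the same words as "btrfs subvolume snapshot".
    print(resolve("su", TOP_LEVEL), resolve("sn", SUBVOLUME))
```

With these lists, <code>s</code> alone is rejected as ambiguous at both levels, which is why the shortest forms shown above are two letters long.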
For those who want to know a lot more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation Btrfs developer documentation]<br /> [https://github.com/btrfs/btrfs-dev-docs/blob/master/trees.txt Btrfs trees]
== Release Notes ==
The default file system on the desktop is Btrfs.
On 26.06.2020 16:42, Ben Cotton wrote:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
Such changes could affect Fedora reputation among other distributions.
On Fri, Jun 26, 2020 at 04:58:19PM +0200, Vitaly Zaitsev via devel wrote:
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
That certainly would be concerning, but do you have citations on this? I did a search on reddit and did not find a significant number of such complaints in the top results -- in fact, mostly positive reports. For kernel bugzilla issues, do you have numbers compared to other filesystems?
My Reddit search _did_ turn up this presentation from Usenix:
https://www.usenix.org/conference/atc19/presentation/jaffer
From that in part:
* ext4 has significantly improved over ext3 in both detection and recovery from data corruption and I/O injection errors. Our extensive test suite generates only minor errors or datalosses in the file system, in stark contrast with [a 2005 paper], where ext3 was reported to silently discard write errors.
* On the other hand, Btrfs, which is a production grade filesystem with advanced features like snapshot and cloning, has good failure detection mechanisms, but is unable to recover from errors that affect its key data structures, partially due to disabling metadata replication when deployed on SSDs.
[...]
* We notice potentially fatal omissions in error detection and recovery for all file systems except for ext4. This is concerning since technology trends, such as continually growing SSD drive capacities and increasing densities as QLC drives which are coming on the market, all seem to point towards increasing rather than decreasing SSD error rates in the future. [...]
On Fri, Jun 26, 2020 at 04:58:19PM +0200, Vitaly Zaitsev via devel wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
Anecdata… OTOH, I'm using btrfs on most of my machines. I had one data loss, when a RAM module went bad and caused corruption in the bcache attached to my btrfs /. It was the fault of neither bcache nor btrfs.
On Fri, 26 Jun 2020 at 16:05, Vitaly Zaitsev via devel < devel@lists.fedoraproject.org> wrote: [..]
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen.
I would be really interested in how you came to that conclusion (how did you measure that?). Do you have any metrics that show Linux filesystem stability?
Does anyone know any source of some data which could be used to put all Linux filesystems on some stability ruler? Maybe some FS crash statistics taken from systems working on the same/similar HW in some DCs?
kloczek
On Fri, Jun 26, 2020 at 8:58 AM Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
I've got a Samsung 840 EVO that I know has firmware bugs. Is that an ideal condition? What about compiling webkitgtk and losing control of the system under load (unresponsive GUI while the compiling continues to write)? Is it an ideal condition? And because I'm notoriously impatient, I often yank the power cord. Ideal condition? And I've done this over 100 times in the last year. Ideal condition?
100% of the subsequent cold boots, boot identical to that of a prior clean shutdown. Zero btrfs complaints. One person, one laptop, one SSD. I'm not a totally disqualified scientific sample but it's a really insignificant anecdote, other than even at this scale if there were intrinsic file system defects, I think I'd have seen it.
Question is, what happens when the firmware has a hiccup and I also get a power fail. What am I likely to see, and what do I do? When there are problems, we're used to a particular pattern with ext4. That pattern will change with btrfs. There will be fewer of some problems, more of others, and the messages will be different. fsck.ext4 is pretty much all we have, all we're used to, and it's a binary pass/fail. Even though we're talking about edge cases at this level, those who get unlucky for whatever reason are going to need a community of user to user support giving them good advice. Will Fedora?
It's also important to talk about what's left on the table *without* this change. The potential to almost transparently drop in a new file system that extends the life of user's hardware, eliminates the free space competition problem between /home and /, and allocates it more efficiently. And asks *less* of day to day users, while inviting *more* from those who want to explore more features. On the same file system.
The fear/concern component is real, it has to be addressed and not dismissed. But that component is already present with what we have. We're just used to it. Is there enough of a sense of adventure and bravery in Fedora to overcome the fear component, and in exchange we get a modern file system that actually helps us solve problems we're having today right now? And offers features that beg for future creativity and innovation?
I think the answer is yes, but the Fedora community is going to have to decide.
On 6/26/20 12:31 PM, Chris Murphy wrote:
That pattern will change with btrfs. There will be fewer of some problems, more of others, and the messages will be different. fsck.ext4 is pretty much all we have, all we're used to, and it's a binary pass/fail. Even though we're talking about edge cases at this level, those who get unlucky for whatever reason are going to need a community of user to user support giving them good advice. Will Fedora?
Well said. BTRFS is more complex and will require getting used to.
In case of FS trouble, everyone knows 'fsck' but as Josef wrote
With btrfs you are just getting started. You have several built in mount options for recovering different failures, all read only. But you have to know that they are there and how to use them.
which is both encouraging and terrifying :)
I remember that two issues that made me apprehensive wrt. BTRFS were its handling of the 'disk full' situation, and lack of a straightforward 'fsck' workflow. I think the first issue has been resolved, and we probably just need some docs and scripts that handle file system corruption by remounting R/O and printing some suggestions what to do next.
It's also important to talk about what's left on the table *without* this change. The potential to almost transparently drop in a new file system that extends the life of user's hardware, eliminates the free space competition problem between /home and /, and allocates it more efficiently. And asks *less* of day to day users, while inviting *more* from those who want to explore more features. On the same file system.
For what it's worth, this is really needed, and overdue. I have repeatedly failed Fedora OS release upgrades on different machines by running out of root fs space. I think the default / is around 50GB, and it's too easy to fill: during OS update we need space for three copies of each package: the old version, the downloaded new version, and the space to install the new version.
Even though technically dnf system-upgrade can --download-dir to a location off / it doesn't seem to work with the actual upgrade, so the only way I know is to delete largest packages (flightGear*, piglit*, KiCAD*, ...) and reinstall them after update.
One thing that hasn't been mentioned yet is that btrfs is also important for our plans to preserve system responsiveness under heavy load, https://pagure.io/fedora-workstation/issue/154.
On Fri, Jun 26, 2020 at 5:22 pm, Przemek Klosowski via devel devel@lists.fedoraproject.org wrote:
For what it's worth, this is really needed, and overdue. I have repeatedly failed Fedora OS release upgrades on different machines by running out of root fs space. I think the default / is around 50GB, and it's too easy to fill: during OS update we need space for three copies of each package: the old version, the downloaded new version, and the space to install the new version.
We raised it to 70 GB, but it's still too small. I keep running out of space too, most recently just a couple days ago. This is a problem we're determined to solve, and raising the size of / further just increases the chance of the user running out of space on /home currently, so if btrfs doesn't pass, we will (very likely) switch to single-partition ext4 (or maybe xfs). See https://pagure.io/fedora-workstation/issue/152.
On Fri, Jun 26, 2020 at 3:22 PM Przemek Klosowski via devel devel@lists.fedoraproject.org wrote:
I remember that two issues that made me apprehensive wrt. BTRFS were its handling of the 'disk full' situation, and lack of a straightforward 'fsck' workflow. I think the first issue has been resolved, and we probably just need some docs and scripts that handle file system corruption by remounting R/O and printing some suggestions what to do next.
A medium term goal is to make systemd and the desktop environment more tolerant to starting up read-only, and even though this is a limited environment the user isn't just stuck at a prompt. SUSE/openSUSE can today boot read-only snapshots as part of its rollback strategy but I'm not sure how/why it works or whether it's adaptable.
A short term goal, possibly even a requirement for the proposal, is some kind of message at a dracut prompt to at least give the user something to go on, in sequence, including even 'join us on #fedora-btrfs' or whatever. A bigger problem is that right now (a) new installs don't set a password for root user, and (b) systemd emergency target requires a root user login to get to a prompt. It has to be a mount *failure* to get to a dracut prompt where we could show some messages. There is this middle area where the user is stuck no matter the file system.
Some of these are long standing problems, but they're perhaps being spotlit by the change.
For what it's worth, this is really needed, and overdue. I have repeatedly failed Fedora OS release upgrades on different machines by running out of root fs space. I think the default / is around 50GB, and it's too easy to fill: during OS update we need space for three copies of each package: the old version, the downloaded new version, and the space to install the new version.
75G on new installs today but yes there are many folks still with a 50G root volume at /
And changing this to 80+G is sorta 'kick the can' but also as it turns out it doesn't really fix the problem that well and puts pressure on /home in cases where the laptop drive is kinda small. There are other valid ways to solve this single problem, e.g. a single plain ext4 or xfs volume. But both of those leave things on the table users benefit from.
Of course it isn't all about features. If it's just a feature contest btrfs wins somewhat dramatically. What's going to make the feature successful is the community backing it up. The change needs the desire and resources of Fedora more than just features. A dozen owners on the proposal hopefully gives confidence that it's serious, but it's going to take more than that.
On Fri, Jun 26, 2020 at 05:49:03PM -0600, Chris Murphy wrote:
For what it's worth, this is really needed, and overdue. I have repeatedly failed Fedora OS release upgrades on different machines by running out of root fs space. I think the default / is around 50GB, and it's too easy to fill: during OS update we need space for three copies of each package: the old version, the downloaded new version, and the space to install the new version.
75G on new installs today but yes there are many folks still with a 50G root volume at /
And changing this to 80+G is sorta 'kick the can' but also as it turns out it doesn't really fix the problem that well and puts pressure on /home in cases where the laptop drive is kinda small. There are other valid ways to solve this single problem, e.g. a single plain ext4 or xfs volume. But both of those leave things on the table users benefit from.
We cannot do anything for existing installs. It is up to the owner to juggle partitions. Also, with the btrfs proposal we do not have to decide how to split space between / and /home. Both are just subvolumes and share all the space.
On 6/26/20 2:22 PM, Przemek Klosowski via devel wrote:
Even though technically dnf system-upgrade can --download-dir to a location off / it doesn't seem to work with the actual upgrade, so the only way I know is to delete largest packages (flightGear*, piglit*, KiCAD*, ...) and reinstall them after update.
Somewhat off-topic, but you can symlink /var/lib/dnf/system-upgrade to somewhere else and all the downloaded packages will be stored there and the upgrade will still work. (As long as the linked storage is automatically mounted at boot.) I've even used a USB flash drive for this.
On Fri, Jun 26, 2020 at 04:58:19PM +0200, Vitaly Zaitsev via devel wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
I don't have any info to either confirm or refute this assertion, but I want to say we should be careful to actually compare apples to apples.
btrfs is not a 1-1 equivalent of ext4, because the scope of btrfs is much broader. It should likely be compared against some combo of existing functionality, such as ext4+devicemapper, to get a fairer picture.
It isn't just a matter of whether the kernel parts are reliable. It is also important how well the userspace tools fit together to form the end user solution. This impacts how likely it is for the user to shoot themselves in the foot when making changes to their storage stack.
Regards, Daniel
On Fri, Jun 26, 2020 at 05:32:45PM +0100, Daniel P. Berrangé wrote:
btrfs is not a 1-1 equivalent of ext4, because the scope of btrfs is much broader. It should likely be compared against some combo of existing functionality, such as ext4+devicemapper, to get a fairer picture.
Well, specifically, we should compare the existing default partitioning scheme.
On 26 June 2020 16:58:19 CEST, Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
Such changes could affect Fedora reputation among other distributions.
I strongly agree. BTRFS has been 5 years from production ready for almost a decade now; please don't force this on users who don't know any better.
On Fri, Jun 26, 2020 at 8:45 pm, Markus Larsson qrsbrwn@uidzero.se wrote:
I strongly agree. BTRFS has been 5 years from production ready for almost a decade now; please don't force this on users who don't know any better.
This is hard to square with the fact that it's already being used in production on millions of systems. It's also hard to square with the data presented by Josef -- the only hard evidence I've seen on the topic of filesystem reliability -- which shows btrfs is an order of magnitude more reliable than xfs (although we don't know how it compares to ext4). Surely if xfs is good enough for RHEL, and btrfs is at least 10x more reliable than xfs, that suggests btrfs should probably be good enough for Fedora?
Do you have any real evidence for your claim that would be more convincing than what Josef has presented?
On 26 June 2020 21:04:00 CEST, Michael Catanzaro mcatanzaro@gnome.org wrote:
On Fri, Jun 26, 2020 at 8:45 pm, Markus Larsson qrsbrwn@uidzero.se wrote:
I strongly agree. BTRFS has been 5 years from production ready for almost a decade now; please don't force this on users who don't know any better.
This is hard to square with the fact that it's already being used in production on millions of systems. It's also hard to square with the data presented by Josef -- the only hard evidence I've seen on the topic of filesystem reliability -- which shows btrfs is an order of magnitude more reliable than xfs (although we don't know how it compares to ext4). Surely if xfs is good enough for RHEL, and btrfs is at least 10x more reliable than xfs, that suggests btrfs should probably be good enough for Fedora?
Do you have any real evidence for your claim that would be more convincing than what Josef has presented?
Josef's server parks are a somewhat different use case from laptops, as other people have already pointed out. If you want data on how it works in a desktop/laptop scenario, talk to openSUSE users about how many times the "btrfs randomly ate my volume" bug was "fixed".
When I ran an environment of about 4500 SLES and about 5000 RHEL servers, btrfs failed about 3 times as often as xfs (this is from our own in-house statistics). That was 3 years ago, but filesystems take a long time to mature, and I have been keeping an ear on openSUSE to see where it goes. Is this as big as Josef's environment? No, but it is first-hand data to me (to you it is of course just anecdotal evidence).
BTRFS has the potential to become great, I just think it isn't there yet and it'll take 5 years of smooth sailing to convince me.
devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Fri, 2020-06-26 at 21:22 +0200, Markus Larsson wrote:
On 26 June 2020 21:04:00 CEST, Michael Catanzaro < mcatanzaro@gnome.org> wrote:
On Fri, Jun 26, 2020 at 8:45 pm, Markus Larsson qrsbrwn@uidzero.se wrote:
I strongly agree. BTRFS has been 5 years from production ready for almost a decade now; please don't force this on users who don't know any better.
This is hard to square with the fact that it's already being used in production on millions of systems. It's also hard to square with the data presented by Josef -- the only hard evidence I've seen on the topic of filesystem reliability -- which shows btrfs is an order of magnitude more reliable than xfs (although we don't know how it compares to ext4). Surely if xfs is good enough for RHEL, and btrfs is at least 10x more reliable than xfs, that suggests btrfs should probably be good enough for Fedora?
Do you have any real evidence for your claim that would be more convincing than what Josef has presented?
Josef's server parks are a somewhat different use case from laptops, as other people have already pointed out. If you want data on how it works in a desktop/laptop scenario, talk to openSUSE users about how many times the "btrfs randomly ate my volume" bug was "fixed".
When I ran an environment of about 4500 SLES and about 5000 RHEL servers, btrfs failed about 3 times as often as xfs (this is from our own in-house statistics). That was 3 years ago, but filesystems take a long time to mature, and I have been keeping an ear on openSUSE to see where it goes. Is this as big as Josef's environment? No, but it is first-hand data to me (to you it is of course just anecdotal evidence).
Keep in mind that SLES does backport btrfs patches because they support it. RHEL does not. And we are talking about Fedora here anyway.
BTRFS has the potential to become great, I just think it isn't there yet and it'll take 5 years of smooth sailing to convince me.
Probably you should try it with Fedora's kernel?
-- Igor Raits ignatenkobrain@fedoraproject.org
On 26 June 2020 21:32:31 CEST, Igor Raits ignatenkobrain@fedoraproject.org wrote:
Josef's server parks are a somewhat different use case from laptops, as other people have already pointed out. If you want data on how it works in a desktop/laptop scenario, talk to openSUSE users about how many times the "btrfs randomly ate my volume" bug was "fixed".
When I ran an environment of about 4500 SLES and about 5000 RHEL servers, btrfs failed about 3 times as often as xfs (this is from our own in-house statistics). That was 3 years ago, but filesystems take a long time to mature, and I have been keeping an ear on openSUSE to see where it goes. Is this as big as Josef's environment? No, but it is first-hand data to me (to you it is of course just anecdotal evidence).
Keep in mind that SLES does backport btrfs patches because they support it. RHEL does not. And we are talking about Fedora here anyway.
We didn't use BTRFS on any RHEL machines only on the SLES ones.
BTRFS has the potential to become great, I just think it isn't there yet and it'll take 5 years of smooth sailing to convince me.
Probably you should try it with Fedora's kernel?
Oh, I will, when BTRFS has had smooth sailing for 5 years. I have no problem with others running btrfs; I just don't think it should be the default, since we seem to be all about not creating problems for new users. That's at least what the recent changes tell me :)
Igor Raits ignatenkobrain@fedoraproject.org
On Fri, Jun 26, 2020 at 2:05 PM Michael Catanzaro mcatanzaro@gnome.org wrote:
On Fri, Jun 26, 2020 at 8:45 pm, Markus Larsson qrsbrwn@uidzero.se wrote:
I strongly agree. BTRFS has been 5 years from production ready for almost a decade now; please don't force this on users who don't know any better.
This is hard to square with the fact that it's already being used in production on millions of systems. It's also hard to square with the data presented by Josef -- the only hard evidence I've seen on the topic of filesystem reliability -- which shows btrfs is an order of magnitude more reliable than xfs (although we don't know how it compares to ext4). Surely if xfs is good enough for RHEL, and btrfs is at least 10x more reliable than xfs, that suggests btrfs should probably be good enough for Fedora?
Saying production on millions of systems is a bit misleading here, when you are talking about millions of systems at a single company.
On 2020-06-26 22:13, Justin Forbes wrote:
Saying production on millions of systems is a bit misleading here, when you are talking about millions of systems at a single company.
...in a redundant configuration where losing a disk is tolerated by design and managing data that have very low value (mostly pictures of cats and random chats).
Filesystem quality must be measured in other conditions: have a Postgres on it, financial transactions, random blackouts, etc.
On Sat, 2020-06-27 at 10:35 +0200, Roberto Ragusa wrote:
On 2020-06-26 22:13, Justin Forbes wrote:
Saying production on millions of systems is a bit misleading here, when you are talking about millions of systems at a single company.
...in a redundant configuration where losing a disk is tolerated by design and managing data that have very low value (mostly pictures of cats and random chats).
Filesystem quality must be measured in other conditions: have a Postgres on it, financial transactions, random blackouts, etc.
Do you run postgres, financial transactions and random blackouts on your laptop / workstation? If so, isn't it just for testing purposes?
I'm not saying that it is not important to have a stable filesystem and such, just that typical workstation workloads do not utilize disks that much (if at all?).
-- Roberto Ragusa mail at robertoragusa.it
-- Igor Raits ignatenkobrain@fedoraproject.org
Le samedi 27 juin 2020 à 10:47 +0200, Igor Raits a écrit :
Do you run postgres, financial transactions and random blackouts on your laptop / workstation? If so, isn't it just for testing purposes?
Workstations are full of high-value personal data, because home users do not have an IT organisation to back it up in a professional way.
On Sat, Jun 27, 2020 at 10:59:57AM +0200, Nicolas Mailhot via devel wrote:
Le samedi 27 juin 2020 à 10:47 +0200, Igor Raits a écrit :
Do you run postgres, financial transactions and random blackouts on your laptop / workstation? If so, isn't it just for testing purposes?
Workstations are full of high-value personal data, because home users do not have an IT organisation to back it up in a professional way.
That's why I have my personal, valuable and irreplaceable data (photos, contracts, etc.) on btrfs raid1. I do backups regularly, but only btrfs is able to catch and correct silent corruptions. Which do happen. Without btrfs, I could be happily backing up corrupted photos.
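To illustrate the point about catching and correcting silent corruption on a mirrored (raid1) setup, here is a minimal toy sketch in Python. It is not how btrfs is implemented: the block layout, the function names (`write_block`, `is_valid`, `read_mirrored`) and the use of `zlib.crc32` as the checksum are all assumptions made for illustration. The idea it shows is only the general one: each copy carries a checksum, a read verifies it, and a bad copy is rewritten from a copy that still verifies.

```python
import zlib


def write_block(data: bytes) -> dict:
    """Store a block together with a checksum of its contents (toy layout)."""
    return {"data": data, "csum": zlib.crc32(data)}


def is_valid(block: dict) -> bool:
    """A block is valid if its stored checksum still matches its data."""
    return zlib.crc32(block["data"]) == block["csum"]


def read_mirrored(copies: list) -> bytes:
    """Return data from the first copy that verifies and repair the others.

    Raises IOError if no copy passes verification, since there is then
    nothing trustworthy to repair from.
    """
    good = next((c for c in copies if is_valid(c)), None)
    if good is None:
        raise IOError("all copies failed checksum verification")
    for c in copies:
        if not is_valid(c):
            # Overwrite the corrupted copy with the verified one.
            c["data"] = good["data"]
            c["csum"] = good["csum"]
    return good["data"]


# Two mirrored copies of the same block.
original = b"holiday photo bytes"
mirror = [write_block(original), write_block(original)]

# Simulate silent corruption: flip one bit in copy 0, leave its checksum alone.
damaged = bytearray(mirror[0]["data"])
damaged[0] ^= 0x01
mirror[0]["data"] = bytes(damaged)

recovered = read_mirrored(mirror)
```

Without the stored checksum, the flipped bit in the first copy would be returned (and backed up) as if it were good data; with a single copy, the corruption could be detected but not repaired.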
On Saturday, June 27, 2020, Nicolas Mailhot via devel < devel@lists.fedoraproject.org> wrote:
Le samedi 27 juin 2020 à 10:47 +0200, Igor Raits a écrit :
Do you run postgres, financial transactions and random blackouts on your laptop / workstation? If so, isn't it just for testing purposes?
Workstations are full of high-value personal data, because home users do not have an IT organisation to back it up in a professional way.
Speaking of backups some popular (even cheap) NAS systems for "home users" do use btrfs - those users also do not have professional IT support to help them.
-- Nicolas Mailhot
On Sat, Jun 27, 2020 at 06:41:17PM +0200, drago01 wrote:
Speaking of backups some popular (even cheap) NAS systems for "home users" do use btrfs - those users also do not have professional IT support to help them.
Um, yes they do -- the folks who supplied the NAS software, aka the device manufacturer. For that equipment, btrfs is an implementation detail, completely hidden from the user.
- Solomon
On 2020-06-27 10:47, Igor Raits wrote:
On Sat, 2020-06-27 at 10:35 +0200, Roberto Ragusa wrote:
On 2020-06-26 22:13, Justin Forbes wrote:
Saying production on millions of systems is a bit misleading here, when you are talking about millions of systems at a single company.
...in a redundant configuration where losing a disk is tolerated by design and managing data that have very low value (mostly pictures of cats and random chats).
Filesystem quality must be measured in other conditions: have a Postgres on it, financial transactions, random blackouts, etc.
Do you run postgres, financial transactions and random blackouts on your laptop / workstation? If so, isn't it just for testing purposes?
No, but I do run on my laptop/workstation the same technologies that have been proven to be good for serious stuff. That is the fundamental Linux advantage, or at least has always been, and that's why I'm using Linux daily since when other people were waiting for the release of Win95.
On Sat, Jun 27, 2020 at 9:30 AM Roberto Ragusa mail@robertoragusa.it wrote:
On 2020-06-27 10:47, Igor Raits wrote:
On Sat, 2020-06-27 at 10:35 +0200, Roberto Ragusa wrote:
On 2020-06-26 22:13, Justin Forbes wrote:
Saying production on millions of systems is a bit misleading here, when you are talking about millions of systems at a single company.
...in a redundant configuration where losing a disk is tolerated by design and managing data that have very low value (mostly pictures of cats and random chats).
Filesystem quality must be measured in other conditions: have a Postgres on it, financial transactions, random blackouts, etc.
Do you run postgres, financial transactions and random blackouts on your laptop / workstation? If so, isn't it just for testing purposes?
No, but I do run on my laptop/workstation the same technologies that have been proven to be good for serious stuff. That is the fundamental Linux advantage, or at least has always been, and that's why I'm using Linux daily since when other people were waiting for the release of Win95.
By that metric, Btrfs qualifies, as it's the default filesystem on SUSE Linux Enterprise (and has been since 2014). SUSE has built several products specifically on top of Btrfs, including their Kubernetes product, which relies on Btrfs features to offer safety and high performance.
And Facebook runs it for nearly all their infrastructure, as noted by Josef upthread.
Google uses it to power Crostini, the Linux environment on Chrome OS.
Synology uses it for their NAS products by default since 2016 with DSM 6.0.
I can keep going on, but I think this shows that Btrfs is a mature, battle-tested filesystem used for *very* serious workloads.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Sat, Jun 27, 2020 at 09:39:36AM -0400, Neal Gompa wrote:
By that metric, Btrfs qualifies, as it's the default filesystem on SUSE Linux Enterprise (and has been since 2014). SUSE has built
One thing I'd like to see addressed.
Back in the RHEL7.4 days, btrfs was explicitly deprecated:
"The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.
"The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature."
So, why did SuSE consider BTRFS "ready" while RedHat did not, to the point of removing support for it? And what has changed since then?
- Solomon
On 27 June 2020 16:17:16 CEST, Solomon Peachy pizza@shaftnet.org wrote:
On Sat, Jun 27, 2020 at 09:39:36AM -0400, Neal Gompa wrote:
By that metric, Btrfs qualifies, as it's the default filesystem on SUSE Linux Enterprise (and has been since 2014). SUSE has built
One thing I'd like to see addressed.
Back in the RHEL7.4 days, btrfs was explicitly deprecated:
"The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.
"The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature."
So, why did SuSE consider BTRFS "ready" while RedHat did not, to the point of removing support for it? And what has changed since then?
I don't know how to say this without throwing shade, so here goes anyway. Anyone who has worked with both RHEL and SLES systems knows why. My impression from working with both products in large-scale heterogeneous environments is that SLES is many times less reliable than RHEL. I don't know exactly how or why, because SuSE has many talented people on its payroll and does good work in many areas; it's just that when it's time to put SLES together, it isn't very reliable. I'm sorry for the harsh words; I just don't know how to put it any other way.
/Markus
On Sat, Jun 27, 2020 at 10:17 AM Solomon Peachy pizza@shaftnet.org wrote:
On Sat, Jun 27, 2020 at 09:39:36AM -0400, Neal Gompa wrote:
By that metric, Btrfs qualifies, as it's the default filesystem on SUSE Linux Enterprise (and has been since 2014). SUSE has built
One thing I'd like to see addressed.
Back in the RHEL7.4 days, btrfs was explicitly deprecated:
"The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.
"The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature."
So, why did SuSE consider BTRFS "ready" while RedHat did not, to the point of removing support for it? And what has changed since then?
Red Hat deprecated it because they have zero engineers knowledgeable about Btrfs in a way that they could regularly and meaningfully contribute to its development upstream and maintain it for the Red Hat Enterprise Linux kernel. They all left for different companies over the past several years. That situation has not changed at Red Hat to the best of my knowledge.
However, Fedora, as the cutting edge platform that uses new technologies first, is not bound by Red Hat's lack of staff on Btrfs. Indeed, one of the change owners (Josef Bacik) does not work at Red Hat, but is an upstream Btrfs developer who is helping to push this change.
Perhaps with Fedora adopting Btrfs, this may change in the future. I do not know. But as a Fedoran, I want Fedora to use the best technology we have to solve problems. My firm belief is that Btrfs is that for the problems we are facing today.
-- 真実はいつも一つ!/ Always, there's only one truth!
By that metric, Btrfs qualifies, as it's the default filesystem on SUSE Linux Enterprise (and has been since 2014). SUSE has built
One thing I'd like to see addressed.
Back in the RHEL7.4 days, btrfs was explicitly deprecated:
"The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.
"The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature."
So, why did SuSE consider BTRFS "ready" while RedHat did not, to the point of removing support for it? And what has changed since then?
I suspect (I do not know, so this is purely my own opinion) that it was because there was not the internal knowledge to be able to support the filesystem, or yet another filesystem, for paying customers. Adding support for something like a filesystem where you have paying customers is not something taken lightly: customers tend to like their data, and enterprise support is only as good as its response when the absolute worst possible thing happens. Having worked at hosting providers and consulted onsite at some very large companies, I know from experience that a lot of enterprises take more time over decisions to change storage platforms and storage options than probably all other decisions combined. An outage of a load balancer or a network switch can be dealt with via resiliency, and replacing them is quite straightforward if a device doesn't live up to expectations. Storage is quite the opposite: data corruption is often not easy to recover from, so new filesystems are not something some customers take lightly.
Peter
On 6/27/20 4:35 AM, Roberto Ragusa wrote:
On 2020-06-26 22:13, Justin Forbes wrote:
Saying production on millions of systems is a bit misleading here, when you are talking about millions of systems at a single company.
...in a redundant configuration where losing a disk is tolerated by design and managing data that have very low value (mostly pictures of cats and random chats).
Huh? I can assure you that we care very much about our users' data, and do not lose "cat pictures" randomly and call it a day. If you are going to make technical arguments then I'm happy to talk about actual issues, but insulting the hard work we put into maintaining a very high quality production environment is not helpful or relevant. Thanks,
Josef
On Fri, 2020-06-26 at 14:04 -0500, Michael Catanzaro wrote:
[...]
Surely if xfs is good enough for RHEL, and btrfs is at least 10x more reliable than xfs, that suggests btrfs should
This is a good argument for having Fedora officially support BtrFS as a possible installation option, yes; but a _default_ filesystem needs to be absolutely tried-and-true, and I believe that BtrFS has not yet been put through its paces well enough for this. Ext4 should remain the default FS for now.
BtrFS might have significant feature advantage over Ext4, yes; but so far it has only seen production use from a handful of companies, and even then, not for very long. (The BtrFS wiki page [1] lists most of these users only as of late-2018 -- just under two years.)
In contrast to this, Ext4 has been the default FS in many enterprise systems for well over a decade: For instance, Google transitioned its systems to Ext4 in January 2010 [2], and transitioned Android to Ext4 in December 2010 [3]. And most modern Linux distros made Ext4 their default filesystem at around the same time.
Moreover, putting aside any issues of stability, Ext4 performance and interactivity continue to beat those of BtrFS except in very specific scenarios. For example, in this December 2019 Phoronix benchmark [4], BtrFS was better under some RAID setups, but consistently slightly worse than Ext4 for the single-disk use cases. Considering that most desktops and laptops (minus some workstation-class PCs and laptops, like a ThinkPad P-series perhaps) use single disks instead of RAID arrays, it does not make sense yet to suggest a default filesystem that will hamper performance for the sake of features (like support for 16 EiB volumes and snapshots) that most users will probably not use or care about.
In summary: Yes, it would be very cool for Fedora to support BtrFS; but it should not yet be the default filesystem. It still needs a lot of time to mature and stabilize.
[1] https://btrfs.wiki.kernel.org/index.php/Production_Users
[2] https://lists.openwall.net/linux-ext4/2010/01/04/8
[3] https://thunk.org/tytso/blog/2010/12/12/android-will-be-using-ext4-starting-...
[4] https://www.phoronix.com/scan.php?page=article&item=linux54-hdd-raid
On Fri, Jun 26, 2020 at 6:17 PM Peter Gordon peter@thecodergeek.com wrote:
This is a good argument for having Fedora officially support BtrFS as a possible installation option, yes;
It is already a release-blocking (supported) file system, available as an install-time option. It has been for ~10 years.
BtrFS might have significant feature advantage over Ext4, yes; but so far it has only seen production use from a handful of companies, and even then, not for very long. (The BtrFS wiki page [1] lists most of these users only as of late-2018 -- just under two years.)
Facebook since 2015. SUSE/openSUSE on the desktop and on servers since 2014, by default. Are you suggesting they can do it and we can't?
In contrast to this, Ext4 has been the default FS in many enterprise systems for well over a decade: For instance, Google transitioned its systems to Ext4 in January 2010 [2], and transitioned Android to Ext4 in December 2010 [3]. And most modern Linux distros made Ext4 their default filesystem at around the same time.
Google has been using btrfs as part of Crostini, which I mention up thread, as the file system to support native Linux apps on Chrome OS. It would appear they're choosing different things for different purposes to solve specific problems.
And in Fedora we think users want to improve the life of their hardware, get better efficiency with reflinks and snapshots for containers, and improve the responsiveness of the desktop by including IO isolation as part of a better resource control solution, and not have corrupt data pass through to user space or to their backups, silently. Btrfs provides all of these things, and helps solve users' problems.
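As a rough illustration of why reflinks and snapshots are cheap on a copy-on-write file system, here is a toy model in Python. It is only a sketch under stated assumptions: the `ToyVolume` class and its methods are hypothetical and do not correspond to any btrfs interface. It shows the general idea that a snapshot shares existing blocks with its source, and that a later write replaces only the block that changed.

```python
class ToyVolume:
    """Toy copy-on-write volume: a list of references to immutable blocks."""

    def __init__(self, blocks=None):
        # Copy the list of references, not the block contents themselves.
        self.blocks = list(blocks or [])

    def snapshot(self):
        """Create a snapshot that shares every block with this volume."""
        return ToyVolume(self.blocks)

    def write(self, index, data):
        """Point this volume at a new block; other holders keep the old one."""
        self.blocks[index] = data


vol = ToyVolume([b"block-0", b"block-1"])
snap = vol.snapshot()

# Right after the snapshot, both volumes reference the very same blocks.
shared_before = all(a is b for a, b in zip(vol.blocks, snap.blocks))

# Writing to the live volume leaves the snapshot's view untouched.
vol.write(0, b"block-0-modified")
```

In this model the snapshot costs one list of references regardless of how much data the volume holds, and after the write only block 0 differs between `vol` and `snap`; block 1 is still shared.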
On 27 June 2020 03:21:32 CEST, Chris Murphy lists@colorremedies.com wrote:
On Fri, Jun 26, 2020 at 6:17 PM Peter Gordon peter@thecodergeek.com wrote:
Facebook since 2015. SUSE/openSUSE on the desktop and on servers since 2014, by default. Are you suggesting they can do it and we can't?
There's a difference between "can" and "should". I find this "<other guy> can do this are you less of a man than <other guy>" tiresome. When SLES made the switch they did only recommend it for system data not production data because it kept breaking and data loss is painful. That was still the case 3 years ago, if they have reconsidered it has been done later than that. It's very clear from both the openSUSE and the Arch community that btrfs has higher failure rates than ext4 and the rate of catastrophic failure is non-negligible. To push for btrfs is doing a disservice to the new users and the not yet competent.
On Sat, Jun 27, 2020 at 3:12 AM Markus Larsson qrsbrwn@uidzero.se wrote:
There's a difference between "can" and "should". I find this "<other guy> can do this are you less of a man than <other guy>" tiresome.
Yes, I also find it tiresome when people make grandiose claims of having facts on their side, and yet provide none, but inject hyperbole into the conversation instead.
When SLES made the switch they did only recommend it for system data not production data because it kept breaking and data loss is painful. That was still the case 3 years ago, if they have reconsidered it has been done later than that. It's very clear from both the openSUSE and the Arch community that btrfs has higher failure rates than ext4 and the rate of catastrophic failure is non-negligible.
Excellent! Provide the data. I'm looking forward to seeing this very clear data. You can provide it, today?
I'm not using Btrfs because Facebook does or SUSE does. I'm using it because I trust it, I value my data, I value the contents of my wallet (money), and I'm saving time overall in my myriad use of it for work and for testing Fedora. I've seen btrfs catch corruption other file systems aren't designed to, however rare these are at an individual level. Every day I benefit from compression, reflink copies and snapshots, however incremental that benefit. And yet, I mostly interact with it just like any other file system. It is an exceptionally ordinary experience most of the time.
To push for btrfs is doing a disservice to the new users and the not yet competent.
This is not at all persuasive.
On 27 June 2020 17:55:09 CEST, Chris Murphy lists@colorremedies.com wrote:
On Sat, Jun 27, 2020 at 3:12 AM Markus Larsson qrsbrwn@uidzero.se wrote:
There's a difference between "can" and "should". I find this "<other guy> can do this are you less of a man than <other guy>" tiresome.
Yes, I also find it tiresome when people make grandiose claims of having facts on their side, and yet provide none, but inject hyperbole into the conversation instead.
When SLES made the switch they did only recommend it for system data not production data because it kept breaking and data loss is painful. That was still the case 3 years ago, if they have reconsidered it has been done later than that. It's very clear from both the openSUSE and the Arch community that btrfs has higher failure rates than ext4 and the rate of catastrophic failure is non-negligible.
Excellent! Provide the data. I'm looking forward to seeing this very clear data. You can provide it, today?
The actual data I will never ever be able to share. I have ended my time at that particular company but even when I was there I was not permitted to share such data. Or did you mean data from openSUSE and Arch? Just have a look at their bug trackers. You can dismiss it as anecdotal, that's fine. You could also try to see why someone would get the view that I hold. I have no problem with Fedora supporting btrfs, I have a problem with having it as the default option. This is because my experience tells me that it isn't ready yet. Josef has a different view and that's good, even fine tbh. Disagreement is good, that's how mistakes are avoided.
That said, arguing doesn't do much good now, the decision looks like it has already been made.
On Sat, Jun 27, 2020 at 10:21 AM Markus Larsson qrsbrwn@uidzero.se wrote:
The actual data I will never ever be able to share. I have ended my time at that particular company but even when I was there I was not permitted to share such data. Or did you mean data from openSUSE and Arch?
Whatever data makes the claim "very clear."
Just have a look at their bug trackers. You can dismiss it as anecdotal, that's fine. You could also try to see why someone would get the view that I hold. I have no problem with Fedora supporting btrfs, I have a problem with having it as the default option. This is because my experience tells me that it isn't ready yet. Josef has a different view and that's good, even fine tbh. Disagreement is good, that's how mistakes are avoided.
I agree which is why we need to be very clear about what you mean by failure. Intrinsic btrfs failures? Or that btrfs is more sensitive to hardware failures?
And also your recommendation necessarily means choosing a shorter lifespan for more people's hardware. It means leaving other useful features we could take advantage of, off the table. There is a choice to be made, no matter what.
How do you assess the value of extending the life of most people's hardware, to the negative UX shift in the disaster recovery pattern? That is difficult to assess objectively, so I don't dispute a subjective component to this evaluation. But we have to be clear about all the parts being evaluated and not just focus on worry.
That said, arguing doesn't do much good now, the decision looks like it has already been made.
It is definitely not made.
On 27 June 2020 17:55:09 CEST, Chris Murphy <lists(a)colorremedies.com> wrote:
The actual data I will never ever be able to share. I have ended my time at that particular company but even when I was there I was not permitted to share such data. Or did you mean data from openSUSE and Arch? Just have a look at their bug trackers.
Our bugtracker (openSUSE bugzilla that is) has been curiously silent about btrfs issues recently. Actually, ever since we switched from a btrfs + xfs setup to pure btrfs (with an improved subvolume layout), we have seen way fewer complaints about most of the issues the users had previously.
LCP [Stasiek] https://lcp.world
On Sat, 2020-06-27 at 22:59 +0000, Stasiek Michalski wrote:
On 27 June 2020 17:55:09 CEST, Chris Murphy <lists(a)colorremedies.com> wrote:
The actual data I will never ever be able to share. I have ended my time at that particular company but even when I was there I was not permitted to share such data. Or did you mean data from openSUSE and Arch? Just have a look at their bug trackers.
Our bugtracker (openSUSE bugzilla that is) has been curiously silent about btrfs issues recently. Actually, ever since we switched from a btrfs + xfs setup to pure btrfs (with an improved subvolume layout), we have seen way fewer complaints about most of the issues the users had previously.
(Hi LCP I hope life is good) That's great to hear, just a few questions. Lately, how would you rate that in number of years? While you seem to have pulled through, would you say the switch to btrfs as default has been painful? But to summarize, I'm mainly glad you have fewer issues with btrfs now.
/M
LCP [Stasiek] https://lcp.world _______________________________________________ devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Sat, 2020-06-27 at 22:59 +0000, Stasiek Michalski wrote:
(Hi LCP I hope life is good) That's great to hear, just a few questions. Lately, how would you rate that in number of years?
The change to partitioning occurred in November of 2018 iirc, so it's over 1.5 years
While you seem to have pulled through, would you say the switch to btrfs as default has been painful?
Oh yeah, it was a nightmare. If you plan on using ancient kernels, you will not have a particularly great time with certain btrfs features either. The initial switch happened in 2014, and at the time the lack of confidence showed itself through the choice of a secondary filesystem for /home. However, I feel like the reputation has been more problematic than the filesystem, since every single problem with the low level stuff has been attributed to btrfs for so many years now.
But to summarize, I'm mainly glad you have fewer issues with btrfs now.
I'm happier the users have fewer issues
LCP [Stasiek] https://lcp.world
On Sat, Jun 27, 2020 at 6:47 PM Stasiek Michalski stasiek@michalski.cc wrote:
On Sat, 2020-06-27 at 22:59 +0000, Stasiek Michalski wrote:
(Hi LCP I hope life is good) That's great to hear, just a few questions. Lately, how would you rate that in number of years?
The change to partitioning occurred in November of 2018 iirc, so it's over 1.5 years
While you seem to have pulled through, would you say the switch to btrfs as default has been painful?
Oh yeah, it was a nightmare. If you plan on using ancient kernels, you will not have a particularly great time with certain btrfs features either. The initial switch happened in 2014, and at the time the lack of confidence showed itself through the choice of a secondary filesystem for /home. However, I feel like the reputation has been more problematic than the filesystem, since every single problem with the low level stuff has been attributed to btrfs for so many years now.
I wonder to what degree some of the problems, especially enospc bugs, were exacerbated by a somewhat small root for btrfs combined with a fairly aggressive snapshotting regime by default? I agree with the "shoot the messenger" problem with btrfs. It's a victim of its own design: reports the facts, but doesn't assign blame.
I'm happier the users have fewer issues
Agreed. What do you think are the biggest remaining issues you have with btrfs? Or even not directly btrfs, but side effects that are still unresolved? Any desktop integration issues that stand out in particular?
On Sat, Jun 27, 2020 at 6:47 PM Stasiek Michalski <stasiek(a)michalski.cc> wrote:
I wonder to what degree some of the problems, especially enospc bugs, were exacerbated by a somewhat small root for btrfs combined with a fairly aggressive snapshotting regime by default? I agree with the "shoot the messenger" problem with btrfs. It's a victim of its own design: reports the facts, but doesn't assign blame.
Yeah, some mistakes were made when handling the root size, and there were some other issues with openQA when trying to fix it. Richard Brown had a fun couple of weeks with that stuff, but it was all worth the effort. We didn't change much with how aggressively everything is snapshotted, because in practice, since most desktop updates are done on live systems (obviously excluding ro filesystems with transactional/atomic updates), everything can go wrong, both pre and post the transaction, so every snapshot might be the one you need
Agreed. What do you think are the biggest remaining issues you have with btrfs? Or even not directly btrfs, but side effects that are still unresolved? Any desktop integration issues that stand out in particular?
There is no gui for basically anything btrfs related anywhere, since SUSE has had close to 0 interest in desktop for around 10 years. Since I heard there is nobody maintaining gnome-disk-utility, I might have some motivation to help out with it, since I am a huge fan of it, so we will see how much time I have over the coming weeks to implement things there. We wouldn't want it to die like banshee, would we?
LCP [Stasiek] https://lcp.world
On Sat, Jun 27, 2020 at 8:05 PM Stasiek Michalski stasiek@michalski.cc wrote:
Yeah, some mistakes were made when handling the root size, and there were some other issues with openQA when trying to fix it. Richard Brown had a fun couple of weeks with that stuff, but it was all worth the effort. We didn't change much with how aggressively everything is snapshotted, because in practice, since most desktop updates are done on live systems (obviously excluding ro filesystems with transactional/atomic updates), everything can go wrong, both pre and post the transaction, so every snapshot might be the one you need
Can you elaborate on the sorts of reasons you'd need the pre rolled back versus the post? I imagine one is more common to use as a rollback than the other.
Agreed. What do you think are the biggest remaining issues you have with btrfs? Or even not directly btrfs, but side effects that are still unresolved? Any desktop integration issues that stand out in particular?
There is no gui for basically anything btrfs related anywhere, since SUSE has had close to 0 interest in desktop for around 10 years. Since I heard there is nobody maintaining gnome-disk-utility, I might have some motivation to help out with it, since I am a huge fan of it, so we will see how much time I have over the coming weeks to implement things there. We wouldn't want it to die like banshee, would we?
That would be cool. There are some notes about this in the tracker for the proposal we're using, #153. In particular when I think of the layout (open)SUSE is using, I'd think you probably don't want to show all subvolumes in this interface, let alone subvolume snapshots (many of those on an (open)SUSE system!)
On Sat, Jun 27, 2020 at 8:05 PM Stasiek Michalski <stasiek(a)michalski.cc> wrote:
Can you elaborate on the sorts of reasons you'd need the pre rolled back versus the post? I imagine one is more common to use as a rollback than the other.
Post is usually used when something else goes wrong with the system, outside of use cases foreseen by the automated snapshots. So it depends when the issue happens, and which part of the system caused it. Obviously we can't expect the user to make snapshots before doing something potentially dangerous, so after the last update seems like a good restore point for a system
That would be cool. There are some notes about this in the tracker for the proposal we're using, #153. In particular when I think of the layout (open)SUSE is using, I'd think you probably don't want to show all subvolumes in this interface, let alone subvolume snapshots (many of those on an (open)SUSE system!)
Yup, I already had a look in places, to see what is needed, and what people expect from gdu and associated utilities
LCP [Stasiek] https://lcp.world
On Sun, Jun 28, 2020 at 2:05 am, Stasiek Michalski stasiek@michalski.cc wrote:
There is no gui for basically anything btrfs related anywhere, since SUSE has had close to 0 interest in desktop for around 10 years. Since I heard there is nobody maintaining gnome-disk-utility, I might have some motivation to help out with it, since I am a huge fan of it, so we will see how much time I have over the coming weeks to implement things there. We wouldn't want it to die like banshee, would we?
It's being maintained by Kai Lüke, but certainly doesn't appear to be under active development. I'm sure he would appreciate help. :) Certainly, nobody has volunteered to work on btrfs support there. I know udisks2 has btrfs API, though, which should help.
I'm strongly against this proposal. BTRFS is the most unstable file system I have ever seen. It can break even under ideal conditions and lead to complete data loss. There are lots of complaints and bug reports in the Linux kernel bugzilla and on Reddit.
Without providing evidence, this is just unsubstantiated FUD. As with any piece of software, btrfs may also have bugs, but the only known issue which may have implications with regard to data loss is the raid5/6 write hole, which is documented on btrfs' gotchas [1] and status [2] pages. However, there are many other good reasons why raid5/6 configurations should be avoided, with any filesystem. For more detailed explanations see [3] and [4]. Even though these articles are written for ZFS, the drawbacks around raid5/6 apply equally well to other filesystems.
Also, to note, many of the early "issues" and "bug" reports around btrfs were due to user-space utilities such as snapper. I ran into some of these issues myself (specifically an issue with metadata on openSUSE around 2012), which made me very sceptical of btrfs for several years. However, I recently did my research on modern filesystems when setting up a home NAS and came to the conclusion that ZFS and btrfs are the best filesystems currently available. I subsequently opted for ZFS due to the excellent community support and user-space utilities, and I use ZFS not only for the NAS, but also on my Fedora laptop.
I personally like the articles on modern filesystems by Jim Salter, where especially the one from 2014 discusses the advantages of ZFS and btrfs [5]. (That article was an excellent entry point to the topic for me, and is especially well suited for people otherwise not really familiar with this topic.)
Armin
[1] https://btrfs.wiki.kernel.org/index.php/Gotchas
[2] https://btrfs.wiki.kernel.org/index.php/Status
[3] https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/
[4] http://nex7.blogspot.com/2013/03/readme1st.html
[5] https://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cow...
Wow! Is it 2010 already? Time flies! :)
In seriousness: thanks for all of the effort put into this change proposal, and the impressive list of change owners. I'm following the discussion here with much interest!
I couldn't believe it either when I saw the proposal, so 2010-ish :)
Anyway I'm in great favour of this proposal and I'd love to see btrfs the default. I personally use it in all of my systems (desktops, laptops and workstations) except for servers, where it lacks reliability on some of the raid configurations I use (instead I use zfs, which also supports native encryption). I had my share of issues in the early days but it has proven to be extremely reliable lately. My biggest complaint nowadays is this: https://lwn.net/Articles/674865/ Not being able to mount partitions of my Raptor Talos Power 9 on my x86 systems annoys me, but I guess it shouldn't bother many people. On the other hand, I had lots of hardware issues on my Raptor machine and my btrfs hourly snapshots have already saved my day multiple times (the latest one was while upgrading from Fedora 31 to 32). Would love to see it the default, possibly with full grub rollback integration.
On 6/26/20 11:14 AM, niccolo.belli@linuxsystems.it wrote:
I couldn't believe it either when I saw the proposal, so 2010-ish :)
Anyway I'm in great favour of this proposal and I'd love to see btrfs the default.
Glad to hear!
My biggest complaint nowadays is this: https://lwn.net/Articles/674865/ Not being able to mount partitions of my Raptor Talos Power 9 on my x86 systems annoys me, but I guess it shouldn't bother many people. On the other hand, I had lots of hardware issues on my Raptor machine and my btrfs hourly snapshots have already saved my day multiple times (the latest one was while upgrading from Fedora 31 to 32).
Sadly that's the present situation, but I think that's being worked on. As desktop ARM hopefully gains in popularity (between Raspberry Pi 4 being almost there, and Apple about to ship ARM-based Macs, the next couple of years will be interesting), I could imagine the need for mounting partitions created on Intel machines on ARM machines, and vice versa, becoming more important.
From what Josef told me, once the kernel's btrfs driver supports this, existing filesystems would mount fine cross platform.
Regards,
On Fri, Jun 26, 2020 at 11:00 AM Matthew Miller mattdm@fedoraproject.org wrote:
Wow! Is it 2010 already? Time flies! :)
In seriousness: thanks for all of the effort put into this change proposal, and the impressive list of change owners. I'm following the discussion here with much interest!
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
Michael
On Tue, 30 Jun 2020 at 11:09, Michael Catanzaro mcatanzaro@gnome.org wrote:
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
On Tue, Jun 30, 2020 at 11:22 AM Stephen John Smoogen smooge@gmail.com wrote:
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead.
Right, this is basically what I was trying to say here. I think it's all well and good that the proposal has plenty of support, but the fact of the matter is that the Workstation WG is the set of people who will be stuck with maintaining it long-term, so I'd prefer that they at least get to say "Sure, let's do it", or "No way in Hell can we handle that".
On Tuesday, June 30, 2020 8:22:00 AM MST Stephen John Smoogen wrote:
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
On Tue, Jun 30, 2020 at 2:39 PM John M. Harris Jr johnmh@splentity.com wrote:
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
And I am driving this as a member of the KDE SIG, though I am a member of both groups. Both the Workstation WG and KDE SIG are responsible groups for this Change. Chris Murphy is the primary driver of this from the Workstation WG side, and I am from the KDE SIG side.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Tue, Jun 30, 2020 at 1:39 PM John M. Harris Jr johnmh@splentity.com wrote:
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
On Tuesday, June 30, 2020, Justin Forbes jmforbes@linuxtx.org wrote:
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
That argument can be used against any change not restricted to a specific spin. Treating all desktop based spins the same unless there is a reason not to makes sense.
On Tue, Jun 30, 2020 at 4:30 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 1:39 PM John M. Harris Jr johnmh@splentity.com wrote:
On Tuesday, June 30, 2020 8:22:00 AM MST Stephen John Smoogen wrote:
On Tue, 30 Jun 2020 at 11:09, Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
For what it's worth, I asked spin owners from each one before adding them. That's why the change covers them all, they all assented to it. I am doing all the work for it, but I asked for their approval to be covered under this.
Please don't assume such absurd things like that, especially given the list of change owners and listed responsible entities.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Tue, Jun 30, 2020 at 4:02 PM Neal Gompa ngompa13@gmail.com wrote:
On Tue, Jun 30, 2020 at 4:30 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 1:39 PM John M. Harris Jr johnmh@splentity.com wrote:
On Tuesday, June 30, 2020 8:22:00 AM MST Stephen John Smoogen wrote:
On Tue, 30 Jun 2020 at 11:09, Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
For what it's worth, I asked spin owners from each one before adding them. That's why the change covers them all, they all assented to it. I am doing all the work for it, but I asked for their approval to be covered under this.
Please don't assume such absurd things like that, especially given the list of change owners and listed responsible entities.
I honestly hadn't considered it until it came up that the Workstation WG has not come to agreement on this change yet. Either way, it is my belief that the spins should be able to decide what they want to use, when they want to use it. If they have bought in, that's great. From a kernel standpoint, the only change being asked here is to make btrfs inline instead of a module. If it is to become the default fs for any spin, I don't have a problem with that.
On Tue, Jun 30, 2020 at 5:19 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 4:02 PM Neal Gompa ngompa13@gmail.com wrote:
On Tue, Jun 30, 2020 at 4:30 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 1:39 PM John M. Harris Jr johnmh@splentity.com wrote:
On Tuesday, June 30, 2020 8:22:00 AM MST Stephen John Smoogen wrote:
On Tue, 30 Jun 2020 at 11:09, Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
For what it's worth, I asked spin owners from each one before adding them. That's why the change covers them all, they all assented to it. I am doing all the work for it, but I asked for their approval to be covered under this.
Please don't assume such absurd things like that, especially given the list of change owners and listed responsible entities.
I honestly hadn't considered it until it came up that the Workstation WG has not come to agreement on this change yet. Either way, it is my belief that the spins should be able to decide what they want to use, when they want to use it. If they have bought in, that's great. From a kernel standpoint, the only change being asked here is to make btrfs inline instead of a module. If it is to become the default fs for any spin, I don't have a problem with that.
I submitted it because it was agreed to submit it[1]. I would have waited otherwise.
[1]: https://meetbot.fedoraproject.org/teams/workstation/workstation.2020-06-25-0...
On Tue, Jun 30, 2020 at 4:29 PM Neal Gompa ngompa13@gmail.com wrote:
On Tue, Jun 30, 2020 at 5:19 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 4:02 PM Neal Gompa ngompa13@gmail.com wrote:
On Tue, Jun 30, 2020 at 4:30 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 1:39 PM John M. Harris Jr johnmh@splentity.com wrote:
On Tuesday, June 30, 2020 8:22:00 AM MST Stephen John Smoogen wrote:
On Tue, 30 Jun 2020 at 11:09, Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
For what it's worth, I asked spin owners from each one before adding them. That's why the change covers them all, they all assented to it. I am doing all the work for it, but I asked for their approval to be covered under this.
Please don't assume such absurd things like that, especially given the list of change owners and listed responsible entities.
I honestly hadn't considered it until it came up that the Workstation WG has not come to agreement on this change yet. Either way, it is my belief that the spins should be able to decide what they want to use, when they want to use it. If they have bought in, that's great. From a kernel standpoint, the only change being asked here is to make btrfs inline instead of a module. If it is to become the default fs for any spin, I don't have a problem with that.
I submitted it because it was agreed to submit it[1]. I would have waited otherwise.
So it seems the purpose of the proposal was to generate discussion (which it certainly has), but the Workstation WG has not decided what they really want yet. I do get wanting discussion about it. I do not get how it is a proper change request at this point. Seems very much like "We would like to propose a change that we may or may not do", and if the decision is ultimately to not do it, time was wasted.
On Tue, 30 Jun 2020 at 17:09, Neal Gompa ngompa13@gmail.com wrote:
On Tue, Jun 30, 2020 at 4:30 PM Justin Forbes jmforbes@linuxtx.org wrote:
On Tue, Jun 30, 2020 at 1:39 PM John M. Harris Jr johnmh@splentity.com wrote:
On Tuesday, June 30, 2020 8:22:00 AM MST Stephen John Smoogen wrote:
On Tue, 30 Jun 2020 at 11:09, Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
The problem is that the request as discussed reads as "FESCo says use it for workstation" vs "FESCo has no problem with Workstation saying they want btrfs" or "FESCo says use btrfs as default". Yes it says "desktop variants" but only 1 variant really counts and that is Workstation. So yes, either Workstation agrees to it or it isn't getting voted on. If Workstation can't come to an agreement on it, then the proposal is dead. Anything else is an end-run and a useless trolling of people to see how many rants LWN counts in its weekly messages.
Well, it's not only Workstation that this proposal is trying to throw btrfs on, but the other desktops as well, such as KDE Spin.
How is that even a thing? Shouldn't a spin maintainer be responsible for choosing the defaults of their spin? This proposal seems fairly absurd in the regard of dictating what other people should do.
For what it's worth, I asked spin owners from each one before adding them. That's why the change covers them all, they all assented to it. I am doing all the work for it, but I asked for their approval to be covered under this.
Please don't assume such absurd things like that, especially given the list of change owners and listed responsible entities.
The issue isn't that you haven't done your work. It is that it looks like you were set up to fail. The email from Michael comes across that Workstation couldn't make a decision and told you to go see if FESCO would approve it... but even then they don't have to follow through on it because they are independent. So all that work, all the tantrums from people who just love to fly off the handle on anything, all that bull.. is for essentially nothing. Because in the end, if FESCO does approve it, it means every spin etc is stuck with it while Workstation can decide not to... even though they sent you to get the decision. That is where if I was on FESCO I would say this proposal is dead. Either a Working Group wants something and will fight for it, or they don't. If they don't and have veto authority over anything FESCO says.. then it doesn't matter what FESCO decides.
On Tue, Jun 30, 2020 at 7:16 pm, Stephen John Smoogen smooge@gmail.com wrote:
The issue isn't that you haven't done your work. It is that it looks like you were set up to fail. The email from Michael comes across that Workstation couldn't make a decision and told you to go see if FESCO would approve it... but even then they don't have to follow through on it because they are independent. So all that work, all the tantrums from people who just love to fly off the handle on anything, all that bull.. is for essentially nothing. Because in the end, if FESCO does approve it, it means every spin etc is stuck with it while Workstation can decide not to... even though they sent you to get the decision. That is where if I was on FESCO I would say this proposal is dead. Either a Working Group wants something and will fight for it, or they don't. If they don't and have veto authority over anything FESCO says.. then it doesn't matter what FESCO decides.
At this point, we're discussing a weird corner case where FESCo approves this change proposal and then the WG does not. I guess it's my fault for suggesting that might occur, but it's really not a very likely scenario. Reality is that the WG members are not filesystem experts and after several weeks of discussing the issue, it became clear that we need more feedback from a larger group of developers. That's what the systemwide change proposal process is designed for.
And to be clear, FESCo has veto authority over the WG, not the other way around. The WG was actually created by FESCo itself. I think technically we're a subcommittee of FESCo. Of course we certainly expect that we can ship Fedora Workstation with different defaults than the rest of Fedora, to the extent FESCo continues to allow that.
Michael
On Tue, Jun 30, 2020 at 9:03 PM Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 7:16 pm, Stephen John Smoogen smooge@gmail.com wrote:
The issue isn't that you haven't done your work. It is that it looks like you were set up to fail. The email from Michael comes across that Workstation couldn't make a decision and told you to go see if FESCO would approve it... but even then they don't have to follow through on it because they are independent. So all that work, all the tantrums from people who just love to fly off the handle on anything, all that bull.. is for essentially nothing. Because in the end, if FESCO does approve it, it means every spin etc is stuck with it while Workstation can decide not to... even though they sent you to get the decision. That is where if I was on FESCO I would say this proposal is dead. Either a Working Group wants something and will fight for it, or they don't. If they don't and have veto authority over anything FESCO says.. then it doesn't matter what FESCO decides.
At this point, we're discussing a weird corner case where FESCo approves this change proposal and then the WG does not. I guess it's my fault for suggesting that might occur, but it's really not a very likely scenario. Reality is that the WG members are not filesystem experts and after several weeks of discussing the issue, it became clear that we need more feedback from a larger group of developers. That's what the systemwide change proposal process is designed for.
And to be clear, FESCo has veto authority over the WG, not the other way around. The WG was actually created by FESCo itself. I think technically we're a subcommittee of FESCo. Of course we certainly expect that we can ship Fedora Workstation with different defaults than the rest of Fedora, to the extent FESCo continues to allow that.
I think there has been a good deal of miscommunication on all sides (starting with me).
What I was attempting to say in the first place was this: "It's not clear to me that this proposal has the blessing of the Workstation WG or Spins. I'm not willing to *assert* that they must do this work without hearing whether they are willing and have capacity to do so." I think I phrased this poorly initially.
What I would like is just to have a statement added to the Change Proposal that "Workstation WG and the maintainers of Spins Foo, Bar and Baz are willing to make this the default if this Change Proposal is accepted." I just didn't want anyone getting *dictated* at without their input.
On Wed, Jul 1, 2020 at 8:55 AM Stephen Gallagher sgallagh@redhat.com wrote:
On Tue, Jun 30, 2020 at 9:03 PM Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 7:16 pm, Stephen John Smoogen smooge@gmail.com wrote:
The issue isn't that you haven't done your work. It is that it looks like you were set up to fail. The email from Michael comes across that Workstation couldn't make a decision and told you to go see if FESCO would approve it... but even then they don't have to follow through on it because they are independent. So all that work, all the tantrums from people who just love to fly off the handle on anything, all that bull.. is for essentially nothing. Because in the end, if FESCO does approve it, it means every spin etc is stuck with it while Workstation can decide not to... even though they sent you to get the decision. That is where if I was on FESCO I would say this proposal is dead. Either a Working Group wants something and will fight for it, or they don't. If they don't and have veto authority over anything FESCO says.. then it doesn't matter what FESCO decides.
At this point, we're discussing a weird corner case where FESCo approves this change proposal and then the WG does not. I guess it's my fault for suggesting that might occur, but it's really not a very likely scenario. Reality is that the WG members are not filesystem experts and after several weeks of discussing the issue, it became clear that we need more feedback from a larger group of developers. That's what the systemwide change proposal process is designed for.
And to be clear, FESCo has veto authority over the WG, not the other way around. The WG was actually created by FESCo itself. I think technically we're a subcommittee of FESCo. Of course we certainly expect that we can ship Fedora Workstation with different defaults than the rest of Fedora, to the extent FESCo continues to allow that.
I think there has been a good deal of miscommunication on all sides (starting with me).
What I was attempting to say in the first place was this: "It's not clear to me that this proposal has the blessing of the Workstation WG or Spins. I'm not willing to *assert* that they must do this work without hearing whether they are willing and have capacity to do so." I think I phrased this poorly initially.
What I would like is just to have a statement added to the Change Proposal that "Workstation WG and the maintainers of Spins Foo, Bar and Baz are willing to make this the default if this Change Proposal is accepted." I just didn't want anyone getting *dictated* at without their input.
To me, this sounds weird, because the implication of this Change being accepted is that we *would* do this. That's sort of the point of it. The owners of the spins are listed as change owners because I talked to all of them and they all accepted. I even have the pull request ready for Anaconda to make the change as soon as the change is accepted (I'm working on the other bits, kickstarts are complicated...).
I would say that this is redundant with the statement that "the default for new installs shall be btrfs" that is in the Change itself.
Nobody is being forced to do this in the manner I'm guessing you think.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Wed, Jul 1, 2020 at 7:54 AM Stephen Gallagher sgallagh@redhat.com wrote:
On Tue, Jun 30, 2020 at 9:03 PM Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 7:16 pm, Stephen John Smoogen smooge@gmail.com wrote:
The issue isn't that you haven't done your work. It is that it looks like you were set up to fail. The email from Michael comes across that Workstation couldn't make a decision and told you to go see if FESCO would approve it... but even then they don't have to follow through on it because they are independent. So all that work, all the tantrums from people who just love to fly off the handle on anything, all that bull.. is for essentially nothing. Because in the end, if FESCO does approve it, it means every spin etc is stuck with it while Workstation can decide not to... even though they sent you to get the decision. That is where if I was on FESCO I would say this proposal is dead. Either a Working Group wants something and will fight for it, or they don't. If they don't and have veto authority over anything FESCO says.. then it doesn't matter what FESCO decides.
At this point, we're discussing a weird corner case where FESCo approves this change proposal and then the WG does not. I guess it's my fault for suggesting that might occur, but it's really not a very likely scenario. Reality is that the WG members are not filesystem experts and after several weeks of discussing the issue, it became clear that we need more feedback from a larger group of developers. That's what the systemwide change proposal process is designed for.
And to be clear, FESCo has veto authority over the WG, not the other way around. The WG was actually created by FESCo itself. I think technically we're a subcommittee of FESCo. Of course we certainly expect that we can ship Fedora Workstation with different defaults than the rest of Fedora, to the extent FESCo continues to allow that.
I think there has been a good deal of miscommunication on all sides (starting with me).
What I was attempting to say in the first place was this: "It's not clear to me that this proposal has the blessing of the Workstation WG or Spins. I'm not willing to *assert* that they must do this work without hearing whether they are willing and have capacity to do so." I think I phrased this poorly initially.
What I would like is just to have a statement added to the Change Proposal that "Workstation WG and the maintainers of Spins Foo, Bar and Baz are willing to make this the default if this Change Proposal is accepted." I just didn't want anyone getting *dictated* at without their input.
So why not word the proposal "The Workstation WG and maintainers of Spins Foo, Bar, and Baz are free to make btrfs the default file system if they so choose"?
Justin
On Tue, Jun 30, 2020 at 10:00 AM Michael Catanzaro mcatanzaro@gnome.org wrote:
On Tue, Jun 30, 2020 at 10:26 am, Stephen Gallagher sgallagh@redhat.com wrote:
For the record, as this directly affects the Workstation deliverable, I will be voting -1 until and unless the Workstation WG votes in favor.
Yes, it's a large set of Change owners, but since only two of them are Workstation WG members, they are not representative of that group.
Workstation WG hat on:
I don't think there's any need to vote -1 for that reason alone. The Workstation WG has discussed the change proposal at several meetings recently (really, we've spent a long time on this), and frankly we were not making a ton of progress towards reaching a decision either way, so going forward with the change proposal and moving the discussion to devel@ to get feedback from a wider audience and from FESCo seemed like a good idea. Most likely, we'll wind up doing whatever FESCo chooses here, but unless FESCo were to explicitly indicate intent to override the Workstation WG, we would not consider a FESCo decision to limit what the Workstation WG can do with the Workstation product. At least, my understanding of the power structure FESCo has established is that the WG can make product-specific decisions that differ from FESCo's decisions whenever we want, unless FESCo says otherwise (because FESCo always has final say). That is, if FESCo were to approve btrfs by default, but Workstation WG were to vote to stick with ext4, then we would stick with ext4 unless FESCo were to say "no really, you need to switch to btrfs" (which I highly doubt would happen). So I don't see any reason to vote -1 here out of concern for overriding the WG.
As I said earlier when I brought up the kernel stance on this, I very much consider this a Workstation WG decision. If the Workstation WG cannot come to a consensus, I don't think it should be on FESCo to force one. It is not like what is there now is broken. Going from ext4 to Btrfs or any other file system is a series of trade-offs. You gain some features, you lose some features. I would recommend that FESCo not take this up until the Workstation WG can come to a consensus.
Justin
On Fri, Jun 26, 2020 at 10:42:25AM -0400, Ben Cotton wrote:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
So... can btrfs now be trusted to not crap itself?
The change is based on the installer's custom partitioning Btrfs preset. It's been well tested for 7 years.
What does "Well tested" mean, in this context? Do we have data that shows roughly how many installs were done in Fedora-land, and how long they lasted?
(two of the installs in that 7 year period were mine, and ended in complete filesystem loss across clean shutdown/restart cycles. Hardware is still in use, and other than a failed fan, hasn't so much as hiccupped since scrapping btrfs)
- Solomon
On 6/26/20 11:04 AM, Solomon Peachy wrote:
On Fri, Jun 26, 2020 at 10:42:25AM -0400, Ben Cotton wrote:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
So... can btrfs now be trusted to not crap itself?
The change is based on the installer's custom partitioning Btrfs preset. It's been well tested for 7 years.
What does "Well tested" mean, in this context? Do we have data that shows roughly how many installs were done in Fedora-land, and how long they lasted?
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
On Fri, Jun 26, 2020 at 11:15:54AM -0400, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
And, perhaps more crucially, what subset of btrfs features are in use.
(Plus perhaps the underlying hardware; I suspect the server-class hardware facebook uses is a grade above the typical desktop..)
- Solomon
On Fri, 26 Jun 2020 at 11:36, Solomon Peachy pizza@shaftnet.org wrote:
On Fri, Jun 26, 2020 at 11:15:54AM -0400, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
And, perhaps more crucially, what subset of btrfs features are in use.
(Plus perhaps the underlying hardware; I suspect the server-class hardware facebook uses is a grade above the typical desktop..)
Actually the opposite. The Facebook hardware is built at scale on nearly the cheapest it can be. That said, they have a different failure scale than personal hardware: they are OK if entire racks or areas in a DC blow themselves up, because the data is spread out. That is different from a laptop, where the failure means loss of everything since the last backup (aka never). So what is stored on the systems is a different use case, and how it is considered safe is different. That said, they have probably put in a lot more at-scale testing of the filesystem than Fedora could do.
[this is neither an endorsement nor hatred of the proposal.]
On 6/26/20 11:15 AM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
We buy worse hardware than a typical laptop user uses, at least for our hard drives. Also we hit our disks harder than most typical Fedora users. Consider the web tier for example, we push the entire website to every box in the web tier (measured in hundreds of thousands of machines) probably 6-10 times a day. This is roughly 40 GiB of data, getting written to these truly terrible consumer grade flash drives (along with some spinning rust), 6-10 times a day. That is in addition to the normal sort of logging, package updates, etc. that happen.
Also keep in mind we pay really close attention to burn rates for our drives, because obviously at our scale it translates to millions of dollars. Btrfs has improved our burn rates with the compression, as the write amplification goes drastically down, thus extending the life of the drives.
Obviously the Facebook scale, recoverability, and workload is going to be drastically different from a random Fedora user. But hardware wise we are pretty close, at least on the disk side. Thanks,
Josef
On Fri, Jun 26, 2020 at 12:30:35PM -0400, Josef Bacik wrote:
Obviously the Facebook scale, recoverability, and workload is going to be drastically different from a random Fedora user. But hardware wise we are pretty close, at least on the disk side. Thanks,
Thanks. I guess it's really recoverability I'm most concerned with. I expect that if one of these nodes has a metadata corruption that results in an unbootable system, that's really no big deal in the big scheme of things. It's a bigger deal to home users. :)
On 6/26/20 12:43 PM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 12:30:35PM -0400, Josef Bacik wrote:
Obviously the Facebook scale, recoverability, and workload is going to be drastically different from a random Fedora user. But hardware wise we are pretty close, at least on the disk side. Thanks,
Thanks. I guess it's really recoverability I'm most concerned with. I expect that if one of these nodes has a metadata corruption that results in an unbootable system, that's really no big deal in the big scheme of things. It's a bigger deal to home users. :)
Sure, I've answered this a few different times with various members of the working group committee (or whatever they're called nowadays). I'll copy and paste what I said to them. The context is "what do we do with bad drives that blow up at the wrong time".
Now as for what does the average Fedora user do? I've also addressed that a bunch over the last few weeks, but instead of pasting like 9 emails I'll just summarize.
The UX of a completely fucked fs sucks, regardless of the file system. Systemd currently does not (but apparently soon will) handle booting with a read-only file system, which is essentially what you get when you have critical metadata corrupted. You are dumped to an emergency shell, and then you have to know what to do from there.
With ext4/xfs, you mount read only or you run fsck. With Btrfs you can do that too, but then there's like a whole level of other options depending on how bad the disk is. I've written a lot of tools over the years (which are in btrfs-progs) to recover various levels of broken file systems. To the point that you can pretty drastically mess up a FS and I'll still be able to pull data from the disk.
But, again, the UX for this _sucks_. You have to know first of all that you should try mounting read only, and then you have to get something plugged into the box and copy it over. And then assume the worst, you can't mount read only. Now with ext4/xfs that's it, you are done. With btrfs you are just getting started. You have several built in mount options for recovering different failures, all read only. But you have to know that they are there and how to use them.
These things are easily addressed with documentation, but that's only so good. This sort of scenario needs to be baked into Fedora itself, because it's the same problem no matter which file system you use. Thanks,
Josef
Email elaborating my comments about btrfs's sensitivity to bad hardware and how we test.
---------------
The fact is I can make any file system unmountable with the right corruption. The only difference with btrfs is that our metadata is completely dynamic, while xfs and ext4 are less so. So they're overwriting the same blocks over and over again, and there is simply less of "important" metadata for the file system to function.
The "problem" that btrfs has is its main strength: it does COW. That means our important metadata is constantly being re-written to different segments of the disk. So if you have a bad disk, you are much more likely to get unlucky and end up with some core piece of metadata getting corrupted, and thus resulting in a file system that cannot be mounted read/write.
Now you are much more likely to hit this in a data segment, because generally speaking there's more data writes than metadata writes. The thing I brought up in the meeting last week was a potential downside for sure, but not something that will be a common occurrence. I just checked the fleet for this week and we've had to reprovision 20 machines out of 138 machines that threw crc errors, out of N total machines with btrfs fs'es, which is in the millions. In the same time period I have 15 xfs boxes that needed to be reprovisioned because of metadata corruption, out of <100k machines that have xfs. I don't have data on ext4 because it doesn't exist in our fleet anymore.
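To put the two counts above on the same footing, here is a rough back-of-the-envelope sketch. The exact fleet sizes are not given in the message, so the numbers used for them are assumptions: 1,000,000 is taken as a lower bound for "in the millions" and 100,000 as an upper bound for "<100k". It covers a single week, so it should be read as an illustration of the arithmetic, not as a measured failure rate.

```python
# Rough per-machine reprovision rates from the figures quoted above.
# ASSUMPTION: the btrfs fleet size N is only described as "in the
# millions", so 1_000_000 is used as a lower bound on its size; the xfs
# fleet is "<100k", so 100_000 is an upper bound on its size.

def rate_per_100k(failures: int, machines: int) -> float:
    """Return failures per 100,000 machines."""
    return failures / machines * 100_000

# A larger btrfs fleet would only lower this number, so it is an upper bound.
btrfs_rate = rate_per_100k(20, 1_000_000)
# A smaller xfs fleet would only raise this number, so it is a lower bound.
xfs_rate = rate_per_100k(15, 100_000)

print(f"btrfs: at most {btrfs_rate:.1f} reprovisions per 100k machines")
print(f"xfs:   at least {xfs_rate:.1f} reprovisions per 100k machines")
```

Under those assumed bounds, the two rates are within an order of magnitude of each other and both are tiny relative to the fleet size.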
As for testing, there are 8 tests in xfstests that utilize my dm-log-writes target. These tests mount the file system, do a random workload, and then replay the workload one write at a time to validate the file system isn't left in some intermediate broken state. This simulates the case of weird things happening but in a much more concrete and repeatable manner.
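The replay idea described above can be sketched as a toy model. This is not the dm-log-writes target or the xfstests harness; the function names, the block layout, and the "superblock points at a root block" invariant are all made up for illustration.

```python
# Toy model of the replay idea: record every write, then rebuild the
# "disk" one write at a time and check an invariant after each step.
# All names here are hypothetical; this is not the dm-log-writes API.
from typing import Callable

Write = tuple[int, bytes]          # (block number, block contents)
Disk = dict[int, bytes]

def replay_and_check(log: list[Write],
                     is_consistent: Callable[[Disk], bool]) -> list[int]:
    """Replay the log one write at a time; return the indexes of writes
    after which the disk state fails the consistency check."""
    disk: Disk = {}
    bad_steps = []
    for i, (block, data) in enumerate(log):
        disk[block] = data
        if not is_consistent(disk):
            bad_steps.append(i)
    return bad_steps

# Hypothetical invariant: block 0 is a "superblock" naming a root block,
# and that root block must already be on disk.
def superblock_points_to_written_root(disk: Disk) -> bool:
    if 0 not in disk:
        return True                # nothing committed yet
    root = int(disk[0].decode())
    return root in disk

# COW-style ordering: write the new root first, then flip the superblock.
good_log = [(7, b"root v1"), (0, b"7"), (9, b"root v2"), (0, b"9")]
# Broken ordering: superblock updated before the new root is written.
bad_log = [(7, b"root v1"), (0, b"7"), (0, b"9"), (9, b"root v2")]

print(replay_and_check(good_log, superblock_points_to_written_root))
print(replay_and_check(bad_log, superblock_points_to_written_root))
```

The point of replaying one write at a time is that every intermediate state gets checked, so an ordering bug that only exists for the duration of a single write still shows up deterministically.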
There's 65 tests that utilize dm-flakey, which randomly corrupts or drops writes, and again these are to test different scenarios that have given us issues in the past. There's more of these because up until a few years ago this was our only mechanism for testing this class of failures. I wrote dm-log-writes to bring some determinism to our testing.
All of our file systems in linux are extremely thoroughly tested for a variety of power fail cases. The only area that btrfs is more likely to screw up is in the case of bad hardware, and again we're not talking like huge percentage points difference. It's a trade off. You are trading a slight increased percentage that bad hardware will result in a file system that cannot be mounted read/write for the ability to detect silent corruption from your memory, cpu, or storage device. Thanks,
Josef
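The "detect silent corruption" side of that trade-off can be illustrated with a minimal sketch. It assumes a plain CRC-32 from Python's zlib as a stand-in checksum (Btrfs's default is crc32c, which is a different polynomial), and the helper names are hypothetical.

```python
# Sketch of checksum-on-read. zlib.crc32 (plain CRC-32) stands in for
# the file system's checksum; the function names are hypothetical.
import zlib

def write_block(data: bytes) -> tuple[bytes, int]:
    """Store data together with a checksum computed at write time."""
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_csum: int) -> bytes:
    """Return data only if it still matches its stored checksum."""
    if zlib.crc32(data) != stored_csum:
        raise IOError("checksum mismatch: silent corruption detected")
    return data

block, csum = write_block(b"important metadata")
# Simulate a device that flips one bit but reports a successful read.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

read_block(block, csum)            # intact data passes
try:
    read_block(corrupted, csum)
except IOError as err:
    print(err)
```

A file system without data checksums would hand the corrupted bytes back to the application as if nothing had happened.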
On Friday, June 26, 2020 at 12:30 -0400, Josef Bacik wrote:
On 6/26/20 11:15 AM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
We buy worse hardware than a typical laptop user uses, at least for our hard drives.
The difference between an operation like Facebook and the Fedora user base is that Facebook will have a huge fleet of crap hardware, with the support teams to baby-sit the crap hardware, and attention to reducing the variety of crap hardware to limit the support matrix breadth, while Fedora has to deal with a huge support matrix breadth, without the support teams and the support team tooling to baby-sit hardware. (Besides, Facebook designs the levels of crappiness they allow in their hardware, meaning they know exactly where they are pushing limits to lower hardware costs.)
And, it’s not always the crap hardware that hits bugs. Sometimes expensive gamer hardware will fail first because its manufacturer has pushed the limits to eke some performance points over the competition.
Therefore, using btrfs in Fedora is inherently more ambitious than using it at Facebook.
Regards,
On 6/27/20 2:57 AM, Nicolas Mailhot via devel wrote:
On Friday, June 26, 2020 at 12:30 -0400, Josef Bacik wrote:
On 6/26/20 11:15 AM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
We buy worse hardware than a typical laptop user uses, at least for our hard drives.
The difference between an operation like Facebook and the Fedora user base is that Facebook will have a huge fleet of crap hardware, with the support teams to baby-sit the crap hardware, and attention to reducing the variety of crap hardware to limit the support matrix breadth, while Fedora has to deal with a huge support matrix breadth, without the support teams and the support team tooling to baby-sit hardware. (Besides, Facebook designs the levels of crappiness they allow in their hardware, meaning they know exactly where they are pushing limits to lower hardware costs.)
And, it’s not always the crap hardware that hits bugs. Sometimes expensive gamer hardware will fail first because its manufacturer has pushed the limits to eke some performance points over the competition.
Therefore, using btrfs in Fedora is inherently more ambitious than using it at Facebook.
I've been very clear from the outset that Facebook's fault tolerance is much higher than the average Fedora user. The only reason I've agreed to assist in answering questions and support this proposal is because I have multi-year data that shows our failure rates are the same that we see on every other file system, which is basically the failure rate of the disks themselves.
And I specifically point out the hardware that we use that most closely reflects the drives that an average Fedora user is going to have. We of course have a very wide variety of hardware. In fact the very first thing we deployed on were these expensive hardware RAID setups. Btrfs found bugs in that firmware that was silently corrupting data. These corruptions had been corrupting AI test data for years under XFS, and Btrfs found it in a matter of days because of our checksumming.
We use all sorts of hardware, and have all sorts of similar stories like this. I agree that the hardware is going to be muuuuuch more varied with Fedora users, and that Facebook has muuuuch higher fault tolerance. But higher production failures inside FB means more engineering time spent dealing with those failures, which translates to lost productivity. If btrfs was causing us to run around fixing it all the time then we wouldn't deploy it. The fact is that it's not, it's perfectly stable from our perspective. Thanks,
Josef
I've been very clear from the outset that Facebook's fault tolerance is much higher than the average Fedora user. The only reason I've agreed to assist in answering questions and support this proposal is because I have multi-year data that shows our failure rates are the same that we see on every other file system, which is basically the failure rate of the disks themselves.
And I specifically point out the hardware that we use that most closely reflects the drives that an average Fedora user is going to have. We of course have a very wide variety of hardware. In fact the very first thing we deployed on were these expensive hardware RAID setups. Btrfs found bugs in that firmware that was silently corrupting data. These corruptions had been corrupting AI test data for years under XFS, and Btrfs found it in a matter of days because of our checksumming.
We use all sorts of hardware, and have all sorts of similar stories like this. I agree that the hardware is going to be muuuuuch more varied with Fedora users, and that Facebook has muuuuch higher fault tolerance. But higher production failures inside FB means more engineering time spent dealing with those failures, which translates to lost productivity. If btrfs was causing us to run around fixing it all the time then we wouldn't deploy it. The fact is that it's not, it's perfectly stable from our perspective. Thanks,
Thanks for the details. Do you have any data/information/opinions on non-x86 architectures such as aarch64/armv7/ppc64le, all of which have supported desktops too?
Peter
On 6/27/20 9:57 AM, Peter Robinson wrote:
I've been very clear from the outset that Facebook's fault tolerance is much higher than the average Fedora user. The only reason I've agreed to assist in answering questions and support this proposal is because I have multi-year data that shows our failure rates are the same that we see on every other file system, which is basically the failure rate of the disks themselves.
And I specifically point out the hardware that we use that most closely reflects the drives that an average Fedora user is going to have. We of course have a very wide variety of hardware. In fact the very first thing we deployed on were these expensive hardware RAID setups. Btrfs found bugs in that firmware that was silently corrupting data. These corruptions had been corrupting AI test data for years under XFS, and Btrfs found it in a matter of days because of our checksumming.
We use all sorts of hardware, and have all sorts of similar stories like this. I agree that the hardware is going to be muuuuuch more varied with Fedora users, and that Facebook has muuuuch higher fault tolerance. But higher production failures inside FB means more engineering time spent dealing with those failures, which translates to lost productivity. If btrfs was causing us to run around fixing it all the time then we wouldn't deploy it. The fact is that it's not, it's perfectly stable from our perspective. Thanks,
Thanks for the details. Do you have any data/information/opinions on non-x86 architectures such as aarch64/armv7/ppc64le, all of which have supported desktops too?
I can't speak to ppc* at all, and I'm not sure how much I can talk about our arm stuff, but it was tested and used in production on arm a few years ago. But obviously the bulk of our workload is x86. Thanks,
Josef
On Sat, Jun 27, 2020 at 7:58 AM Peter Robinson pbrobinson@gmail.com wrote:
I've been very clear from the outset that Facebook's fault tolerance is much higher than the average Fedora user. The only reason I've agreed to assist in answering questions and support this proposal is because I have multi-year data that shows our failure rates are the same that we see on every other file system, which is basically the failure rate of the disks themselves.
And I specifically point out the hardware that we use that most closely reflects the drives that an average Fedora user is going to have. We of course have a very wide variety of hardware. In fact the very first thing we deployed on were these expensive hardware RAID setups. Btrfs found bugs in that firmware that was silently corrupting data. These corruptions had been corrupting AI test data for years under XFS, and Btrfs found it in a matter of days because of our checksumming.
We use all sorts of hardware, and have all sorts of similar stories like this. I agree that the hardware is going to be muuuuuch more varied with Fedora users, and that Facebook has muuuuch higher fault tolerance. But higher production failures inside FB means more engineering time spent dealing with those failures, which translates to lost productivity. If btrfs was causing us to run around fixing it all the time then we wouldn't deploy it. The fact is that it's not, it's perfectly stable from our perspective. Thanks,
Thanks for the details. Do you have any data/information/opinions on non-x86 architectures such as aarch64/armv7/ppc64le, all of which have supported desktops too?
Sample size of 1: Raspberry Pi Zero running Arch for ~ a year. I use mount option -o compress=zstd:1. I haven't benchmarked it, it's a Pi Zero so it's slow no matter what file system is used. But anecdotally I can't tell a difference enough to even speculate.
This is a bit of an overly verbose mess, but the takeaway is that at least for /usr I'm saving about 41%. Space and writes.
$ sudo compsize /usr
Processed 48038 files, 28473 regular extents (28757 refs), 25825 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       59%      879M         1.4G         1.4G
none       100%      435M         435M         435M
lzo         54%      153M         281M         287M
zstd        37%      289M         767M         786M
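As a sanity check on the "about 41%" figure, the per-type rows of the compsize output above can be summed by hand. This is only a sketch of the arithmetic: the sizes are the already-rounded values that compsize printed, so the results are approximate.

```python
# Recompute the compsize percentages from the per-type rows quoted above.
# Sizes are the rounded "M" values as printed, so results are approximate.
rows = {
    "none": (435, 435),   # (disk usage, uncompressed)
    "lzo":  (153, 281),
    "zstd": (289, 767),
}

disk_total = sum(disk for disk, _ in rows.values())
uncompressed_total = sum(unc for _, unc in rows.values())
ratio = disk_total / uncompressed_total

for name, (disk, unc) in rows.items():
    print(f"{name:5s} {disk / unc:6.1%}")
print(f"TOTAL {ratio:6.1%}  saved {1 - ratio:.1%}")
```

The summed rows give 877M on disk for 1483M of uncompressed data, which matches the TOTAL line to within rounding and works out to roughly 41% saved.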
I could instead selectively compress just certain directories or files, using an XATTR (there is a btrfs command for setting it). Compression can also be applied after the fact by defragmenting with a compression option.
I think the reduction in write amplification in this use case is significant because SD cards are just so impressively terrible. I have only ever seen them return garbage rather than the device itself admit a read error (UNC read error), and btrfs will catch that. I seriously would only ever use btrfs for this. I might consider another file system if I were using industrial SD cards, but *shrug* in that case I'd probably spend a bit more time benchmarking things and seeing if I can squeak out a bit more performance from lzo or zstd:1 on reads due to a reduction in IO latency. Because SLC is going to be slower than TLC or anything else.
I don't know much about eMMC media, but if it's a permanent resident on the board, all the more reason I'd use btrfs and compress everything. I *might* even consider changing the compression level to something more aggressive for updates because the performance limitation isn't the compression hit, but rather the internet bandwidth. This is as simple as 'mount -o remount,compress=zstd:9 /' and then do the update - and upon reboot it's still zstd:1 or whatever is in fstab/systemd mount unit. A future feature might be to add level to the existing XATTR method of setting compression per dir or per file. So you could indicate things like "always use heavier compression" for specific dirs.
On Fri, Jun 26, 2020 at 6:30 PM Josef Bacik josef@toxicpanda.com wrote:
On 6/26/20 11:15 AM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 11:13:39AM -0400, Josef Bacik wrote:
Not Fedora land, but Facebook installs it on all of our root devices, so millions of machines. We've done this for 5 years. It's worked out very well. Thanks,
Josef, I'd love to hear your comments on any differences between that situation and the typical laptop-user case for Fedora desktop systems. Anything we should consider?
We buy worse hardware than a typical laptop user uses, at least for our hard drives. Also we hit our disks harder than most typical Fedora users. Consider the web tier for example, we push the entire website to every box in the web tier (measured in hundreds of thousands of machines) probably 6-10 times a day. This is roughly 40 GiB of data, getting written to these truly terrible consumer grade flash drives (along with some spinning rust), 6-10 times a day. That is in addition to the normal sort of logging, package updates, etc. that happen.
Also keep in mind we pay really close attention to burn rates for our drives, because obviously at our scale it translates to millions of dollars. Btrfs has improved our burn rates with the compression, as the write amplification goes drastically down, thus extending the life of the drives.
Hi Josef,
Out of curiosity, do you also monitor SMART data for all your hard drives? If yes, have you seen any correlations between specific errors reported by btrfs and those picked up by SMART (not necessarily the fatal ones)? Any useful conclusions?
Best regards, A.
Why not zfs?
On 6/26/2020 10:42 AM, Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/BtrfsByDefault
== Summary ==
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
== Owners ==
- Names: [[User:Chrismurphy|Chris Murphy]], [[User:Ngompa|Neal Gompa]], [[User:Josef|Josef Bacik]], [[User:Salimma|Michel Alexandre Salim]], [[User:Dcavalca|Davide Cavalca]], [[User:eeickmeyer|Erich Eickmeyer]], [[User:ignatenkobrain|Igor Raits]], [[User:Raveit65|Wolfgang Ulbrich]], [[User:Zsun|Zamir SUN]], [[User:rdieter|Rex Dieter]], [[User:grinnz|Dan Book]], [[User:nonamedotc|Mukundan Ragavan]]
- Emails: chrismurphy@fedoraproject.org, ngompa13@gmail.com, josef@toxicpanda.com, michel@michel-slm.name, dcavalca@fb.com, erich@ericheickmeyer.com, ignatenkobrain@fedoraproject.org, fedora@raveit.de, zsun@fedoraproject.org, rdieter@gmail.com, grinnz@gmail.com, nonamedotc@gmail.com
- Products: All desktop editions, spins, and labs
- Responsible WGs: Workstation Working Group, KDE Special Interest Group
== Detailed Description ==
Fedora desktop edition/spin variants will switch to using Btrfs as the filesystem by default for new installs. Labs derived from these variants inherit this change, and other editions may opt into this change.
The change is based on the installer's custom partitioning Btrfs preset. It's been well tested for 7 years.
'''''Current partitioning'''''<br /> <span style="color: tomato">vg/root</span> LV mounted at <span style="color: tomato">/</span> and a <span style="color: tomato">vg/home</span> LV mounted at <span style="color: tomato">/home</span>. These are separate file system volumes, with separate free/used space.
'''''Proposed partitioning'''''<br /> <span style="color: tomato">root</span> subvolume mounted at <span style="color: tomato">/</span> and <span style="color: tomato">home</span> subvolume mounted at <span style="color: tomato">/home</span>. Subvolumes don't have size, they act mostly like directories, space is shared.
'''''Unchanged'''''<br /> <span style="color: tomato">/boot</span> will be a small ext4 volume. A separate boot is needed to boot dm-crypt sysroot installations; it's less complicated to keep the layout the same, regardless of whether sysroot is encrypted. There will be no automatic snapshots/rollbacks.
If you choose to encrypt your data, LUKS (dm-crypt) will still be used as it is today (with the small difference that Btrfs is used instead of LVM+Ext4). There is upstream work on native encryption for Btrfs; it will be considered once ready and is the subject of a different change proposal in a future Fedora release.
=== Optimizations (Optional) ===
The detailed description above is the proposal. It's intended to be a minimalist and transparent switch. It's also the same as was [[Features/F16BtrfsDefaultFs|proposed]] (and [https://lwn.net/Articles/446925/ accepted]) for Fedora 16. The following optimizations improve on the proposal, but are not critical. They are also transparent to most users. The general idea is to agree to the base proposal first, and then consider these as enhancements.
==== Boot on Btrfs ====
- Instead of a 1G ext4 boot, create a 1G Btrfs boot.
- Advantage: Makes it possible to include in a snapshot and rollback regime. GRUB has stable support for Btrfs for 10+ years.
- Scope: Contingent on bootloader and installer team review and approval. blivet should use <code>mkfs.btrfs --mixed</code>.
==== Compression ====
- Enable transparent compression using zstd on select directories: <span style="color: tomato">/usr</span> <span style="color: tomato">/var/lib/flatpak</span> <span style="color: tomato">~/.local/share/flatpak</span>
- Advantage: Saves space and significantly increases the lifespan of flash-based media by reducing write amplification. It may improve performance in some instances.
- Scope: Contingent on installer team review and approval to enhance anaconda to perform the installation using <code>mount -o compress=zstd</code>, then set the proper XATTR for each directory. The XATTR can't be set until after the directories are created via: rsync, rpm, or unsquashfs based installation.
==== Additional subvolumes ====
- <span style="color: tomato">/var/log/</span> <span style="color: tomato">/var/lib/libvirt/images</span> and <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> will use separate subvolumes.
- Advantage: Makes it easier to exclude them from snapshots, rollbacks, and send/receive. (Btrfs snapshotting is not recursive, it stops at a nested subvolume.)
- Scope: Anaconda knows how to do this already, just change the kickstart to add additional subvolumes (minus the subvolume in <span style="color: tomato">~/</span>). GNOME Boxes will need enhancement to detect that the user home is on Btrfs and create <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> as a subvolume.
== Feedback ==
==== Red Hat doesn't support Btrfs? Can Fedora do this? ====
Red Hat supports Fedora well, in many ways. But Fedora already works closely with, and depends on, upstreams. And this will be one of them. That's an important consideration for this proposal. The community has a stake in ensuring it is supported. Red Hat will never support Btrfs if Fedora rejects it. Fedora necessarily needs to be first, and make the persuasive case that it solves more problems than alternatives. Feature owners believe it does, hands down.
The Btrfs community has users that have been using it for most of the past decade at scale. It's been the default on openSUSE (and SUSE Linux Enterprise) since 2014, and Facebook has been using it for all their OS and data volumes, in their data centers, for almost as long. Btrfs is a mature, well-understood, and battle-tested file system, used on both desktop/container and server/cloud use-cases. We do have developers of the Btrfs filesystem maintaining and supporting the code in Fedora, one is a Change owner, so issues that are pinned to Btrfs can be addressed quickly.
==== What about device-mapper alternatives? ====
dm-thin (thin provisioning): [https://pagure.io/fedora-workstation/issue/152 Issue #152] still happens, because the installer won't over provision by default. It still requires manual intervention by the user to identify and resolve the problem. Upon growing a file system on dm-thin, the pool is over committed, and file system sizes become a fantasy: they don't add up to the total physical storage available. The truth of used and free space is only known by the thin pool, and CLI and GUI programs are unprepared for this. Integration points like rpm free space checks or GNOME disk-space warnings would have to be adapted as well.
dm-vdo: is not yet merged, and isn't as straightforward to selectively enable per directory and per file, as is the case on Btrfs using <code>chattr +c</code> on <span style="color: tomato">/var/lib/flatpak/</span>.
Btrfs solves the problems that need solving, with few side effects or pitfalls for users. It has more features we can take advantage of immediately and transparently: compression, integrity, and IO isolation. Many Btrfs features and optimizations can be opted into selectively per directory or file, such as compression and nodatacow, rather than as a layer that's either on or off.
==== What about UI/UX and integration in the desktop? ====
If Btrfs isn't the default file system, there's no commitment, nor reason to work on any UI/UX integration. There are ideas to make certain features discoverable: selective compression; systemd-homed may take advantage of either Btrfs online resize, or near-term planned native encryption, which could make it possible to live convert non-encrypted homes to encrypted; and system snapshot and rollbacks.
Anaconda already has sophisticated Btrfs integration.
==== What Btrfs features are recommended and supported? ====
The primary goal of this feature is to be largely transparent to the user. It does not require or expect users to learn new commands, or to engage in peculiar maintenance rituals.
The full set of Btrfs features that is considered stable and enabled by default upstream will be enabled in Fedora. Fedora is a community project. What is supported within Fedora depends on what the community decides to put forward in terms of resources.
See the upstream [https://btrfs.wiki.kernel.org/index.php/Status Btrfs feature status page].
==== Are subvolumes really mostly like directories? ====
Subvolumes behave like directories in terms of navigation in both the GUI and CLI, e.g. <code>cp</code>, <code>mv</code>, <code>du</code>, owner/permissions, and SELinux labels. They also share space, just like a directory.
But it is an incomplete answer.
A subvolume is an independent file tree, with its own POSIX namespace, and has its own pool of inodes. This means inode numbers repeat themselves on a Btrfs volume. Inodes are only unique within a given subvolume. A subvolume has its own st_dev, so if you use <code>stat FILE</code> it reports a device value referring to the subvolume the file is in. And it also means hard links can't be created between subvolumes. From this perspective, subvolumes start looking more like a separate file system. But subvolumes share most of the other trees, so they're not truly independent file systems. They're also not block devices.
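A small Python sketch shows how a program can observe this. It relies only on <code>os.stat</code> and <code>os.link</code>; the helper names are made up for illustration, and the demo runs inside a single temporary directory, so it exercises the "same tree" case. Pointing the helpers at paths in two different subvolumes should show differing <code>st_dev</code> values and a refused hard link.

```python
import errno
import os
import tempfile


def same_file_tree(path_a, path_b):
    """True if both paths report the same st_dev (on Btrfs: the same subvolume)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def try_hard_link(src, dst):
    """Attempt a hard link; return False if the kernel refuses with EXDEV."""
    try:
        os.link(src, dst)
        return True
    except OSError as err:
        if err.errno == errno.EXDEV:  # "Invalid cross-device link"
            return False
        raise


def demo():
    """Create two names for one file inside a single directory tree."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "a.txt")
        dst = os.path.join(tmp, "b.txt")
        with open(src, "w") as fh:
            fh.write("hello")
        same = same_file_tree(src, tmp)
        linked = try_hard_link(src, dst)
        same_inode = linked and os.stat(src).st_ino == os.stat(dst).st_ino
        return same, linked, same_inode


print(demo())
```

Tools that need a unique file identity should therefore compare the <code>(st_dev, st_ino)</code> pair rather than the inode number alone.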
== Benefit to Fedora ==
Problems Btrfs helps solve:
* Users running out of free space on either <span style="color: tomato">/</span> or <span style="color: tomato">/home</span> [https://pagure.io/fedora-workstation/issue/152 Workstation issue #152]
** "one big file system": no hard barriers like partitions or logical volumes
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
** reflinks and snapshots are more efficient for use cases like containers (Podman supports both)
* Storage devices can be flaky, resulting in data corruption
** Everything is checksummed and verified on every read
** Corrupt data results in EIO (input/output error), instead of resulting in application confusion, and isn't replicated into backups and archives
* Poor desktop responsiveness when under pressure [https://pagure.io/fedora-workstation/issue/154 Workstation issue #154]
** Currently only Btrfs has proper IO isolation capability via cgroups2
** Completes the resource control picture: memory, cpu, IO isolation
* File system resize
** Online shrink and grow are fundamental to the design
* Complex storage setups are... complicated
** Simple and comprehensive command interface. One master command
** Simpler to boot, all code is in the kernel, no initramfs complexities
** Simple and efficient file system replication, including incremental backups, with <code>btrfs send</code> and <code>btrfs receive</code>
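The "checksummed and verified on every read" behavior listed above can be illustrated with a toy in-memory store. This is a conceptual sketch, not how Btrfs is implemented: the <code>ChecksummedStore</code> class is hypothetical, and it uses <code>zlib.crc32</code> as a stand-in for the checksum algorithms Btrfs actually offers (see <code>man 5 btrfs</code>).

```python
import errno
import zlib


class ChecksummedStore:
    """Toy block store that keeps a checksum next to every block."""

    def __init__(self):
        self._blocks = {}  # block number -> (data, checksum)

    def write(self, block_no, data):
        self._blocks[block_no] = (bytearray(data), zlib.crc32(data))

    def read(self, block_no):
        data, expected = self._blocks[block_no]
        # Verify on every read; never hand corrupt data to the caller.
        if zlib.crc32(bytes(data)) != expected:
            raise OSError(errno.EIO, "checksum mismatch")
        return bytes(data)

    def corrupt(self, block_no):
        """Simulate a flaky device by flipping one bit of the stored data."""
        self._blocks[block_no][0][0] ^= 0x01


store = ChecksummedStore()
store.write(0, b"important data")
print(store.read(0))
store.corrupt(0)
try:
    store.read(0)
except OSError as err:
    print("read failed:", err)
```

The point of the design is that a backup tool reading the corrupted block gets an error it can report, instead of silently copying bad data into the archive.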
== Scope ==
* Proposal owners:
** Submit PRs for Anaconda to change <code>default_scheme = BTRFS</code> in the proper product files.
** Multiple test days: build community support network
** Aid with documentation
* Other developers:
** Anaconda, review PRs and merge
** Bootloader team, review PRs and merge
** Recommended optimization: <code>chattr +C</code> set on the containing directory for virt-manager and GNOME Boxes.
Release engineering: [https://pagure.io/releng/issue/9545 #9545]
Policies and guidelines: N/A
Trademark approval: N/A
== Upgrade/compatibility impact ==
Change will not affect upgrades.
Documentation will be provided for existing Btrfs users to "retrofit" their setups to that of a default Btrfs installation (base plus any approved options).
== How To Test ==
'''''Today'''''<br /> Do a custom partitioning installation; change the scheme drop-down menu to Btrfs; click the blue "automatically create partitions"; and install.<br /> Fedora 31, 32, Rawhide, on x86_64 and ARM.
'''''Once change lands'''''<br /> Testing should be simple: just do a normal install.
== User Experience ==
==== Pros ====
* Mostly transparent
* Space savings from compression
* Longer lifespan of hardware, also from compression.
* Utilities for used and free space, CLI and GUI, are expected to behave the same. No special commands are required.
* More detailed information can be revealed by <code>btrfs</code> specific commands.
==== Enhancement opportunities ====
[https://bugzilla.redhat.com/show_bug.cgi?id=906591 updatedb does not index /home when /home is a bind mount] This can also affect rpm-ostree installations, including Silverblue.
[https://gitlab.gnome.org/GNOME/gnome-usage/-/issues/49 GNOME Usage: Incorrect numbers when using multiple btrfs subvolumes] This isn't Btrfs specific, happens with "one big ext4" volume as well.
[https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/88 GNOME Boxes, RFE: create qcow2 with 'nocow' option when on btrfs /home] This is Btrfs specific, and is a recommended optimization for both GNOME Boxes and virt-manager.
[https://github.com/containers/libpod/issues/6563 containers/libpod: automatically use btrfs driver if on btrfs]
== Dependencies ==
None.
== Contingency Plan ==
Contingency mechanism: Owner will revert changes back to LVM+ext4
Contingency deadline: Beta freeze
Blocks release? Yes
Blocks product? Workstation and KDE
== Documentation ==
Strictly speaking no documentation is required reading for users. But there will be some Fedora documentation to help get the ball rolling.
For those who want to know more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page btrfs wiki main page and full feature list.]
<code>man 5 btrfs</code> contains: mount options, features, swapfile support, checksum algorithms, and more<br /> <code>man btrfs</code> contains an overview of the btrfs subcommands<br /> <code>man btrfs <nowiki><subcommand></nowiki></code> will show the man page for that subcommand
NOTE: The btrfs command will accept partial subcommands, as long as it's not ambiguous. These are equivalent commands:<br /> <code>btrfs subvolume snapshot</code><br /> <code>btrfs sub snap</code><br /> <code>btrfs su sn</code>
You'll discover your own convention. It might be preferable to write out the full command on forums and lists, but then maybe some folks don't learn about this useful shortcut?
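The idea of accepting any unambiguous prefix can be sketched as follows. This is not the btrfs-progs implementation, and the <code>SUBCOMMANDS</code> table is a deliberately partial, hypothetical subset; it only shows why <code>su sn</code> resolves while a bare <code>s</code> cannot.

```python
# Partial, illustrative command table (not the full btrfs command set).
SUBCOMMANDS = {
    "subvolume": ["create", "delete", "list", "snapshot", "show"],
    "scrub": ["start", "status", "cancel"],
    "send": [],
}


def resolve(token, candidates):
    """Return the single candidate that equals or starts with token."""
    if token in candidates:
        return token
    matches = [c for c in candidates if c.startswith(token)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {token!r}")
    raise ValueError(f"ambiguous command {token!r}: {', '.join(matches)}")


def expand(args):
    """Expand an abbreviated [group, subcommand] pair to full names."""
    group = resolve(args[0], list(SUBCOMMANDS))
    sub = resolve(args[1], SUBCOMMANDS[group])
    return [group, sub]


print(expand(["su", "sn"]))
print(expand(["sub", "snap"]))
```

With this table, both abbreviated forms expand to <code>subvolume snapshot</code>, while <code>expand(["s", "sn"])</code> raises an error because <code>s</code> matches more than one command group.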
For those who want to know a lot more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation Btrfs developer documentation]<br /> [https://github.com/btrfs/btrfs-dev-docs/blob/master/trees.txt Btrfs trees]
== Release Notes ==
The default file system on the desktop is Btrfs.
On Fri, Jun 26, 2020 at 11:15:24AM -0400, Michael Watters wrote:
> Why not zfs?
We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
On Friday, June 26, 2020 8:22:49 AM MST Matthew Miller wrote:
> On Fri, Jun 26, 2020 at 11:15:24AM -0400, Michael Watters wrote:
> > Why not zfs?
> We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
On Sun, 28 Jun 2020 09:59:52 -0700, you wrote:
> Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
For a start they aren't a US company, and unlike Red Hat they aren't the same tempting target for a lawsuit.
On Sunday, June 28, 2020 5:14:14 PM MST Gerald Henriksen wrote:
> On Sun, 28 Jun 2020 09:59:52 -0700, you wrote:
> > Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
> For a start they aren't a US company, and unlike Red Hat they aren't the same tempting target for a lawsuit.
I fail to see how being a US company or not would have much bearing on this. As for being a "tempting target", they're both big tech companies providing a Linux distro as their primary product, working on a support model.
On Sun, Jun 28, 2020 at 09:59:52AM -0700, John M. Harris Jr wrote:
> > We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
> Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
I can't really speculate on Canonical's legal stance and I encourage everyone else to also not.
I can point to Red Hat's, though: the knowledge base article here https://access.redhat.com/solutions/79633 says:
* ZFS is not included in the upstream Linux kernel due to licensing reasons.
* Red Hat applies the upstream first policy for kernel modules (including filesystems). Without upstream presence, kernel modules like ZFS cannot be supported by Red Hat.
and "due to licensing reasons" links to https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/ which is quite interesting and quite long. If you have just time to read one section, the two paragraphs at the end under "Do Not Rely On This Document As Legal Advice" seem like the _most_ interesting to me.
On Monday, June 29, 2020 9:26:09 AM MST Matthew Miller wrote:
> On Sun, Jun 28, 2020 at 09:59:52AM -0700, John M. Harris Jr wrote:
> > > We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
> > Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
> I can't really speculate on Canonical's legal stance and I encourage everyone else to also not.
> I can point to Red Hat's, though: the knowledge base article here https://access.redhat.com/solutions/79633 says:
> - ZFS is not included in the upstream Linux kernel due to licensing reasons.
> - Red Hat applies the upstream first policy for kernel modules (including filesystems). Without upstream presence, kernel modules like ZFS cannot be supported by Red Hat.
> and "due to licensing reasons" links to https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/ which is quite interesting and quite long. If you have just time to read one section, the two paragraphs at the end under "Do Not Rely On This Document As Legal Advice" seem like the _most_ interesting to me.
I've both read that page, and linked to it further down in this thread. Yes, I believe that Canonical's implementation is a GPL violation, but it doesn't need to be. So long as the source is in a separate package, and it's packaged as a kmod, it wouldn't be a GPL violation. It's worth considering, in my opinion, whether or not it'd be available for RHEL. It wouldn't be the first package RHEL doesn't have, but Fedora does. :)
On Monday, June 29, 2020, John M. Harris Jr johnmh@splentity.com wrote:
> On Monday, June 29, 2020 9:26:09 AM MST Matthew Miller wrote:
> > On Sun, Jun 28, 2020 at 09:59:52AM -0700, John M. Harris Jr wrote:
> > > > We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
> > > Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
> > I can't really speculate on Canonical's legal stance and I encourage everyone else to also not.
> > I can point to Red Hat's, though: the knowledge base article here https://access.redhat.com/solutions/79633 says:
> > - ZFS is not included in the upstream Linux kernel due to licensing reasons.
> > - Red Hat applies the upstream first policy for kernel modules (including filesystems). Without upstream presence, kernel modules like ZFS cannot be supported by Red Hat.
> > and "due to licensing reasons" links to https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/ which is quite interesting and quite long. If you have just time to read one section, the two paragraphs at the end under "Do Not Rely On This Document As Legal Advice" seem like the _most_ interesting to me.
> I've both read that page, and linked to it further down in this thread. Yes, I believe that Canonical's implementation is a GPL violation, but it doesn't need to be. So long as the source is in a separate package, and it's packaged as a kmod, it wouldn't be a GPL violation. It's worth considering, in my opinion, whether or not it'd be available for RHEL. It wouldn't be the first package RHEL doesn't have, but Fedora does. :)
That's not how the GPL works - using that argument you can link anything to a GPL only library as long as it is in a separate source tree (which is mostly the case).
It's either a derived work of the kernel and thus is bound by the GPL restrictions or it isn't. Does not matter in which tarball or rpm or $whatever it is in.
On Mon, Jun 29, 2020 at 10:20:17AM -0700, John M. Harris Jr wrote:
> I've both read that page, and linked to it further down in this thread. Yes, I believe that Canonical's implementation is a GPL violation, but it doesn't need to be. So long as the source is in a separate package, and it's packaged as a kmod, it wouldn't be a GPL violation. It's worth considering, in my opinion, whether or not it'd be available for RHEL. It wouldn't be the first package RHEL doesn't have, but Fedora does. :)
The Conservancy page does address source distribution as well, and they (as well as Red Hat's lawyers) have a different conclusion.
It's not a GPL violation. OpenZFS works under Linux through a compatibility layer called SPL, the Solaris Porting Layer. SPL is licensed under GPL. Torvalds himself said that a non-GPL file system that was written for another OS cannot be considered a derivative of the Linux kernel: https://yarchive.net/comp/linux/gpl_modules.html
SPL is a derived work from the Linux kernel because it's designed for the Linux kernel. SPL is therefore under GPL. ZFS is designed for Solaris and therefore a different license is fine.
Dell, a friggin huge US company, wouldn't distribute Ubuntu with their laptops if they as the distributor did something illegal.
On Monday, June 29, 2020 3:40:57 PM MST Markus S. wrote:
> It's not a GPL violation. OpenZFS works under Linux through a compatibility layer called SPL, the Solaris Porting Layer. SPL is licensed under GPL. Torvalds himself said that a non-GPL file system that was written for another OS cannot be considered a derivative of the Linux kernel: https://yarchive.net/comp/linux/gpl_modules.html
> SPL is a derived work from the Linux kernel because it's designed for the Linux kernel. SPL is therefore under GPL. ZFS is designed for Solaris and therefore a different license is fine.
> Dell, a friggin huge US company, wouldn't distribute Ubuntu with their laptops if they as the distributor did something illegal.
That's a good point, I didn't think about that. Additionally, having the context from Linus is very useful, thank you for that!
On Mon, 2020-06-29 at 12:26 -0400, Matthew Miller wrote:
> On Sun, Jun 28, 2020 at 09:59:52AM -0700, John M. Harris Jr wrote:
> > > We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
> > Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
> I can't really speculate on Canonical's legal stance and I encourage everyone else to also not.
> I can point to Red Hat's, though: the knowledge base article here https://access.redhat.com/solutions/79633 says:
> - ZFS is not included in the upstream Linux kernel due to licensing reasons.
> - Red Hat applies the upstream first policy for kernel modules (including filesystems). Without upstream presence, kernel modules like ZFS cannot be supported by Red Hat.
This is not fully true to my knowledge. Red Hat ships VDO and that is not even sent to upstream (yet?).
> and "due to licensing reasons" links to https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/ which is quite interesting and quite long. If you have just time to read one section, the two paragraphs at the end under "Do Not Rely On This Document As Legal Advice" seem like the _most_ interesting to me.
> -- Matthew Miller mattdm@fedoraproject.org Fedora Project Leader
-- Igor Raits ignatenkobrain@fedoraproject.org
Hi,
On 29/06/2020 19:54, Igor Raits wrote:
> On Mon, 2020-06-29 at 12:26 -0400, Matthew Miller wrote:
> > On Sun, Jun 28, 2020 at 09:59:52AM -0700, John M. Harris Jr wrote:
> > > > We cannot include ZFS in Fedora for legal reasons. Additionally, ZFS is not really intended for the laptop use case.
> > > Has that actually been explored? How does Canonical get around the legal issues with OpenZFS' licensing?
> > I can't really speculate on Canonical's legal stance and I encourage everyone else to also not.
> > I can point to Red Hat's, though: the knowledge base article here https://access.redhat.com/solutions/79633 says:
> > - ZFS is not included in the upstream Linux kernel due to licensing reasons.
> > - Red Hat applies the upstream first policy for kernel modules (including filesystems). Without upstream presence, kernel modules like ZFS cannot be supported by Red Hat.
> This is not fully true to my knowledge. Red Hat ships VDO and that is not even sent to upstream (yet?).
It has taken a bit longer than perhaps expected. However the intention is very much that it will go upstream,
Steve.
OpenZFS is frequently lagging behind in support for newer kernels which would work against Fedora's "rolling" approach to kernel releases.
Proxmox and Ubuntu don't feature rolling kernel releases. That's why they can ship OpenZFS (without legal problems, btw).
> OpenZFS is frequently lagging behind in support for newer kernels which would work against Fedora's "rolling" approach to kernel releases.
Yes, there is quite often a time delay between kernel releases and OpenZFS releases that contain compatibility patches. However, in my experience, the OpenZFS developers are aware of this and act rather quickly. I believe that if a project like Fedora were to switch to ZFS, this would not be an issue at all - ZFS compatibility patches are usually available early on during the kernel development cycle, the delay is mostly due to the lack of testing and review.
> Proxmox and Ubuntu don't feature rolling kernel releases. That's why they can ship OpenZFS (without legal problems, btw).
Would you care to elaborate why a rolling release kernel is not hit by any legal problems? I fail to see how that is relevant here, but then again, I am certainly not a lawyer and my understanding of the legal implications is rudimentary at best.
-Armin
On Fri, Jun 26, 2020 at 10:42:25AM -0400, Ben Cotton wrote:
> ==== Boot on Btrfs ====
> - Instead of a 1G ext4 boot, create a 1G Btrfs boot.
> - Advantage: Makes it possible to include in a snapshot and rollback regime. GRUB has stable support for Btrfs for 10+ years.
GRUB2 btrfs support tends to lag a bit. Users would need to be careful not to enable some btrfs features before GRUB2 supports them.
> - Scope: Contingent on bootloader and installer team review and approval. blivet should use <code>mkfs.btrfs --mixed</code>.
When going with btrfs /boot, you can forgo a separate partition and just make a /boot subvolume in the main pool.
Advantage: fewer partitions.
Disadvantages: using encryption is harder. GRUB2 supports only LUKS1 encryption (AFAIK). Obviously, there is no Plymouth integration, so the password would have to be entered at least twice. When not using encryption, the above is not a problem.
On 6/26/20 8:23 AM, Tomasz Torcz wrote:
> On Fri, Jun 26, 2020 at 10:42:25AM -0400, Ben Cotton wrote:
> > ==== Boot on Btrfs ====
> ...
> When going with btrfs /boot, you can forgo a separate partition and just make a /boot subvolume in the main pool.
> Advantage: fewer partitions.
> Disadvantages: using encryption is harder. GRUB2 supports only LUKS1 encryption (AFAIK). Obviously, there is no Plymouth integration, so the password would have to be entered at least twice. When not using encryption, the above is not a problem.
Once there's native btrfs encryption, agreed, /boot should just be a separate subvolume where we make sure we don't turn on any unsupported features.
Regards,
On 6/26/20 5:23 PM, Tomasz Torcz wrote:
> Disadvantages: using encryption is harder. GRUB2 supports only LUKS1 encryption (AFAIK). Obviously, there is no Plymouth integration, so the password would have to be entered at least twice. When not using encryption, the above is not a problem.
There's support for LUKS2 already waiting for next GRUB2 upstream release, AFAICT: https://git.savannah.gnu.org/cgit/grub.git/commit/?id=365e0cc3e7e44151c14dd2...
Regards O.
On Fr, 26.06.20 10:42, Ben Cotton (bcotton@redhat.com) wrote:
If this is decided to be the way to go, please work with kernel maintainers to make btrfs.ko a built-in kernel module, so that initrd-less boots work... (it's kinda pointless anyway to have something as a module that is now gonna be used by most people anyway, it just slows things down for little benefit)
Lennart
-- Lennart Poettering, Berlin
On Fri, 2020-06-26 at 17:30 +0200, Lennart Poettering wrote:
> On Fr, 26.06.20 10:42, Ben Cotton (bcotton@redhat.com) wrote:
> If this is decided to be the way to go, please work with kernel maintainers to make btrfs.ko a built-in kernel module, so that initrd-less boots work... (it's kinda pointless anyway to have something as a module that is now gonna be used by most people anyway, it just slows things down for little benefit)
Good point, we'll make sure to not forget about it.
> Lennart
> -- Lennart Poettering, Berlin
-- Igor Raits ignatenkobrain@fedoraproject.org
On Fri, Jun 26, 2020 at 10:30 AM Lennart Poettering mzerqung@0pointer.de wrote:
> On Fr, 26.06.20 10:42, Ben Cotton (bcotton@redhat.com) wrote:
> If this is decided to be the way to go, please work with kernel maintainers to make btrfs.ko a built-in kernel module, so that initrd-less boots work... (it's kinda pointless anyway to have something as a module that is now gonna be used by most people anyway, it just slows things down for little benefit)
That would make sense if this were decided. My big issue with this is we have no internal RH expertise on btrfs, and would be entirely dependent on the upstream community for support. There are instances of CVEs that get ignored for long periods of time, CVE-2019-19378 and CVE-2019-19448 being the current examples, with the latter being not a huge deal, but still an outstanding CVE. In general btrfs CVEs tend to stick around longer than XFS and ext4 before a fix is pushed upstream. The Fedora kernel supports btrfs, it has for quite some time, and that doesn't change regardless of the outcome of this proposal. I honestly cannot tell you what the stability would be like spread across the majority of Fedora users, because not being default, the typical btrfs user probably currently has a better understanding of what they are getting into. While the lack of internal RH expertise makes me lean against this proposal, I believe the Desktop SIG with FESCo should be able to make the decisions for defaults on the desktop spin.
On Fri, Jun 26, 2020 at 8:45 AM Ben Cotton bcotton@redhat.com wrote:
Related: Chromebooks are using btrfs in a particular way. ChromeOS has something called Crostini which is a set of technologies they use for enabling native Linux app support. This is LXC/LXD based, and uses a (I think per user) loop mounted file that is btrfs, and they leverage btrfs snapshotting for the containers.
Once upon a time, Ben Cotton bcotton@redhat.com said:
> For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
So... I freely admit I have not looked closely at btrfs in some time, so I could be out of date (and my apologies if so). One issue that I have seen mentioned as an issue within the last week is still the problem of running out of space when it still looks like there's space free. I didn't read the responses, so not sure of the resolution, but I remember that being a "thing" with btrfs. Is that still the case? What are the causes, and if so, how can we keep from getting a lot of the same question on mailing lists/forums/etc.?
I'm pretty neutral on this... I run a bunch of RHEL/CentOS systems, so I tend to stick close to that on my Fedora systems (so I'd probably stick with ext4/xfs on LVM myself). I remember when btrfs was going to be the one FS to rule them all, but then had issues, and specific weird cases (like with VM images IIRC at one point), and kind of fell off my map then. That is not intended as a criticism - filesystems are complex, and developing them is hard... I think some of the reputation came from some people pushing btrfs before it was really ready.
On Fri, Jun 26, 2020 at 1:31 PM Chris Adams linux@cmadams.net wrote:
> Once upon a time, Ben Cotton bcotton@redhat.com said:
> > For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
> So... I freely admit I have not looked closely at btrfs in some time, so I could be out of date (and my apologies if so). One issue that I have seen mentioned as an issue within the last week is still the problem of running out of space when it still looks like there's space free. I didn't read the responses, so not sure of the resolution, but I remember that being a "thing" with btrfs. Is that still the case? What are the causes, and if so, how can we keep from getting a lot of the same question on mailing lists/forums/etc.?
Josef gave a fairly detailed answer upthread: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/...
However, I'll give some of my own color on this, as well. I have not personally experienced this issue on any of my systems in the past three years. I experienced it a couple of times when I first started out using it in 2014~2015, but it's not been a problem for me since.
We could stand to have some improved documentation here, and I hope this is something we can build up to support our user community. I'm sure there's some documentation from our friends at openSUSE that we can borrow as well.
> I'm pretty neutral on this... I run a bunch of RHEL/CentOS systems, so I tend to stick close to that on my Fedora systems (so I'd probably stick with ext4/xfs on LVM myself). I remember when btrfs was going to be the one FS to rule them all, but then had issues, and specific weird cases (like with VM images IIRC at one point), and kind of fell off my map then. That is not intended as a criticism - filesystems are complex, and developing them is hard... I think some of the reputation came from some people pushing btrfs before it was really ready.
I absolutely agree. I've often wondered if Btrfs would have a better reputation if it had been developed for a few years behind closed doors before being unveiled. I think the way people perceive the filesystem would be very different then.
Thankfully, I think today we're in a very good place with Btrfs upstream, and having Josef (an upstream Btrfs developer) helping drive this in Fedora makes me very confident in this change.
On 6/26/20 1:43 PM, Neal Gompa wrote:
> > One issue that I have seen mentioned as an issue within the last week is still the problem of running out of space when it still looks like there's space free. I didn't read the responses, so not sure of the resolution, but I remember that being a "thing" with btrfs. Is that still the case? What are the causes, and if so, how can we keep from getting a lot of the same question on mailing lists/forums/etc.?
> Josef gave a fairly detailed answer upthread:
In this reply, he does not specifically address the disk-full issue, but I seem to remember that it was resolved. I couldn't however find a reference---could someone authoritatively say something one way or another?
On Fri, 26 Jun 2020 12:30:02 -0500 Chris Adams linux@cmadams.net wrote:
> So... I freely admit I have not looked closely at btrfs in some time, so I could be out of date (and my apologies if so). One issue that I have seen mentioned as an issue within the last week is still the problem of running out of space when it still looks like there's space free. I didn't read the responses, so not sure of the resolution, but I remember that being a "thing" with btrfs. Is that still the case? What are the causes, and if so, how can we keep from getting a lot of the same question on mailing lists/forums/etc.?
Yes, it happened to me last week. The workstation has been upgraded since F25 and is now at F31. A yum update last week ran a restorecon -r / which filled up the filesystem and RAM and swap. The 460 GB filesystem had about 140GB of real data, 100 GB of data bloat from underfull blocks, and the rest (200GB) was metadata. I had to boot from a live USB and run btrfs balance to free up the bloat. I expect to reformat it to ext4 when the quarantine is over.
This is my last BTRFS filesystem. One was on a laptop hard disk that was painfully slow, especially when compared with its ext4 twin sitting next to it. It was reformatted to ext4. I also had a BTRFS RAID 0 hard disk array. It was also slow and also ended up needing rescue. I converted it over to xfs on MD raid and it's been faster and perfectly reliable ever since.
While I like subvolumes and snapshots, I find the maintenance, reliability, and performance overhead to be not worth it.
Not recommended.
Jim
On Fri, Jun 26, 2020 at 12:58 pm, James Szinger jszinger@gmail.com wrote:
> Yes, it happened to me last week. The workstation has been upgraded since F25 and is now at F31. A yum update last week ran a restorecon -r / which filled up the filesystem and RAM and swap. The 460 GB filesystem had about 140GB of real data, 100 GB of data bloat from underfull blocks, and the rest (200GB) was metadata. I had to boot from a live USB and run btrfs balance to free up the bloat. I expect to reformat it to ext4 when the quarantine is over.
Could the proposal owners comment on this, please? Sounds really bad.
On 6/26/20 2:58 PM, James Szinger wrote:
On Fri, 26 Jun 2020 12:30:02 -0500 Chris Adams linux@cmadams.net wrote:
So... I freely admit I have not looked closely at btrfs in some time, so I could be out of date (and my apologies if so). One issue that I have seen mentioned as an issue within the last week is still the problem of running out of space when it still looks like there's space free. I didn't read the responses, so not sure of the resolution, but I remember that being a "thing" with btrfs. Is that still the case? What are the causes, and if so, how can we keep from getting a lot of the same question on mailing lists/forums/etc.?
Yes, it happened to me last week. The workstation has been upgraded since F25 and is now at F31. A yum update last week ran a restorecon -r / which filled up the filesystem and RAM and swap. The 460 GB filesystem had about 140GB of real data, 100 GB of data bloat from underfull blocks, and the rest (200GB) was metadata. I had to boot from a live USB and run btrfs balance to free up the bloat. I expect to reformat it to ext4 when the quarantine is over.
This is my last BTRFS filesystem. One was on a laptop hard disk that was painfully slow, especially when compared with its ext4 twin sitting next to it. It was reformatted to ext4. I also had a BTRFS RAID 0 hard disk array. It was also slow and also ended up needing rescue. I converted it over to xfs on MD raid and it's been faster and perfectly reliable ever since.
While I like subvolumes and snapshots, I find the maintenance, reliability, and performance overhead to be not worth it.
Not recommended.
Generally speaking btrfs performance has been the same if not better for our workloads. This is millions of boxes with thousands of different workloads and performance requirements.
That being said I can make btrfs look really stupid on some workloads. There's going to be cases where Btrfs isn't awesome. We still use xfs for all our storage related tiers (think databases). Performance is always going to be workload dependent, and Btrfs has built in overhead out the gate because of checksumming and the fact that we generate far more metadata.
As for your ENOSPC issue, I've made improvements on that area. I see this in production as well, I have monitoring in place to deal with the machine before it gets to this point. That being said if you run the box out of metadata space things get tricky to fix. I've been working my way down the list of issues in this area for years, this last go around of patches I sent were in these corner cases.
I described this case to the working group last week, because it hit us in production this winter. Somebody screwed up and suddenly pushed 2 extra copies of the whole website to everybody's VM. The website is mostly metadata, because of the inline extents, so it exhausted everybody's metadata space. Tens of thousands of machines affected. Of those machines I had to hand boot and run balance on ~20 of them to get them back. The rest could run balance from the automation and recover cleanly.
It's a shit user experience, and it's a shitty corner case that still needs work. It's a top priority of mine. Thanks,
Josef
On Fri, Jun 26, 2020 at 03:22:07PM -0400, Josef Bacik wrote:
I described this case to the working group last week, because it hit us in production this winter. Somebody screwed up and suddenly pushed 2 extra copies of the whole website to everybody's VM. The website is mostly metadata, because of the inline extents, so it exhausted everybody's metadata space. Tens of thousands of machines affected. Of those machines I had to hand boot and run balance on ~20 of them to get them back. The rest could run balance from the automation and recover cleanly.
Is there a way to mitigate this by reserving space or setting quotas? Users running out of space on their laptops because:
* they downloaded a lot of media
* they created huge vms
* some sort of horrible log thing gone awry
are pretty common in both a) my anecdotal experience helping people professionally and personally and b) um, me.
One question: are we looking at using a layout like openSUSE is doing ( https://en.opensuse.org/SDB:BTRFS ), utilizing subvolumes, or are we looking at something like
/boot/efi > EFI (FAT32)
/ > btrfs
On Fri, Jun 26, 2020 at 4:45 PM Matthew Miller mattdm@fedoraproject.org wrote:
On Fri, Jun 26, 2020 at 03:22:07PM -0400, Josef Bacik wrote:
I described this case to the working group last week, because it hit us in production this winter. Somebody screwed up and suddenly pushed 2 extra copies of the whole website to everybody's VM. The website is mostly metadata, because of the inline extents, so it exhausted everybody's metadata space. Tens of thousands of machines affected. Of those machines I had to hand boot and run balance on ~20 of them to get them back. The rest could run balance from the automation and recover cleanly.
Is there a way to mitigate this by reserving space or setting quotas? Users running out of space on their laptops because:
- they downloaded a lot of media
- they created huge vms
- some sort of horrible log thing gone awry
are pretty common in both a) my anecdotal experience helping people professionally and personally and b) um, me.
-- Matthew Miller mattdm@fedoraproject.org Fedora Project Leader _______________________________________________ devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Fri, Jun 26, 2020 at 6:11 PM Alex Thomas karlthane@gmail.com wrote:
One question: are we looking at using a layout like openSUSE is doing ( https://en.opensuse.org/SDB:BTRFS ), utilizing subvolumes, or are we looking at something like
/boot/efi > EFI (FAT32)
/ > btrfs
We are planning on using Fedora's current default layout, which has a subvolume for / and a subvolume for /home.
Ok, I thought I saw a proposal by you to change the default btrfs layout to something like openSUSE's using subvolumes, but now, of course, I cannot find it.
On Fri, Jun 26, 2020 at 5:25 PM Neal Gompa ngompa13@gmail.com wrote:
On Fri, Jun 26, 2020 at 6:11 PM Alex Thomas karlthane@gmail.com wrote:
One question: are we looking at using a layout like openSUSE is doing ( https://en.opensuse.org/SDB:BTRFS ), utilizing subvolumes, or are we looking at something like
/boot/efi > EFI (FAT32)
/ > btrfs
We are planning on using Fedora's current default layout, which has a subvolume for / and a subvolume for /home.
Ok, I thought I saw a proposal by you to change the default btrfs layout to something like openSUSE's using subvolumes, but now, of course, I cannot find it.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Fri, Jun 26, 2020 at 6:37 PM Alex Thomas karlthane@gmail.com wrote:
Ok, I thought I saw a proposal by you to change the default btrfs layout to something like openSUSE's using subvolumes, but now, of course, I cannot find it.
I have, at various points in the past couple of years, considered different subvolume configurations. Right now, I'm keeping it simple to our currently tested configuration: /boot on ext4, / and /home as btrfs subvolumes on a single btrfs volume.
The only modification I may consider would be moving /boot to be btrfs volume or subvolume, but that's contingent on some discussion with the installer and bootloader teams.
However, the existing configuration works *very* well right now.
-- 真実はいつも一つ!/ Always, there's only one truth!
On Fri, 26 Jun 2020 at 23:21, Alex Thomas karlthane@gmail.com wrote:
One question: are we looking at using a layout like openSUSE is doing ( https://en.opensuse.org/SDB:BTRFS ), utilizing subvolumes, or are we looking at something like
/boot/efi > EFI (FAT32)
/ > btrfs
BTW, about that layout: Anaconda still does not allow installing something like that, because it does not allow /boot on btrfs (technically there is no reason to require that, and /boot can simply be a subvolume on the root btrfs pool).
kloczek
On Fri, Jun 26, 2020 at 5:31 PM Tomasz Kłoczko kloczko.tomasz@gmail.com wrote:
On Fri, 26 Jun 2020 at 23:21, Alex Thomas karlthane@gmail.com wrote:
One question: are we looking at using a layout like openSUSE is doing ( https://en.opensuse.org/SDB:BTRFS ), utilizing subvolumes, or are we looking at something like
/boot/efi > EFI (FAT32)
/ > btrfs
BTW, about that layout: Anaconda still does not allow installing something like that, because it does not allow /boot on btrfs (technically there is no reason to require that, and /boot can simply be a subvolume on the root btrfs pool).
kloczek
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH
Something to think about when trying to set up for snapshot/rollbacks, then.
On Friday, 26 June 2020 at 23:28 +0100, Tomasz Kłoczko wrote:
On Fri, 26 Jun 2020 at 23:21, Alex Thomas karlthane@gmail.com wrote:
One question: are we looking at using a layout like openSUSE is doing ( https://en.opensuse.org/SDB:BTRFS ), utilizing subvolumes, or are we looking at something like
/boot/efi > EFI (FAT32)
/ > btrfs
BTW, about that layout: Anaconda still does not allow installing something like that, because it does not allow /boot on btrfs (technically there is no reason to require that, and /boot can simply be a subvolume on the root btrfs pool).
Anaconda will detect you’re reusing an efi partition, and complain it does not fit its requirements of the day, and force you to recreate it from scratch, blowing up the EFI parts installed by other systems for their own boot in the process.
Thus, Anaconda EFI support is terrible, period.
On 6/26/20 5:44 PM, Matthew Miller wrote:
On Fri, Jun 26, 2020 at 03:22:07PM -0400, Josef Bacik wrote:
I described this case to the working group last week, because it hit us in production this winter. Somebody screwed up and suddenly pushed 2 extra copies of the whole website to everybody's VM. The website is mostly metadata, because of the inline extents, so it exhausted everybody's metadata space. Tens of thousands of machines affected. Of those machines I had to hand boot and run balance on ~20 of them to get them back. The rest could run balance from the automation and recover cleanly.
Is there a way to mitigate this by reserving space or setting quotas? Users running out of space on their laptops because:
- they downloaded a lot of media
- they created huge vms
- some sort of horrible log thing gone awry
are pretty common in both a) my anecdotal experience helping people professionally and personally and b) um, me.
There's a difference between data ENOSPC and metadata ENOSPC. And again, this is a pretty specific failure case. Obviously it's not impossible to hit, but it's not something that's going to be a common occurrence. We've hit these issues in production twice. The first was the thing that I mentioned, which had 750 GiB filesystems completely full with 20 GiB of metadata completely filled up.
The second was a bad service that was spewing empty files onto the disk, slowly filling up the metadata chunks, coupled with a bug in how we allocated data and metadata chunks. The chunk allocation thing has been fixed for a year or two now. This isn't something a normal user is going to hit most of the time. It obviously does happen, I'm aware of it, and I've made progress on making it less likely to get you into a "call Josef" situation. I'm sure there's still more work to be done, but there's continual progress on this particular edge case. Thanks,
Josef
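To make the data versus metadata distinction concrete, here is a minimal Python sketch of the kind of check a monitoring job might run. It is not taken from the thread: the function names `parse_btrfs_df` and `nearly_full` are hypothetical, the 90% threshold is arbitrary, and the sample text is made up for illustration, modeled on the general shape of `btrfs filesystem df` output (that exact format is an assumption here).

```python
import re

# Binary unit suffixes as they appear in the (assumed) report format.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_LINE = re.compile(r"^(\w+), (\w+): total=([\d.]+)(\w+), used=([\d.]+)(\w+)$")


def parse_btrfs_df(text):
    """Map each block group type to (used_bytes, allocated_bytes)."""
    usage = {}
    for line in text.strip().splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # ignore lines that do not fit the assumed format
        kind, _profile, total, total_unit, used, used_unit = match.groups()
        usage[kind] = (float(used) * _UNITS[used_unit],
                       float(total) * _UNITS[total_unit])
    return usage


def nearly_full(usage, threshold=0.9):
    """Return the block group types whose allocated space is mostly used."""
    return sorted(kind for kind, (used, total) in usage.items()
                  if total and used / total >= threshold)


# Made-up sample: plenty of data space left, metadata almost exhausted.
SAMPLE = """
Data, single: total=700.00GiB, used=410.00GiB
System, DUP: total=32.00MiB, used=96.00KiB
Metadata, DUP: total=20.00GiB, used=19.80GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

print(nearly_full(parse_btrfs_df(SAMPLE)))
```

This is only a rough signal: a full metadata allocation is a problem mainly when no unallocated space is left for new metadata chunks, which this sketch does not look at.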
On Fri, Jun 26, 2020 at 3:44 PM Matthew Miller mattdm@fedoraproject.org wrote:
On Fri, Jun 26, 2020 at 03:22:07PM -0400, Josef Bacik wrote:
I described this case to the working group last week, because it hit us in production this winter. Somebody screwed up and suddenly pushed 2 extra copies of the whole website to everybody's VM. The website is mostly metadata, because of the inline extents, so it exhausted everybody's metadata space. Tens of thousands of machines affected. Of those machines I had to hand boot and run balance on ~20 of them to get them back. The rest could run balance from the automation and recover cleanly.
Is there a way to mitigate this by reserving space or setting quotas? Users running out of space on their laptops because:
- they downloaded a lot of media
- they created huge vms
- some sort of horrible log thing gone awry
are pretty common in both a) my anecdotal experience helping people professionally and personally and b) um, me.
Real out-of-space can happen on any file system. Bogus ENOSPC on btrfs, caused by edge cases hitting bugs, is less common than real ENOSPC caused by the current partitioning arrangement, which makes /home and / compete for free space; that competition won't exist with btrfs. I expect a net reduction in out-of-space conditions as a result of the change.
There is a reserve in btrfs to help make sure if you do get to a real out of space condition, that the file system (a) stays read write and (b) can be backed out of the full condition by deleting files and successfully freeing up space. Edge cases where this doesn't work are bugs, and there are some non-obvious ways to back out of it if someone does hit one.
The old stories on #btrfs and linux-btrfs@ do include cases of a file system that goes read-only, can't be remounted read-write, and you have to backup->reformat->restore. And that is a PITA. But also not data loss.
* Josef Bacik:
As for your ENOSPC issue, I've made improvements on that area. I see this in production as well, I have monitoring in place to deal with the machine before it gets to this point. That being said if you run the box out of metadata space things get tricky to fix. I've been working my way down the list of issues in this area for years, this last go around of patches I sent were in these corner cases.
Is there anything we need to do in userspace to improve the behavior of fflush and similar interfaces?
This is not strictly a btrfs issue: Some of us are worried about scenarios where the write system call succeeds and the data never makes it to storage *without a catastrophic failure*. (I do not consider running out of disk space a catastrophic failure.) NFS apparently has this property, and you have to call fsync or close the descriptor to detect this. fsync is not desirable due to its performance impact.
Hi,
On 27/06/2020 11:00, Florian Weimer wrote:
- Josef Bacik:
As for your ENOSPC issue, I've made improvements on that area. I see this in production as well, I have monitoring in place to deal with the machine before it gets to this point. That being said if you run the box out of metadata space things get tricky to fix. I've been working my way down the list of issues in this area for years, this last go around of patches I sent were in these corner cases.
Is there anything we need to do in userspace to improve the behavior of fflush and similar interfaces?
This is not strictly a btrfs issue: Some of us are worried about scenarios where the write system call succeeds and the data never makes it to storage *without a catastrophic failure*. (I do not consider running out of disk space a catastrophic failure.) NFS apparently has this property, and you have to call fsync or close the descriptor to detect this. fsync is not desirable due to its performance impact.
It doesn't matter which filesystem you use, you can't be sure that the data is really safe on disk without calling fsync. In the case of a new inode, that means fsync on the file and on the containing directory.
There can be performance issues depending on how that is done, however there are a number of solutions to those issues which can reduce the performance effects to the point where they are usually no longer a problem. That is with the caveat that slow storage will always be slow, of course!
The usual tricks are to avoid doing lots of small fsyncs, by gathering up smaller files, ideally sorting them into inode number order for local filesystems, and then issuing fsyncs asynchronously, waiting for them all only once all the fsyncs have been issued. Also, fadvise/madvise can be useful in these situations.
Steve.
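The advice above can be sketched roughly in Python as follows. This is an illustration under stated assumptions, not code from the thread: `durable_write` and `fsync_batch` are hypothetical helper names, a thread pool stands in for "issuing fsyncs asynchronously", and opening a file read-only just to fsync it is assumed to be acceptable (it is on Linux).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _fsync_path(path):
    """Open a file or directory, fsync it, and close it again."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_write(path, data):
    """Write data to path, then fsync the file and its containing directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; keep going.
            view = view[os.write(fd, view):]
        os.fsync(fd)
    finally:
        os.close(fd)
    # A new inode also needs its directory entry persisted.
    _fsync_path(os.path.dirname(os.path.abspath(path)))


def fsync_batch(paths, workers=4):
    """fsync already-written files in inode order, waiting once at the end."""
    ordered = sorted(paths, key=lambda p: os.stat(p).st_ino)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every fsync and re-raises the first failure.
        list(pool.map(_fsync_path, ordered))
    for directory in sorted({os.path.dirname(os.path.abspath(p)) for p in paths}):
        _fsync_path(directory)
    return ordered
```

`durable_write` is the simple one-file case; `fsync_batch` is meant for the "many small files" case, where the files are written first and synced together afterwards.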
* Steven Whitehouse:
On 27/06/2020 11:00, Florian Weimer wrote:
- Josef Bacik:
As for your ENOSPC issue, I've made improvements on that area. I see this in production as well, I have monitoring in place to deal with the machine before it gets to this point. That being said if you run the box out of metadata space things get tricky to fix. I've been working my way down the list of issues in this area for years, this last go around of patches I sent were in these corner cases.
Is there anything we need to do in userspace to improve the behavior of fflush and similar interfaces?
This is not strictly a btrfs issue: Some of us are worried about scenarios where the write system call succeeds and the data never makes it to storage *without a catastrophic failure*. (I do not consider running out of disk space a catastrophic failure.) NFS apparently has this property, and you have to call fsync or close the descriptor to detect this. fsync is not desirable due to its performance impact.
It doesn't matter which filesystem you use, you can't be sure that the data is really safe on disk without calling fsync. In the case of a new inode, that means fsync on the file and on the containing directory.
In my opinion, there is a conceptual difference between the machine or storage crashing hard, and just running out of disk space.
There can be performance issues depending on how that is done, however there are a number of solutions to those issues which can reduce the performance effects to the point where they are usually no longer a problem. That is with the caveat that slow storage will always be slow, of course!
The usual tricks are to avoid doing lots of small fsyncs, by gathering up smaller files, ideally sorting them into inode number order for local filesystems, and then issuing fsyncs asynchronously, waiting for them all only once all the fsyncs have been issued. Also, fadvise/madvise can be useful in these situations.
None of this applies to shell utilities such as grep and cat. They work around data loss as a result of the write system call not reporting ENOSPC errors: they close stdout and stderr underneath glibc, which leads to a different class of problems. It turns out that on Linux, close does more space checks than write, so this allows the shell utilities to check for ENOSPC without issuing fsyncs. At present, lack of space checks from write seems to primarily happen with NFS.
So let me rephrase: Does btrfs report ENOSPC during write? If it does not, what can we do to check for sufficient space during fflush and similar operations?
If we change the shell utilities to do an fsync on close, we get traditional UNIX behavior with traditional UNIX performance. I don't think that's what people want.
Thanks, Florian
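For illustration, a minimal Python sketch of the pattern described above: check both write and close for errors, without calling fsync. The helper name `write_all_checked` is hypothetical, and this is not how grep or cat are implemented (they are C programs); it only shows where an ENOSPC could surface.

```python
import errno
import os


def write_all_checked(path, data):
    """Write data to path; return None on success or the errno of the first failure.

    No fsync is issued: errors are looked for only where write(2) and
    close(2) report them.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # A file system that checks space at write time fails here.
            view = view[os.write(fd, view):]
    except OSError as exc:
        try:
            os.close(fd)
        except OSError:
            pass  # the write error is the one we report
        return exc.errno
    try:
        # Some file systems (NFS is the example in the thread) may only
        # report ENOSPC at this point.
        os.close(fd)
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    # On Linux, /dev/full fails every write with ENOSPC, which makes it a
    # convenient way to exercise the error path.
    if os.path.exists("/dev/full"):
        result = write_all_checked("/dev/full", b"x")
        print(errno.errorcode.get(result, result))
```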
* Josef Bacik:
That being said I can make btrfs look really stupid on some workloads. There's going to be cases where Btrfs isn't awesome. We still use xfs for all our storage related tiers (think databases). Performance is always going to be workload dependent, and Btrfs has built in overhead out the gate because of checksumming and the fact that we generate far more metadata.
Just to be clear here, the choice of XFS here is purely based on performance, not on the reliability of the file systems, right? (So it's not “all the really important data is stored in XFS”.)
Thanks, Florian
On 6/29/20 5:33 AM, Florian Weimer wrote:
- Josef Bacik:
That being said I can make btrfs look really stupid on some workloads. There's going to be cases where Btrfs isn't awesome. We still use xfs for all our storage related tiers (think databases). Performance is always going to be workload dependent, and Btrfs has built in overhead out the gate because of checksumming and the fact that we generate far more metadata.
Just to be clear here, the choice of XFS here is purely based on performance, not on the reliability of the file systems, right? (So it's not “all the really important data is stored in XFS”.)
Yes, that's correct. At our scale everything falls over, including XFS, and as I've stated elsewhere in this thread we actually see a higher rate of failure (relative to the install size) with XFS. The databases we use already do all of the fancy things that btrfs does in the application. If we could get away with it we'd just use raw disks for those applications, and in fact we may do that in the future. Thanks,
Josef
On 6/29/20 8:39 AM, Josef Bacik wrote:
Yes, that's correct. At our scale everything falls over, including XFS, and as I've stated elsewhere in this thread we actually see a higher rate of failure (relative to the install size) with XFS. The databases we use already do all of the fancy things that btrfs does in the application. If we could get away with it we'd just use raw disks for those applications, and in fact we may do that in the future. Thanks,
Josef, with my XFS hat on, are these recent failures? Have they all been reported to the XFS list?
It makes sense to look at reliability in the context of this thread, but offering "btrfs fails less often than XFS for us" without any context (what kind of failure, what kernel, when, etc) doesn't help much, it's just more anecdotes.
Thanks, -Eric
On 6/29/20 2:23 PM, Eric Sandeen wrote:
Josef, with my XFS hat on, are these recent failures? Have they all been reported to the XFS list?
It makes sense to look at reliability in the context of this thread, but offering "btrfs fails less often than XFS for us" without any context (what kind of failure, what kernel, when, etc) doesn't help much, it's just more anecdotes.
Yup this is why I try to avoid talking about other file systems. This shouldn't be interpreted as "XFS drools, btrfs rules!", just that in our own environment, btrfs does not fail at any significant rate higher than xfs.
Xfs is used in completely different workloads, and with completely different (much better) hardware.
And the reason they haven't been brought up to the list is because it fails at such a low rate that I didn't even realize we were having xfs reprovisions until I went and looked at the data. So far of the 15 machines that fell over, 10 of them appear to be hardware related. The other 5 have logs that are in a different database that take longer to pull out. Thanks,
Josef
On 6/29/20 1:47 PM, Josef Bacik wrote:
Yup this is why I try to avoid talking about other file systems. This shouldn't be interpreted as "XFS drools, btrfs rules!", just that in our own environment, btrfs does not fail at any significant rate higher than xfs.
Xfs is used in completely different workloads, and with completely different (much better) hardware.
And the reason they haven't been brought up to the list is because it fails at such a low rate that I didn't even realize we were having xfs reprovisions until I went and looked at the data. So far of the 15 machines that fell over, 10 of them appear to be hardware related. The other 5 have logs that are in a different database that take longer to pull out. Thanks,
Josef
Thanks for the context, Josef, I appreciate it.
-Eric
On Mon, Jun 29, 2020 at 11:33:40AM +0200, Florian Weimer wrote:
Just to be clear here, the choice of XFS here is purely based on performance, not on the reliability of the file systems, right? (So it's not “all the really important data is stored in XFS”.)
Be careful about overloading quite a few definitions into the single word "reliability".
You seem to be referring to btrfs features like file checksumming that can detect silent corruption, and automagically fix things if you've enabled the equally automagic RAID1-like features. (Which, for the record, I think are really frickin' awesome!)
But what good is btrfs' attestation of file integrity when it craps itself to the point where it doesn't even know those files even _exist_ anymore? How can we brag about robustness in the face of cosmic rays or recovery from the power cord getting yanked when it couldn't reliably _remount_ a lightly-used, cleanly unmounted filesystem?
By that "reliability" metric, for me XFS has been infinitely better than btrfs; sure, XFS can't automagically tell me if an individual file got corrupted (much less fix it), but it's also never eaten entire filesystems across clean unmount/mount cycles. Whereas btrfs has done so, twice.
I realize this is several-years-out-of-date anecdata, but it's the sort of thing that has given btrfs (quite deservedly) a very bad reputation.
(BTW, I didn't try using XFS until after my second bad btrfs experience. It hasn't so much as hiccupped on me since, proving to be the most robust filesystem I've ever used...)
I concede that my experience is outdated, and am willing to take the btrfs authors at their word that the bugs that led to my filesystems eating themselves have been fixed, but.. sure, the btrfs proponents say it's "ready for production" now, but they also said that back then, too.
So. Instead of making btrfs the default for F33, perhaps a better approach is to plan to make it the default for F34, and use the F33 cycle to encourage folks (e.g. release notes, installer prompting?) to try using btrfs.
The point here is not F32 vs F33 or whatever, but that of _time_ -- I don't think there's enough time between now and the F33 go/no-go point for folks like me to set up and sufficiently burn-in F32 btrfs systems to gain confidence that btrfs is indeed ready. (In any case, the traditional beta period is _way_ too short for something like this!)
- Solomon
* Solomon Peachy:
On Mon, Jun 29, 2020 at 11:33:40AM +0200, Florian Weimer wrote:
Just to be clear here, the choice of XFS here is purely based on performance, not on the reliability of the file systems, right? (So it's not “all the really important data is stored in XFS”.)
Be careful about overloading quite a few definitions into the single word "reliability".
You seem to be referring to btrfs features like file checksumming that
No, I was not. To me, for file systems, it means that under conditions I personally consider reasonable (generally healthy hardware, and only the occasional hard power-off after a system becomes unresponsive), the file system can be mounted, retains consistent metadata, and most of the data is still there, with the possible exception of things that have been written a short time after the crash.
It's not about getting the best out of partially faulty hardware or an execution environment with frequent power outages.
As you point out, historically, checksumming file systems weren't very good at this, but I think for btrfs, this has improved. (For a long time, there was a FAQ for a different checksumming file system that had something along the lines of “Q: I can't mount my file system due to a checksum error. A: Restore from backup.”.)
Thanks, Florian
On Mon, Jun 29, 2020 at 7:55 AM Solomon Peachy pizza@shaftnet.org wrote:
On Mon, Jun 29, 2020 at 11:33:40AM +0200, Florian Weimer wrote:
Just to be clear here, the choice of XFS here is purely based on performance, not on the reliability of the file systems, right? (So it's not “all the really important data is stored in XFS”.)
Be careful about overloading quite a few definitions into the single word "reliability".
You seem to be referring to btrfs features like file checksumming that can detect silent corruption, and automagically fix things if you've enabled the equally automagic RAID1-like features. (Which, for the record, I think are really frickin' awesome!)
But what good is btrfs' attestation of file integrity when it craps itself to the point where it doesn't even know those files even _exist_ anymore?
You've got an example where 'btrfs restore' saw no files at all? And you think it's the file system rather than the hardware, why?
I think this is the wrong metaphor because it suggests btrfs caused the crapping. The sequence is: btrfs does the right thing, drive firmware craps itself and there's a power failure or a crash. Btrfs in the ordinary case doesn't care and boots without complaint. In the far less common case some critical node just happened to get nerfed and there's no way to automatically recover. The user is left on an island. This part should get better anyway, even though it can happen with any file system.
And as a community we need the user to user support to make sure folks aren't left on an island - can we do that? This is the question. It really is a community question more than it is a technology question.
How can we brag about robustness in the face of cosmic rays or recovery from the power cord getting yanked when it couldn't reliably _remount_ a lightly-used, cleanly unmounted filesystem?
Come on. It's cleanly unmounted and doesn't mount?
I guess you missed the other emails about dm-log-writes and xfstests, but they directly relate here. Josef relayed that all of his deep dives into Btrfs failures since the dm-log-writes work have been traced back to hardware doing the wrong thing.
All file systems have write ordering expectations. If the hardware doesn't honor that, it's trouble if there's a crash. What you're describing is 100% a hardware crapped itself case. You said it cleanly unmounted i.e. the exact correct write ordering did happen. And yet the file system can't be mounted again. That's a hardware failure.
I realize this is several-years-out-of-date anecdata, but it's the sort of thing that has given btrfs (quite deservedly) a very bad reputation.
The frustration and skepticism are palpable. But here is the problem with the road you're going down: you are arguing in favor of closed door development practices. Keep all the scary early development out of public scrutiny, as a form of messaging control, so that reputation isn't damaged by knowing about all the sausage making.
The point here is not F32 vs F33 or whatever, but that of _time_ -- I don't think there's enough time between now and the F33 go/no-go point for folks like me to set up and sufficiently burn-in F32 btrfs systems to gain confidence that btrfs is indeed ready. (In any case, the traditional beta period is _way_ too short for something like this!)
There is no way for one person to determine if Btrfs is ready. That's done by a combination of synthetic tests (xfstests) and volume regression testing on actual workloads. And by the way the Red Hat CKI project is going to help run btrfs xfstests for Fedora kernels.
The questions are whether the Fedora community wants and is ready for Btrfs by default.
On Mon, Jun 29, 2020 at 10:26:37AM -0600, Chris Murphy wrote:
You've got an example where 'btrfs restore' saw no files at all? And you think it's the file system rather than the hardware, why?
Because the system failed to boot up, and even after offline repair attempts was still missing a sufficiently large chunk of the root filesystem to necessitate re-installation.
Because the same hardware provided literally years of problem-free stability with ext4 (before) and xfs (after).
I think this is the wrong metaphor because it suggests btrfs caused the crapping. The sequence is: btrfs does the right thing, drive firmware craps itself and there's a power failure or a crash. Btrfs in the ordinary case doesn't care and boots without complaint. In the far
The first time, I needed to physically move the system, so the machine was shut down via 'shutdown -h now' on a console, and didn't come back up.
The second time was a routine post-dnf-update 'reboot', without power cycling anything.
At no point was there ever any unclean shutdown, and at the time of those reboots, no errors were reported in the kernel logs.
Once is a fluke, twice is a trend... and I didn't have the patience for a third try because I needed to be able to rely on the system to not eat itself.
I can't get the complete details at the moment, but it was an AMD E-350 system with a 32GB ADATA SATA, configured using anaconda's btrfs defaults and only about 30% of disk space used. Pretty minimal I/O.
I will concede that it's possible there was/is some sort of hardware/firmware bug, but if so, only btrfs seemed to trigger it.
Come on. It's cleanly unmounted and doesn't mount?
Yes. (See above)
(Granted, I'm using "mount" to mean "successfully mounted a writable filesystem with data largely intact" -- I'm a bit fuzzy on the exact details but I believe it did mount read-only before the boot crapped out due to missing/inaccessible system libraries. I had to resort to a USB stick to attempt repairs that were only partially successful.)
All file systems have write ordering expectations. If the hardware doesn't honor that, it's trouble if there's a crash. What you're describing is 100% a hardware crapped itself case. You said it cleanly unmounted i.e. the exact correct write ordering did happen. And yet the file system can't be mounted again. That's a hardware failure.
That may be the case, but when there were no crashes, and neither ext4 nor xfs crapped themselves under day-to-day operation with the same hardware, it's reasonable to infer that the problem has _something_ to do with the variable that changed, ie btrfs.
There is no way for one person to determine if Btrfs is ready. That's done by a combination of synthetic tests (xfstests) and volume regression testing on actual workloads. And by the way the Red Hat CKI project is going to help run btrfs xfstests for Fedora kernels.
Of course not, but the Fedora community is made up of innumerable "one persons" each responsible for several special snowflake systems.
Let's say for sake of argument that my bad btrfs experiences were due to bugs in device firmware with btrfs's completely-legal usage patterns rather than bugs in btrfs-from-five-years-ago. That's great... except my system still got trashed to the point of needing to be reinstalled, and finger-pointing can't bring back lost data.
How many more special snowflake drives are out there? Think about how long it took Fedora to enable TRIM out of concern for potential data loss. Why should this be any different?
(We're always going to be stuck with buggy firmware. FFS, the Samsung 860 EVO SATA SSD that I have in my main workstation will hiccup to the point of trashing data when used with AMD SATA controllers... even under Windows! Their official support answer is "Use an Intel controller". And that's a tier-one manufacturer who presumably has among the best QA and support in the industry.)
If there are devices or firmware known to be problematic, we need to keep track of them and either automatically provide workarounds or give the user some way to know that proceeding with btrfs may be perilous to their data.
(Or perhaps the issues I had were due to bugs in btrfs-of-five-years-ago that have long since been fixed. Either way, given my twice-burned experiences, I would want to verify that for myself before I entrust it with any data I care about...)
The questions are whether the Fedora community wants and is ready for Btrfs by default.
There are obviously some folks here (myself included) that have had very negative btrfs experiences. Similarly, there are folks that have successfully overseen large-scale deployments of btrfs in their managed environments (not on Fedora though, IIUC).
So yes, I think an explicit "let's all test btrfs (as anaconda configures it) before we make it default" period is warranted.
Perhaps one can argue that Fedora has already been doing that for the past two years (since 2018-or-later-btrfs is what everyone with positive results appears to be talking about), but it's still not clear that those deployments utilize the same feature set as Fedora's defaults, and how broad the hardware sample is.
- Solomon
On Mon, Jun 29, 2020 at 03:15:23PM -0400, Solomon Peachy wrote:
So yes, I think an explicit "let's all test btrfs (as anaconda configures it) before we make it default" period is warranted.
Perhaps one can argue that Fedora has already been doing that for the past two years (since 2018-or-later-btrfs is what everyone with positive results appears to be talking about), but it's still not clear that those deployments utilize the same feature set as Fedora's defaults, and how broad the hardware sample is.
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
Normally we just switch the default or we don't, without half measures. But the fs is important enough and complicated enough to be extra careful about any transitions.
Zbyszek
Hi,
On 01/07/2020 07:54, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
Normally we just switch the default or we don't, without half measures. But the fs is important enough and complicated enough to be extra careful about any transitions.
Zbyszek
Indeed, it is an important point, and taking care is very important when dealing with other people's data, which is in effect what we are discussing here.
When we looked at btrfs support in RHEL, we took quite a long time over it. In fact I'm not quite sure how long, since the process had started before I was involved, but it was not a decision that was made quickly, and a great deal of thought went into it. It was difficult to get concrete information about the stability aspects at the time. Just like the discussions that have taken place on this thread, there was a lot of anecdotal evidence, but that is not always a good indicator. Since time has passed since then, and there is now more evidence, this part of the process should be easier. That said, to get a meaningful comparison, one would ideally want to compare on the basis of user populations of similar size and technical skill level, and look not just at the overall number of bugs reported, but at the rate those bugs are being reported too.
It is often tricky to be sure of the root cause of bugs - just because a filesystem reports an error doesn't mean that it is at fault, it might be a hardware problem, or an issue with volume management. Figuring out where the real problem lies is often very time consuming work. Without that work though, the raw numbers of bugs reported can be very misleading.
It is also worth noting that when we made the decision for RHEL it was not just a question of stability, although that is obviously an important consideration. We looked at a wide range of factors, including the overall design and features. We had reached out to a number of potential users and asked them what features they wanted from their filesystems and tried to understand where we had gaps in our existing offerings. It would be worth taking that step here, and asking each of the spins what are the features that they would most like to see from the storage/fs stack. Comparing filesystems in the abstract is a difficult task, and it is much easier against a context. I know that some of the issues have already been discussed in this thread, but maybe if someone was to gather up a list of requirements from those messages then that would help to direct further discussion,
Steve.
On Wed, Jul 01, 2020 at 11:28:10AM +0100, Steven Whitehouse wrote:
Hi,
Indeed, it is an important point, and taking care is very important when dealing with other people's data, which is in effect what we are discussing here.
When we looked at btrfs support in RHEL, we took quite a long time over it. In fact I'm not quite sure how long, since the process had started before I was involved, but it was not a decision that was made quickly, and a great deal of thought went into it. It was difficult to get concrete information about the stability aspects at the time. Just like the discussions that have taken place on this thread, there was a lot of anecdotal evidence, but that is not always a good indicator. Since time has passed since then, and there is now more evidence, this part of the process should be easier. That said, to get a meaningful comparison, one would ideally want to compare on the basis of user populations of similar size and technical skill level, and look not just at the overall number of bugs reported, but at the rate those bugs are being reported too.
Yeah. I have no doubt that the decision was made carefully back then. That said, time has passed, and btrfs has evolved and our use cases have evolved too, so a fresh look is good.
We have https://fedoraproject.org/wiki/Changes/DNF_Better_Counting, maybe this could be used to collect some statistics about the fs type too.
It is often tricky to be sure of the root cause of bugs - just because a filesystem reports an error doesn't mean that it is at fault, it might be a hardware problem, or an issue with volume management. Figuring out where the real problem lies is often very time consuming work. Without that work though, the raw numbers of bugs reported can be very misleading.
It would be worth taking that step here, and asking each of the spins what are the features that they would most like to see from the storage/fs stack. Comparing filesystems in the abstract is a difficult task, and it is much easier against a context. I know that some of the issues have already been discussed in this thread, but maybe if someone was to gather up a list of requirements from those messages then that would help to direct further discussion,
Actually that part has been answered pretty comprehensively. The split between / and /home is hurting users and we completely sidestep it with this change. The change page lists a bunch of other benefits, incl. better integration with the new resource allocation mechanisms we have with cgroups2. So in a way this is a follow-up to the cgroupsv2-by-default change in F31. Snapshots and subvolumes also give additional powers to systemd-nspawn and other tools. I'd say that the huge potential of btrfs is clear. It's the possibility of the loss of stability that is my (and others') worry and the thing which is hard to gauge.
Zbyszek
Hi,
On 01/07/2020 12:09, Zbigniew Jędrzejewski-Szmek wrote:
Yeah. I have no doubt that the decision was made carefully back then. That said, time has passed, and btrfs has evolved and our use cases have evolved too, so a fresh look is good.
We have https://fedoraproject.org/wiki/Changes/DNF_Better_Counting, maybe this could be used to collect some statistics about the fs type too.
Yes, and also the questions that Fedora is trying to answer are different too. So I don't think that our analysis for RHEL is applicable here in general. The method that we went through, in general terms, may potentially be helpful.
It is often tricky to be sure of the root cause of bugs - just because a filesystem reports an error doesn't mean that it is at fault, it might be a hardware problem, or an issue with volume management. Figuring out where the real problem lies is often very time consuming work. Without that work though, the raw numbers of bugs reported can be very misleading. It would be worth taking that step here, and asking each of the spins what are the features that they would most like to see from the storage/fs stack. Comparing filesystems in the abstract is a difficult task, and it is much easier against a context. I know that some of the issues have already been discussed in this thread, but maybe if someone was to gather up a list of requirements from those messages then that would help to direct further discussion,
Actually that part has been answered pretty comprehensively. The split between / and /home is hurting users and we completely sidestep it with this change. The change page lists a bunch of other benefits, incl. better integration with the new resource allocation mechanisms we have with cgroups2. So in a way this is a follow-up to the cgroupsv2-by-default change in F31. Snapshots and subvolumes also give additional powers to systemd-nspawn and other tools. I'd say that the huge potential of btrfs is clear. It's the possibility of the loss of stability that is my (and others') worry and the thing which is hard to gauge.
Zbyszek
If the / and /home split is the main issue, then dm-thin might be an alternative solution, and we should check to see if some of the issues listed on the change page have been addressed. I'm copying in Jon for additional comment on that. Are those btrfs benefits which are listed on the change page in priority order?
File system resize is mentioned there, but pretty much all local filesystems support grow. Also, no use cases are listed for that benefit. Shrink is more tricky, and can easily result in poor file layouts, particularly if there are repeated grow/shrink operations, not to mention potential complications with NFS if the fs is exported. So is there some specific use case there that cannot be supported easily with the existing tools? There are a few other features listed that are available in other fs/volume management tools as well.
Eric has already pointed out that XFS has cgroups2 support, so the statement that btrfs is the only fs with that is incorrect. It would help to make things a bit clearer if that list was updated, with the information gathered so far,
Steve.
On 7/1/20 7:49 AM, Steven Whitehouse wrote:
Hi,
If the / and /home split is the main issue, then dm-thin might be an alternative solution, and we should check to see if some of the issues listed on the change page have been addressed. I'm copying in Jon for additional comment on that. Are those btrfs benefits which are listed on the change page in priority order?
File system resize is mentioned there, but pretty much all local filesystems support grow. Also, no use cases are listed for that benefit. Shrink is more tricky, and can easily result in poor file layouts, particularly if there are repeated grow/shrink operations, not to mention potential complications with NFS if the fs is exported. So is there some specific use case there that cannot be supported easily with the existing tools? There are a few other features listed that are available in other fs/volume management tools as well.
Eric has already pointed out that XFS has cgroups2 support, so the statement that btrfs is the only fs with that is incorrect. It would help to make things a bit clearer if that list was updated, with the information gathered so far,
Yeah that should be changed.
There's a big gap between having cgroups2 support and it actually working. The thing that I've said consistently is that there's nothing keeping XFS from working with cgroups2, it's just that we (Facebook) haven't tested it, because at the time we were rolling it out it didn't have writeback support.
Even btrfs with writeback support enabled still required a few investigations and follow up work to get everything working properly, because you don't ever know what's going to break until you actually use it. So while XFS technically has support, Btrfs is the only fs that we use cgroup2 with IO isolation in production, so it's the only thing we're comfortable with. XFS may work perfectly fine, but AFAIK nobody has ever tested it or used it in production. Thanks,
Josef
On 7/1/20 9:24 AM, Josef Bacik wrote:
On 7/1/20 7:49 AM, Steven Whitehouse wrote:
Eric has already pointed out that XFS has cgroups2 support, so the statement that btrfs is the only fs with that is incorrect. It would help to make things a bit clearer if that list was updated, with the information gathered so far,
Yeah that should be changed.
There's a big gap between having cgroups2 support and it actually working.
Well, that's why dchinner pushed back on the first patches from FB; there was no way for us or any other filesystem to validate that it worked, or would continue to work.
The thing that I've said consistently is that there's nothing keeping XFS from working with cgroups2, it's just that we (Facebook) haven't tested it, because at the time we were rolling it out it didn't have writeback support.
Even btrfs with writeback support enabled still required a few investigations and follow up work to get everything working properly, because you don't ever know what's going to break until you actually use it. So while XFS technically has support, Btrfs is the only fs that we use cgroup2 with IO isolation in production, so it's the only thing we're comfortable with. XFS may work perfectly fine, but AFAIK nobody has ever tested it or used it in production. Thanks,
The work was done by Christoph and was sponsored and tested by Profihost AG, who presumably use it in production.
-Eric
On Wed, Jul 1, 2020 at 5:49 AM Steven Whitehouse swhiteho@redhat.com wrote:
If the / and /home split is the main issue, then dm-thin might be an alternative solution, and we should check to see if some of the issues listed on the change page have been addressed. I'm copying in Jon for additional comment on that. Are those btrfs benefits which are listed on the change page in priority order?
They are of equal priority, from the perspective of both feature owners and the working group, based on many months of discussion. Individual users definitely have their own priorities, that also vary. There is perhaps an emphasis on solving /home and / free space competition, because it is one of the most pernicious issues that really leaves users on an island to fend for themselves in order to resolve it.
Importantly, dm-thin doesn't fix this problem by avoiding it in the first place, which Btrfs does. On dm-thin, the user must still identify which file system is out of space, and grow the file system. Once file systems are either snapshotted or overprovisioned, the only arbiter of used and free space truth is the thin pool. File system sizes become virtual, which CLI and GUI apps are currently unprepared to deal with. That sets up a prerequisite solution before anything dm-thin based could be used, because having fantasy free space reporting is objectively a UX regression.
The transparent compression feature is perhaps understated. It's configurable per directory and per file. That includes algorithm selection. A future feature includes configurable compression level in the XATTR. The simplest use case is selective compression of high value targets like /usr and flatpaks. Future feature ideas include user selection of directories, and UI that shows compression efficacy.
Reflinks are permitted between Btrfs subvolumes, whereas neither reflinks nor hard links are possible between dm-thin snapshots. One use case is cheaply restoring individual files from snapshots. Also, thin snapshots currently pin the file system journal inside, making them rather expensive in terms of space consumption, compared to Btrfs snapshots. Again, the cost of thin snapshots is only revealed by the thin pool, not the file system. Whereas on Btrfs 'df' and friends are expected to properly report free space, and they do.
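To make the "cheaply restoring individual files from snapshots" use case concrete, here is a minimal Python sketch, not anything taken from the change proposal. It asks the kernel to clone a file with the Linux FICLONE ioctl and falls back to an ordinary copy when the file system refuses. The function name `restore_file` and both paths are hypothetical, and the hard-coded request number assumes the common Linux ioctl encoding (x86, ARM); some architectures encode it differently.

```python
import fcntl
import shutil

# FICLONE request number, _IOW(0x94, 9, int), assuming the common Linux
# ioctl encoding used on x86 and ARM.
FICLONE = 0x40049409


def restore_file(snapshot_path, live_path):
    """Copy snapshot_path to live_path, sharing extents when possible.

    Returns "reflink" if the clone ioctl succeeded, otherwise "copy".
    Both names are hypothetical; any two paths on one file system work.
    """
    with open(snapshot_path, "rb") as src, open(live_path, "wb") as dst:
        try:
            # Ask the file system to make dst share src's extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            # No reflink support here (or source and destination are on
            # different file systems): do a plain byte-for-byte copy.
            shutil.copyfileobj(src, dst)
            return "copy"
```

On a file system with reflink support, the "reflink" path should consume almost no additional data space, which is the point being made about restoring from snapshots; everywhere else the sketch simply degrades to a normal copy.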
Also, cgroup2 developers report that the IO isolation features of any file system are lost on anything device-mapper based. And while that work is in progress, it's not there yet.
Integrity checking is highly valued by some and less by others. Considering that we know hardware isn't 100% reliable, and doesn't always report its own failures as expected, and hence why most file systems now at least checksum metadata, it's not persuasive to me that the data should be left unchecked, and corruption ought to be handled by user space somehow.
File system resize is mentioned there, but pretty much all local filesystems support grow. Also, no use cases are listed for that benefit.
Windows and macOS have supported online shrink and grow for more than a decade. While it doesn't often come up on the desktop, if you don't have it and need it, it's aggravating.
The typical use case today is to reprovision a system with an additional or eventual substitute OS, without first having to destroy another. I'd call it rare. But it's also essentially expected.
The much more common use case is for systemd-homed for managing authentication and user homes, including when encrypted. It's not decided whether to integrate sd-homed but it supports multiple storage types. One of those storage types, LUKS on loop, does effectively depend on file system shrink capability. While the use case for Fedora is mainly single user, and to optimize for that case, it is not exclusively single user so the chosen solution shouldn't cause difficult regressions. And we get a number of free and used space knock on effects here if the file system can't do online shrink. Is lack of online shrink disqualifying? No, but having it significantly improves UX. So whether LUKS or future fscrypt/Btrfs encryption, this road points to Btrfs.
Yes, we could drop LVM and go with one big file system, that too was discussed. The main knock on effect there is a significant minority of users want to do a clean install of Fedora from time to time while preserving user home. The installer permits this behavior with LVM layouts, and Btrfs by only requiring a new root subvolume be created for mounting at /.
It doesn't mean Btrfs applies to 100% of Fedora use cases. No single layout does. But Btrfs consistently solves more problems than it causes in knock-on effects. This doesn't make the alternatives bad. It just leaves a variety of problems unsolved. That too isn't inherently bad or disqualifying, but it's an opportunity that begs for a more complete picture.
Shrink is more tricky, and can easily result in poor file layouts, particularly if there are repeated grow/shrink operations, not to mention potential complications with NFS if the fs is exported.
A systemd-homed workflow suggests some cases where there will be many grow/shrink operations. If there are two or three active users, this might mean several grow/shrink operations per day. So it would need to be a file system explicitly designed with this in mind, including no negative locality knock-on effects. Only Btrfs meets this use case requirement without knock-on effects. Its metadata has no fixed layout; it's written dynamically, so it doesn't suffer from the poor layout problem other file systems do.
Eric has already pointed out that XFS has cgroups2 support, so the statement that btrfs is the only fs with that is incorrect. It would help to make things a bit clearer if that list was updated, with the information gathered so far,
Updated. I've asked a number of cgroups2 kernel developers about this and they've consistently told me that they know Btrfs does it correctly, ext4 has priority inversions, and they don't know about XFS.
Chris Murphy
On 7/1/20 12:50 PM, Chris Murphy wrote:
...
Integrity checking is highly valued by some and less by others. Considering that we know hardware isn't 100% reliable, and doesn't always report its own failures as expected, and hence why most file systems now at least checksum metadata, it's not persuasive to me that the data should be left unchecked, and corruption ought to be handled by user space somehow.
There's a flip side to this coin - in my experience, if the right btrfs metadata blocks experience this disk corruption, there can be a complete inability to recover the btrfs filesystem from that error - i.e. it won't mount, and btrfsck --repair won't get it to a mountable state.
So if we're saying disk corruption happens often enough that data checksumming is critical, then it happens often enough that metadata recovery is at least as critical.
I've been trying to quantify this and have not come up with a particularly compelling test scenario, because it involves purposefully (though at random) corrupting enough blocks on a filesystem image that a critical block gets hit, so it looks synthetic. But the net result is frequently a filesystem where btrfsck and/or mount fails, and at first blush this type of failure happens much more often than on other filesystems.[1]
I think Josef has alluded to this situation as well. To me, that's a big concern. Not trying to be a wet blanket here but I think this needs to be carefully investigated and evaluated to understand what impact it may have on Fedora btrfs users and their ability to recover their data in the face of metadata corruption, because it looks to me like a definite btrfs weak spot.
-Eric
[1] some details - I used the mangle.c fuzzer from fsfuzzer, and modified it so that it corrupts 8192 bytes of an image, which in fs terms can be up to 8192 filesystem blocks. I also avoided the first 4k so that any filesystem signature was not damaged.
I then ran a loop where I created a 1G base image, populated it, fuzzed it in this way, (so up to 3% of blocks were damaged) and ran the filesystem's fsck utility (in btrfs' case, btrfsck --repair) and then tried to mount (in btrfs' case, with bare mount, then -o usebackuproot if mount failed). If it mounted, I used "find | wc" to see how many files were reachable vs the original image.
If either fsck or mount reports an exit code that reflects failure to complete properly, I recorded that.
It was a quick hack, and it's not beautiful, so there are probably holes to be poked in it; if you want to look, I threw the bash script and the C source up at https://people.redhat.com/esandeen/fsckfuzzer/
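For readers who do not want to open the script, the corruption step and the "up to 3%" figure can be sketched in a few lines of Python. This is only an illustration of the description above, not the modified mangle.c: picking distinct offsets, flipping bits with XOR, the function names, and the 4 KiB block size are all assumptions.

```python
import random


def corrupt_image(image, n_bytes=8192, skip=4096, seed=None):
    """Return (damaged_copy, offsets) for an in-memory image.

    n_bytes distinct offsets past the first `skip` bytes are damaged, so
    the leading 4k (and any signature stored there) stays intact.
    """
    rng = random.Random(seed)
    data = bytearray(image)
    offsets = rng.sample(range(skip, len(data)), n_bytes)
    for off in offsets:
        # XOR with a nonzero value so the byte is guaranteed to change.
        data[off] ^= rng.randrange(1, 256)
    return bytes(data), offsets


def max_damaged_fraction(image_size, block_size=4096, n_bytes=8192):
    """Upper bound on damaged blocks: every bad byte hits a different block."""
    total_blocks = image_size // block_size
    return min(n_bytes, total_blocks) / total_blocks


# 1 GiB image with 4 KiB blocks: 8192 / 262144 blocks, i.e. about 3%.
print(max_damaged_fraction(1 << 30))
```

The sketch holds the whole image in memory, which is fine for a small test image but not how one would treat a real 1G file; a real run would seek and write in place.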
Running 10 loops on each of btrfs, ext4, and xfs I got results that look like this (ext4 always creates empty lost+found so it will always find at least 1 file there)
btrfs
fsck failed
0 files in lost+found, 628 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
526 files in lost+found, 9 files gone/unreachable
595 files in lost+found, 55 files gone/unreachable
53 files in lost+found, 8 files gone/unreachable
57 files in lost+found, 44 files gone/unreachable
fsck failed
7 files in lost+found, 1491 files gone/unreachable
fsck failed, mount failed
fsck failed, mount failed
88 files in lost+found, 40 files gone/unreachable
== 4 fsck failures, 2 mount failures
ext4
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
164 files in lost+found, 2 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 1 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
9 files in lost+found, 1 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
== 0 fsck failures, 0 mount failures
xfs
0 files in lost+found, 1 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
958 files in lost+found, 629 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
2 files in lost+found, 0 files gone/unreachable
0 files in lost+found, 1 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
8 files in lost+found, 1 files gone/unreachable
3 files in lost+found, -1 files gone/unreachable
== 0 fsck failures, 0 mount failures
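The per-run lines above can be tallied mechanically. The following Python sketch is not part of the original test harness: the `tally` helper and the idea of summing the "gone/unreachable" counts are additions for illustration, and the sum only covers runs where the file system mounted. The data is copied from the results as posted.

```python
import re

RESULTS = {
    "btrfs": """fsck failed
0 files in lost+found, 628 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
526 files in lost+found, 9 files gone/unreachable
595 files in lost+found, 55 files gone/unreachable
53 files in lost+found, 8 files gone/unreachable
57 files in lost+found, 44 files gone/unreachable
fsck failed
7 files in lost+found, 1491 files gone/unreachable
fsck failed, mount failed
fsck failed, mount failed
88 files in lost+found, 40 files gone/unreachable""",
    "ext4": """1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
164 files in lost+found, 2 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 1 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
9 files in lost+found, 1 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable
1 files in lost+found, 0 files gone/unreachable""",
    "xfs": """0 files in lost+found, 1 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
958 files in lost+found, 629 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
2 files in lost+found, 0 files gone/unreachable
0 files in lost+found, 1 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
0 files in lost+found, 0 files gone/unreachable
8 files in lost+found, 1 files gone/unreachable
3 files in lost+found, -1 files gone/unreachable""",
}


def tally(output):
    """Count failures and sum unreachable files over the runs that mounted."""
    gone = [int(n) for n in re.findall(r"(-?\d+) files gone/unreachable", output)]
    return {
        "fsck_failures": len(re.findall(r"fsck failed", output)),
        "mount_failures": len(re.findall(r"mount failed", output)),
        "mounted_runs": len(gone),
        "files_gone": sum(gone),
    }


for fs, text in RESULTS.items():
    print(fs, tally(text))
```

With only ten runs per file system, the totals are an aid to reading the lists rather than a statistic to lean on.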
On Thursday, 2 July 2020 21.38.46 WEST Eric Sandeen wrote:
3 files in lost+found, -1 files gone/unreachable
This last line from the xfs test seems suspicious (the -1 file gone). :-)
On 7/2/20 3:58 PM, José Abílio Matos wrote:
On Thursday, 2 July 2020 21.38.46 WEST Eric Sandeen wrote:
3 files in lost+found, -1 files gone/unreachable
This last line from the xfs test seems suspicious (the -1 file gone). :-)
It is weird, but it shows I didn't fudge the numbers ;)
directory repair may have inadvertently created a file or something, not sure.
-Eric
On 7/2/20 4:38 PM, Eric Sandeen wrote:
On 7/1/20 12:50 PM, Chris Murphy wrote:
...
Integrity checking is highly valued by some and less by others. Considering that we know hardware isn't 100% reliable, and doesn't always report its own failures as expected, and hence why most file systems now at least checksum metadata, it's not persuasive to me that the data should be left unchecked, and corruption ought to be handled by user space somehow.
There's a flip side to this coin - in my experience, if the right btrfs metadata blocks experience this disk corruption, there can be a complete inability to recover the btrfs filesystem from that error - i.e. it won't mount, and btrfsck --repair won't get it to a mountable state.
So if we're saying disk corruption happens often enough that data checksumming is critical, then it happens often enough that metadata recovery is at least as critical.
I've been trying to quantify this and have not come up with a particularly compelling test scenario, because it involves purposefully (though at random) corrupting enough blocks on a filesystem image that a critical block gets hit, so it looks synthetic. But the net result is frequently a filesystem where btrfsck and/or mount fails, and at first blush this type of failure happens much more often than on other filesystems.[1]
I think Josef has alluded to this situation as well. To me, that's a big concern. Not trying to be a wet blanket here but I think this needs to be carefully investigated and evaluated to understand what impact it may have on Fedora btrfs users and their ability to recover their data in the face of metadata corruption, because it looks to me like a definite btrfs weak spot.
Yeah this is what I've said many times over the last 3 weeks. Btrfs is more vulnerable to metadata corruption.
Now there are things that we can do to mitigate this. I have one patch up to handle one of the main cases (a corrupt global tree). The next patch set will be to keep entire metadata trees around for longer, as long as we have space to handle it. These two things will drastically improve the situation, but of course if I'm being evil we can still end up in a bad spot. These patches are not hard or controversial, they'll likely land in 5.9 which will be what F33 ships with (if I'm doing my math right).
And this sort of ignores the other side of the coin. fsfuzzer isn't just corrupting metadata, it's corrupting data. Btrfs is the only file system that's going to notice that and let the user know.
Checksumming is great because it lets the user know things are going wrong before they go catastrophically wrong. However just because we know something went wrong doesn't mean we can do anything about it, it just means that the user knows now that they need to restore from backups and find a new drive. These features do not mean you are absolved of good practices. If you care about data, you need to have it in multiple places. End of story. Btrfs is just going to let you know in advance that things are going wrong.
We're talking about this issue like it's reasonable that xfs and ext4 are going to allow the user to get back a bunch of data they don't know is ok or not. We're also talking about it like the user should be able to carry on his happy merry way. In these cases the drive is dying and needs to be shredded, and a new install needs to happen and a restore from backups needs to happen. Is the btrfs failure much less user friendly? No doubt about it. Is it any comfort at all when a user shows up and we say "where are your backups" and they say "what backups?", no. But if we're going to talk about this like ext4 and xfs are much better because they give you the _appearance_ that your data is fine, that's a bit disingenuous.
"Well what if it was just /usr." Sure, then you got lucky and you could copy things off. But what if it wasn't? That's the measure that's being applied to btrfs here. Is it likely that random corruption is going to be so bad that you end up with an unmountable file system? It's about as likely that the random corruption is on your dissertation or your family photographs. The difference is that btrfs will tell you that your dissertation or your family photographs are now bad, whereas ext4 and xfs will not.
These are tradeoffs no doubt. Every file system choice is a series of trade offs. We're arguing/optimizing for the narrowest usecase. Arguments can be made either way, but in the end is it important enough to not move ahead with btrfs? Thanks,
Josef
On 7/2/20 4:44 PM, Josef Bacik wrote:
We're talking about this issue like it's reasonable that xfs and ext4 are going to allow the user to get back a bunch of data they don't know is ok or not. We're also talking about it like the user should be able to carry on his happy merry way. In these cases the drive is dying and needs to be shredded, and a new install needs to happen and a restore from backups needs to happen. Is the btrfs failure much less user friendly? No doubt about it. Is it any comfort at all when a user shows up and we say "where are your backups" and they say "what backups?", no. But if we're going to talk about this like ext4 and xfs are much better because they give you the _appearance_ that your data is fine, that's a bit disingenuous.
If I had talked about it like that, it would have been disingenuous.
But I didn't; this was an investigation of resiliency to metadata corruption, not data error detection, and to what degree metadata corruption can render files or even entire filesystems unreachable after normal administrative recovery efforts.
-Eric
Yeah I mean the general discussion, not you specifically. Thanks,
Josef
On Thu, Jul 2, 2020 at 8:38 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/2/20 4:44 PM, Josef Bacik wrote:
We're talking about this issue like it's reasonable that xfs and ext4 are going to allow the user to get back a bunch of data they don't know is ok or not. We're also talking about it like the user should be able to carry on his happy merry way. In these cases the drive is dying and needs to be shredded, and a new install needs to happen and a restore from backups needs to happen. Is the btrfs failure much less user friendly? No doubt about it. Is it any comfort at all when a user shows up and we say "where are your backups" and they say "what backups?", no. But if we're going to talk about this like ext4 and xfs are much better because they give you the _appearance_ that your data is fine, that's a bit disingenuous.
If I had talked about it like that, it would have been disingenuous.
But I didn't; this was an investigation of resiliency to metadata corruption, not data error detection, and to what degree metadata corruption can render files or even entire filesystems unreachable after normal administrative recovery efforts.
-Eric
On Thursday, 2 July 2020 at 17:44 -0400, Josef Bacik wrote:
However just because we know something went wrong doesn't mean we can do anything about it, it just means that the user knows now that they need to restore from backups
That’s a perfect answer for an Enterprise server setup with systematic backup/restore procedures.
For workstations? Even in an Enterprise context? Not so much.
Regards,
On 7/2/20 4:38 PM, Eric Sandeen wrote:
Running 10 loops on each of btrfs, ext4, and xfs I got results that look like this (ext4 always creates empty lost+found so it will always find at least 1 file there)
btrfs ... == 4 fsck failures, 2 mount failures
ext4 ... == 0 fsck failures, 0 mount failures
xfs ... == 0 fsck failures, 0 mount failures
Did you check the content of the filesystem, to make sure that the files restored by fsck are actually correct?
I think ext4/xfs may be showing 0 files lost but they may or may not contain the pre-damage content, while btrfs would just fess up that it lost them if the checksums didn't agree.
On Wed, 1 Jul 2020 at 07:19, Zbigniew Jędrzejewski-Szmek zbyszek@in.waw.pl wrote:
Yeah. I have no doubt that the decision was made carefully back then. That said, time has passed, and btrfs has evolved and our use cases have evolved too, so a fresh look is good.
We have https://fedoraproject.org/wiki/Changes/DNF_Better_Counting, maybe this could be used to collect some statistics about the fs type too.
I am going to try and nip this one in the bud right here. DNF counting is NOT the place to do this.
Starting to collect such information is a slippery slope, and the more you collect the harder it is to remove personally identifiable data. If this sort of data is going to be collected it needs to be done by a specific program that does this, which can be audited, which can be deleted, and which can be 'cleaned' to meet GDPR and other rules. The DNF counting works because all it does is give a 'better guess' on what is going on by randomly burping a countme over a week (if countme is turned on). The data it collects in the end is not absolute but fuzzy; it is just supposedly a better fuzzy than the previous guesses.
The other information gathered in that transaction is stuff that already has a business need to work: our servers need to know what architecture, what release, and what IP to send the appropriate mirrorlist back to. We also need to keep a log of the transaction to debug why XYZ proxy decided to send the wrong thing or some other issue.
Mirrormanager does not have a need to know what filesystem you are using, it does not need to know what exact CPU, memory amount, or a bunch of other things which would be useful for the project. Even something like the packages you have installed which is sort of closely aligned with a business need that other distros do collect, it does not collect and adding it now would be a headache.
If the project wants this, then someone needs to make a smolt replacement but with some people who truly understand privacy programming well enough to not end up with a landmine field.
On Wed, Jul 01, 2020 at 08:48:57AM -0400, Stephen John Smoogen wrote:
We have https://fedoraproject.org/wiki/Changes/DNF_Better_Counting, maybe this could be used to collect some statistics about the fs type too.
I am going to try and nip this one in the bud right here. DNF counting is NOT the place to do this.
Yeah -- that feature is explicitly limited. I know Christian is interested in a system-information-collection system developed by Endless Computing as presented at GUADEC ... was that just last year? Sometime. (What is time anyway?)
On Wednesday, 1 July 2020 at 11:09 +0000, Zbigniew Jędrzejewski-Szmek wrote:
Actually that part has been answered pretty comprehensively. The split between / and /home is hurting users
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
Regards,
On Wed, Jul 1, 2020 at 10:26 AM Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
On Wednesday, 1 July 2020 at 11:09 +0000, Zbigniew Jędrzejewski-Szmek wrote:
Actually that part has been answered pretty comprehensively. The split between / and /home is hurting users
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
Anaconda does this behavior correctly with btrfs with / and /home as separate subvolumes on the same btrfs volume. So the preservation of /home would still work, while we get flexible storage allocation at the volume level.
On Wednesday, 1 July 2020 at 10:27 -0400, Neal Gompa wrote:
On Wed, Jul 1, 2020 at 10:26 AM Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
On Wednesday, 1 July 2020 at 11:09 +0000, Zbigniew Jędrzejewski-Szmek wrote:
Actually that part has been answered pretty comprehensively. The split between / and /home is hurting users
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
Anaconda does this behavior correctly with btrfs with / and /home as separate subvolumes on the same btrfs volume. So the preservation of /home would still work, while we get flexible storage allocation at the volume level.
That only works as long as btrfs is the next shiny thing, or as long as no one decides the options Fedora used to create btrfs volumes with are crap and it’s a good idea to recreate them with new better options.
(btrfs may offer migration to new volume options like ext4 did in the past, I still would not want anaconda to touch my existing user data volumes, especially when doing emergency reinstalls because the kernel, systemd, glibc or any other core component crapped itself).
Regards,
On Wed, Jul 01, 2020 at 04:25:31PM +0200, Nicolas Mailhot via devel wrote:
On Wednesday, 1 July 2020 at 11:09 +0000, Zbigniew Jędrzejewski-Szmek wrote:
Actually that part has been answered pretty comprehensively. The split between / and /home is hurting users
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
The whole point of the btrfs change is to keep / and /home on separate subvolumes to avoid the anaconda requirement to reformat / from affecting /home while also avoiding the problem of running out of space on one while still having tons of free space on the other.
On Wed, Jul 1, 2020 at 4:25 pm, Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
So for the avoidance of doubt: if the btrfs change is rejected, we are almost certain to put everything on the same mount point. We haven't approved this yet, but odds are very high IMO. The options we are seriously considering for our default going forward are (a) btrfs, (b) failing that, probably ext4 all one big partition without LVM, (c) less-likely, maybe xfs all one big partition without LVM. This is being discussed in https://pagure.io/fedora-workstation/issue/152
We have a high number of complaints from developers running out of space on / with plenty of space left on /home (happens to me all the time). The opposite scenario is a problem too. Separate mountpoints by default is just not a good default, sorry. Ensuring users don't run out of space due to bad partitioning is more important than keeping /home during reinstall IMO. But with btrfs, then /home will just be a subvolume so we can have our cake and eat it too.
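The "out of space on / with plenty left on /home" situation is easy to detect from user space. The Python sketch below is only an illustration: the helper names and the 1 GiB low-water mark are arbitrary assumptions, not anything Fedora ships. On a layout where / and /home share one pool of free space, both paths should report the same free figure, so the condition cannot arise.

```python
import shutil


def free_bytes(path):
    """Free space available on the file system holding `path`."""
    return shutil.disk_usage(path).free


def is_stranded(free_root, free_home, low_water=1 << 30):
    """True when exactly one of the two file systems is below low_water.

    That is the case complained about: one mount point is nearly full
    while the other still has room that cannot be used.
    """
    return (free_root < low_water) != (free_home < low_water)


if __name__ == "__main__":
    # Checking "/" alone keeps the demo runnable on machines without /home.
    print("free on /:", free_bytes("/"))
```

For example, `is_stranded(100 << 20, 50 << 30)` is true (100 MiB left on one side, 50 GiB on the other), while two equal figures always give false.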
On 2020-07-01 18:53, Michael Catanzaro wrote:
The options we are seriously considering for our default going forward are (a) btrfs, (b) failing that, probably ext4 all one big partition without LVM, (c) less-likely, maybe xfs all one big partition without LVM. This is being discussed in https://pagure.io/fedora-workstation/issue/152
One partition without LVM? Maybe labeling this partition C:\
The real solution would be to make wise usage of LVM, for example by not allocating 100% of the extents at the beginning (or even dm-thin) and/or using filesystems where a shrink is supported (I'm here blaming xfs for not having this, while ext4 has).
On Wed, Jul 1, 2020 at 11:01 pm, Roberto Ragusa mail@robertoragusa.it wrote:
The real solution would be to make wise usage of LVM, for example by not allocating 100% of the extents at the beginning (or even dm-thin) and/or using filesystems where a shrink is supported (I'm here blaming xfs for not having this, while ext4 has).
Leaving space unallocated doesn't gain us anything because the user still has to manually resize both logical volumes and the partitions inside them. Our default needs to be something that doesn't require users to resize partitions.
On 2020-07-01 23:04, Michael Catanzaro wrote:
On Wed, Jul 1, 2020 at 11:01 pm, Roberto Ragusa mail@robertoragusa.it wrote:
The real solution would be to make wise usage of LVM, for example by not allocating 100% of the extents at the beginning (or even dm-thin) and/or using filesystems where a shrink is supported (I'm here blaming xfs for not having this, while ext4 has).
Leaving space unallocated doesn't gain us anything because the user still has to manually resize both logical volumes and the partitions inside them. Our default needs to be something that doesn't require users to resize partitions.
But those are things that can be done in a few seconds with one or two commands. Attempts to make easy things easier lead to making other things difficult: some not so inexperienced users will find themselves with their disk having only one big partition, no LVM, everything inside (system+data) and trying to decipher the suggestion found on a forum "with btrfs you can sort of format / without losing /home even if you do not have separate partitions".
On 7/1/20 11:53 AM, Michael Catanzaro wrote:
On Wed, Jul 1, 2020 at 4:25 pm, Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
So for the avoidance of doubt: if the btrfs change is rejected, we are almost certain to put everything on the same mount point. We haven't approved this yet, but odds are very high IMO. The options we are seriously considering for our default going forward are (a) btrfs, (b) failing that, probably ext4 all one big partition without LVM, (c) less-likely, maybe xfs all one big partition without LVM. This is being discussed in https://pagure.io/fedora-workstation/issue/152
We have a high number of complaints from developers running out of space on / with plenty of space left on /home (happens to me all the time). The opposite scenario is a problem too. Separate mountpoints by default is just not a good default, sorry. Ensuring users don't run out of space due to bad partitioning is more important than keeping /home during reinstall IMO. But with btrfs, then /home will just be a subvolume so we can have our cake and eat it too.
This can be mitigated with directory (project) quotas, btw.
On XFS, exceeding a directory tree quota even yields ENOSPC. (on ext4, it's EDQUOT right now.)
So one big / partition including /home, with a directory quota set on /home at 20G, will yield ENOSPC when home contains 20G and will not allow / to get filled with user files.
It's also trivial to adjust the directory quota on /home up or down, as needed.
It's another cake eating-and-having option which is a pretty trivial thing to implement.
-Eric
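From an application's point of view, the difference between the two quota behaviours mentioned above is just which errno a failed write carries. A small Python sketch, with hypothetical helper names, of how a program might tell them apart:

```python
import errno


def describe_write_failure(exc):
    """Map the errno of a failed write to a short human-readable reason."""
    if exc.errno == errno.ENOSPC:
        # Genuinely full file system, or (per the message above) an XFS
        # directory tree quota that has been reached.
        return "no space left"
    if exc.errno == errno.EDQUOT:
        # Quota exceeded, which is what ext4 is said to report right now.
        return "quota exceeded"
    return "other error"


def append_with_diagnosis(path, data):
    """Append bytes to `path`; return "ok" or the diagnosed failure reason."""
    try:
        with open(path, "ab") as f:
            f.write(data)
        return "ok"
    except OSError as exc:
        reason = describe_write_failure(exc)
        if reason == "other error":
            raise  # unrelated failures (permissions, missing dir) propagate
        return reason
```

Programs that only know how to handle ENOSPC would treat the XFS behaviour exactly like a full disk, which is presumably why it is called out as the friendlier of the two.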
On Wed, Jul 1, 2020 at 5:06 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/1/20 11:53 AM, Michael Catanzaro wrote:
On Wed, Jul 1, 2020 at 4:25 pm, Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
So for the avoidance of doubt: if the btrfs change is rejected, we are almost certain to put everything on the same mount point. We haven't approved this yet, but odds are very high IMO. The options we are seriously considering for our default going forward are (a) btrfs, (b) failing that, probably ext4 all one big partition without LVM, (c) less-likely, maybe xfs all one big partition without LVM. This is being discussed in https://pagure.io/fedora-workstation/issue/152
We have a high number of complaints from developers running out of space on / with plenty of space left on /home (happens to me all the time). The opposite scenario is a problem too. Separate mountpoints by default is just not a good default, sorry. Ensuring users don't run out of space due to bad partitioning is more important than keeping /home during reinstall IMO. But with btrfs, then /home will just be a subvolume so we can have our cake and eat it too.
This can be mitigated with directory (project) quotas, btw.
On XFS, exceeding a directory tree quota even yields ENOSPC. (on ext4, it's EDQUOT right now.)
So one big / partition including /home, with a directory quota set on /home at 20G, will yield ENOSPC when home contains 20G and will not allow / to get filled with user files.
It's also trivial to adjust the directory quota on /home up or down, as needed.
It's another cake eating-and-having option which is a pretty trivial thing to implement.
This does not solve the "Anaconda will blow away /home because it's technically part of /" problem, though. Btrfs subvolumes do.
Directory quotas only protect against space contention, and while Btrfs quotas do the same thing, we're deliberately not proposing setting those up because we want space allocation to be flexible.
On 7/1/20 4:08 PM, Neal Gompa wrote:
On Wed, Jul 1, 2020 at 5:06 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/1/20 11:53 AM, Michael Catanzaro wrote:
On Wed, Jul 1, 2020 at 4:25 pm, Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
Actually this split is a godsend because you can convince anaconda to leave your home alone when reinstalling, while someone always seems to invent a new Fedora change that justifies the reformatting of /.
Good luck dealing with user data the next time workstation (or any other group) feels the / filesystem should change, once you've put user data on the same mount point
So for the avoidance of doubt: if the btrfs change is rejected, we are almost certain to put everything on the same mount point. We haven't approved this yet, but odds are very high IMO. The options we are seriously considering for our default going forward are (a) btrfs, (b) failing that, probably ext4 all one big partition without LVM, (c) less-likely, maybe xfs all one big partition without LVM. This is being discussed in https://pagure.io/fedora-workstation/issue/152
We have a high number of complaints from developers running out of space on / with plenty of space left on /home (happens to me all the time). The opposite scenario is a problem too. Separate mountpoints by default is just not a good default, sorry. Ensuring users don't run out of space due to bad partitioning is more important than keeping /home during reinstall IMO. But with btrfs, then /home will just be a subvolume so we can have our cake and eat it too.
This can be mitigated with directory (project) quotas, btw.
On XFS, exceeding a directory tree quota even yields ENOSPC. (on ext4, it's EDQUOT right now.)
So one big / partition including /home, with a directory quota set on /home at 20G, will yield ENOSPC when home contains 20G and will not allow / to get filled with user files.
It's also trivial to adjust the directory quota on /home up or down, as needed.
It's another cake eating-and-having option which is a pretty trivial thing to implement.
This does not solve the "Anaconda will blow away /home because it's technically part of /" problem, though. Btrfs subvolumes do.
Directory quotas only protect against space contention, and while Btrfs quotas do the same thing, we're deliberately not proposing setting those up because we want space allocation to be flexible.
I was not proposing directory quotas as any protection against mkfs of the root device, of course. Changing that behavior in Anaconda would be another rather minor change as well, i.e. the equivalent of "rm -rf /usr /var/ ..." instead of mkfs at reinstall time.
-Eric
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
Given the number of Fedora desktop users, even an increase of 0.1% in now-I-can't-boot situations would be a catastrophe. Is that a risk? I literally don't know. Maybe it's not -- but we've worked hard to get Fedora a reputation of being problem-free and something that leads without being "bleeding edge". It's a tricky balance.
Normally we just switch the default or we don't, without half measures. But the fs is important enough and complicated enough to be extra careful about any transitions.
Exactly.
Maybe we could add an "Automatically configure with btrfs (experimental)" option to the Installation Destination screen, and then feature that in Fedora Magazine and schedule a number of test days?
To be clear, I'm not suggesting this as a blocking tactic. The assumption would be that we'd go ahead with flipping the defaults (as you say above) for F34 unless the results come back in a way that gives us pause.
On 1 July 2020 20:24:37 CEST, Matthew Miller mattdm@fedoraproject.org wrote:
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
Given the number of Fedora desktop users, even an increase of 0.1% in now-I-can't-boot situations would be a catastrophe. Is that a risk? I literally don't know. Maybe it's not -- but we've worked hard to get Fedora a reputation of being problem-free and something that leads without being "bleeding edge". It's a tricky balance.
Normally we just switch the default or we don't, without half measures. But the fs is important enough and complicated enough to be extra careful about any transitions.
Exactly.
Maybe we could add an "Automatically configure with btrfs (experimental)" option to the Installation Destination screen, and then feature that in Fedora Magazine and schedule a number of test days?
To be clear, I'm not suggesting this as a blocking tactic. The assumption would be that we'd go ahead with flipping the defaults (as you say above) for F34 unless the results come back in a way that gives us pause.
This is pretty much exactly how I would like this to happen. It has a schedule, so it doesn't just slip, while still being as cautious as one should be about fs changes. The only thing that would make it even better is a clear definition of what severity of problem is needed to not make it the default in F34, and what happens then. This is to avoid the inevitable discussion before F34. With this plan I have no problems.
I like this approach, a lot. I'm all in favour of switching to btrfs (I've been using it for a while, on server & desktop), and I think this would be a safe approach to do so.
Christopher
On 01.07.20 20:24, Matthew Miller wrote:
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
Given the number of Fedora desktop users, even an increase of 0.1% in now-I-can't-boot situations would be a catastrophe. Is that a risk? I literally don't know. Maybe it's not -- but we've worked hard to get Fedora a reputation of being problem-free and something that leads without being "bleeding edge". It's a tricky balance.
Normally we just switch the default or we don't, without half measures. But the fs is important enough and complicated enough to be extra careful about any transitions.
Exactly.
Maybe we could add an "Automatically configure with btrfs (experimental)" option to the Installation Destination screen, and then feature that in Fedora Magazine and schedule a number of test days?
To be clear, I'm not suggesting this as a blocking tactic. The assumption would be that we'd go ahead with flipping the defaults (as you say above) for F34 unless the results come back in a way that gives us pause.
On 7/1/20 2:24 PM, Matthew Miller wrote:
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
There's only so much we can do about this. I've sent up patches to ignore failed global trees, so that users can more easily recover data when one of those trees is corrupted, but as they say, if only 1 bit is off in a node, we throw the whole node away. And throwing a node away means you lose access to any of its children, which could be a large chunk of the file system.
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC. We don't know _which_ bits are fucked, we just know something's fucked, so we throw it all away. If you have RAID or DUP then we go read the other copy, and fix the broken copy if we find a good copy. If we don't, well then there's nothing really we can do.
As for their complaint about DIR_INDEX vs DIR_ITEM recovery, that's been around for a while now. A lot of these things have been added over the last year.
Another thing to keep in mind is that fsck is _very_ conservative for a reason. Its only job is to get the fs back to the point that it can be mounted; it has no knowledge of which data is important and which is not. So by default it doesn't do much, because we want the user to be able to use the rescue tools to pull off any data they can before they run repair. It's possible that fsck decides to delete problematic entries, and maybe those entries point to data you cared about.
I've stated this many times before: btrfs is more vulnerable to things going wrong. It's also more likely to notice things going wrong. There are things we can do to make it easier in the face of these issues; they're patches I've written and submitted in the last few days. There are bigger, more complex things that I can do to make us more resilient in the face of these corruptions. But even with all of the things I have in my head, I could still go do one or two things and render the file system unusable. Would these things happen in practice? Unlikely. Is it impossible? Unfortunately no. Thanks,
Josef
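To make the "checksum, not ECC" point concrete, here is a minimal Python sketch. It uses zlib's CRC-32 as a stand-in (Btrfs defaults to crc32c, which is a different polynomial), and `flip_bit` and `read_with_dup` are hypothetical helpers, not Btrfs code. It shows that a checksum detects a single flipped bit without saying where it is, and that a second copy is what makes recovery possible.

```python
import zlib


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with exactly one bit inverted."""
    damaged = bytearray(block)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def read_with_dup(copies, expected_crc):
    """Return the first copy whose CRC matches, or None if every copy is bad."""
    for copy in copies:
        if zlib.crc32(copy) == expected_crc:
            return copy
    return None


node = bytes(range(256)) * 64      # 16 KiB stand-in for a metadata node
stored_crc = zlib.crc32(node)      # checksum recorded at write time
damaged = flip_bit(node, 12345)    # hardware flips a single bit

# The mismatch proves the block is damaged but not which bit is wrong.
detected = zlib.crc32(damaged) != stored_crc

# Single profile: the only copy fails its checksum, so it is discarded.
single_result = read_with_dup([damaged], stored_crc)

# Dup or RAID1 profile: fall back to the other copy.
dup_result = read_with_dup([damaged, node], stored_crc)
```

Here `single_result` is `None` while `dup_result` is the intact node, which mirrors the "go read the other copy" behaviour described above.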
On 7/1/20 3:50 PM, Josef Bacik wrote:
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC.
Yes, exactly---why isn't it ECC? Wouldn't it work better, especially in the context of faulty hardware?
I do realize it would require changing the on-disk format, and maybe slow the critical path...
On Wed, Jul 1, 2020, at 9:03 PM, Przemek Klosowski via devel wrote:
On 7/1/20 3:50 PM, Josef Bacik wrote:
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC.
Yes, exactly---why isn't it ECC? Wouldn't it work better, especially in the context of faulty hardware?
I do realize it would require changing the on-disk format, and maybe slow the critical path...
Or maybe make all metadata raid 1, even on single disk set up?
V/r, James Cassell
On Wed, Jul 1, 2020 at 9:27 PM James Cassell fedoraproject@cyberpear.com wrote:
On Wed, Jul 1, 2020, at 9:03 PM, Przemek Klosowski via devel wrote:
On 7/1/20 3:50 PM, Josef Bacik wrote:
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC.
Yes, exactly---why isn't it ECC? Wouldn't it work better, especially in the context of faulty hardware?
I do realize it would require changing the on-disk format, and maybe slow the critical path...
Or maybe make all metadata raid 1, even on single disk set up?
Not that it isn't interesting, but what would be the mirror target on a single disk setup?
On Wed, Jul 1, 2020, at 9:43 PM, Neal Gompa wrote:
On Wed, Jul 1, 2020 at 9:27 PM James Cassell fedoraproject@cyberpear.com wrote:
On Wed, Jul 1, 2020, at 9:03 PM, Przemek Klosowski via devel wrote:
On 7/1/20 3:50 PM, Josef Bacik wrote:
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC.
Yes, exactly---why isn't it ECC? Wouldn't it work better, especially in the context of faulty hardware?
I do realize it would require changing the on-disk format, and maybe slow the critical path...
Or maybe make all metadata raid 1, even on single disk set up?
Not that it isn't interesting, but what would be the mirror target on a single disk setup?
The idea is that the second copy of metadata on the same disk might still be readable if the first copy has a checksum error caused by faulty hardware. I haven't tried it, but I'd gladly give up a little space for more robustness, especially if btrfs is sensitive to metadata corruption by the hardware. If btrfs demands a separate device for raid1 metadata, I wonder if a small 1G partition could be dedicated for purely mirrored metadata use.
V/r, James Cassell
On Wed, Jul 1, 2020 at 8:24 PM James Cassell fedoraproject@cyberpear.com wrote:
On Wed, Jul 1, 2020, at 9:43 PM, Neal Gompa wrote:
On Wed, Jul 1, 2020 at 9:27 PM James Cassell
Or maybe make all metadata raid 1, even on single disk set up?
Not that it isn't interesting, but what would be the mirror target on a single disk setup?
The idea is that the second copy of metadata on the same disk might still be readable if the first copy has a checksum error caused by faulty hardware. I haven't tried it, but I'd gladly give up a little space for more robustness, especially if btrfs is sensitive to metadata corruption by the hardware. If btrfs demands a separate device for raid1 metadata, I wonder if a small 1G partition could be dedicated for purely mirrored metadata use.
This is called the 'dup' profile in Btrfs: two copies of a block group. It can be set on metadata only, or on both metadata and data block groups. It is the default mkfs option for HDDs. It is not enabled by default on SSDs because the two copies of metadata are written concurrently, i.e. at essentially the exact same time, so they are likely to end up in the same erase block, and typical corruptions affect the whole block. It is therefore widely considered pointless to use dup on flash media. You can use it anyway, either with mkfs, or by converting the block group from the single profile to dup. This is a safe procedure.
On Wed, Jul 1, 2020 at 11:25 PM Chris Murphy lists@colorremedies.com wrote:
This is called 'dup' profile in Btrfs. Two copies of a block group.
Two copies of a block group ^on the same drive.
Chris Murphy wrote on Wed, Jul 01, 2020:
This is called the 'dup' profile in Btrfs: two copies of a block group. It can be set on metadata only, or on both metadata and data block groups. It is the default mkfs option for HDDs. It is not enabled by default on SSDs because the two copies of metadata are written concurrently, i.e. at essentially the exact same time, so they are likely to end up in the same erase block, and typical corruptions affect the whole block. It is therefore widely considered pointless to use dup on flash media. You can use it anyway, either with mkfs, or by converting the block group from the single profile to dup. This is a safe procedure.
Does anyone know if anything in the nvme spec says that creating two namespaces should or could prevent coalescing IO like this? Perhaps if the block size is different?
(This doesn't really help with the default setup case, but it could make sense to split the disk in two, with data single + metadata raid1 over an nvme namespace, for people who can bother creating one. Unfortunately nvme namespaces are rather messy and I don't think autopartitioning tools should mess with that, but having a raid just for metadata is one of btrfs' strengths, so it's a shame to pass on it... Alternatively it would require something like async copyback of the second metadata copy, but that in itself has a lot of other problems and doesn't really look like an option.)
Once upon a time, Josef Bacik josef@toxicpanda.com said:
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC. We don't know _which_ bits are fucked, we just know something's fucked, so we throw it all away. If you have RAID or DUP then we go read the other copy, and fix the broken copy if we find a good copy. If we don't, well then there's nothing really we can do.
That's where an fsck and a lost+found type directory should come into play. Maybe punt to user space, but still try to see what you can make sense of to try to salvage. If you are saying a single bit error in the wrong place can basically lop off a good chunk of a filesystem, then I'm going to say that's not an improvement in reliability.
On 7/1/20 9:49 PM, Chris Adams wrote:
Once upon a time, Josef Bacik josef@toxicpanda.com said:
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC. We don't know _which_ bits are fucked, we just know something's fucked, so we throw it all away. If you have RAID or DUP then we go read the other copy, and fix the broken copy if we find a good copy. If we don't, well then there's nothing really we can do.
That's where an fsck and a lost+found type directory should come into play. Maybe punt to user space, but still try to see what you can make sense of to try to salvage. If you are saying a single bit error in the wrong place can basically lop off a good chunk of a filesystem, then I'm going to say that's not an improvement in reliability.
We do, the recovery tools allow you to just ignore checksums. This is specifically separate from everything else because there's the expectation of results. The user is acknowledging that things are bad and the tools are going to do their very best. If you know you only have a single bit off then hooray, you got everything back (probably), but if not then you don't. Thanks,
Josef
On 7/1/20 2:50 PM, Josef Bacik wrote:
On 7/1/20 2:24 PM, Matthew Miller wrote:
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
There's only so much we can do about this. I've sent up patches to ignore failed global trees, so that users can more easily recover data when one of those trees is corrupted, but as they say, if only 1 bit is off in a node, we throw the whole node away. And throwing a node away means you lose access to any of its children, which could be a large chunk of the file system.
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC. We don't know _which_ bits are fucked, we just know something's fucked, so we throw it all away. If you have RAID or DUP then we go read the other copy, and fix the broken copy if we find a good copy. If we don't, well then there's nothing really we can do.
There is often a path forward when a bad metadata checksum is detected, e.g. in e2fsck:
    scan_extent_node()
    {
    ...
            /* Failed csum but passes checks?  Ask to fix checksum. */
            if (failed_csum &&
                fix_problem(ctx, PR_1_EXTENT_ONLY_CSUM_INVALID, pctx)) {
                    pb->inode_modified = 1;
                    pctx->errcode = ext2fs_extent_replace(ehandle, 0, &extent);
                    if (pctx->errcode)
                            return;
            }

it does similarly for many types of metadata:

    /* inode passes checks, but checksum does not match inode */
    #define PR_1_INODE_ONLY_CSUM_INVALID    0x010068
    --
    /* Inode extent block passes checks, but checksum does not match extent */
    #define PR_1_EXTENT_ONLY_CSUM_INVALID   0x01006A
    --
    /* Inode extended attribute block passes checks, but checksum does not
     * match block. */
    #define PR_1_EA_BLOCK_ONLY_CSUM_INVALID 0x01006C
    --
    /* dir leaf node passes checks, but fails checksum */
    #define PR_2_LEAF_NODE_ONLY_CSUM_INVALID        0x02004D
Does btrfsck really never attempt to salvage a metadata block with a bad CRC by validating its fields?
-Eric
On 7/3/20 9:37 AM, Eric Sandeen wrote:
On 7/1/20 2:50 PM, Josef Bacik wrote:
On 7/1/20 2:24 PM, Matthew Miller wrote:
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
There's only so much we can do about this. I've sent up patches to ignore failed global trees, so that users can more easily recover data when one of those trees is corrupted, but as they say, if only 1 bit is off in a node, we throw the whole node away. And throwing a node away means you lose access to any of its children, which could be a large chunk of the file system.
This sounds like a "wtf, why are you doing this btrfs?" sort of thing, but this is just the reality of using checksums. It's a checksum, not ECC. We don't know _which_ bits are fucked, we just know something's fucked, so we throw it all away. If you have RAID or DUP then we go read the other copy, and fix the broken copy if we find a good copy. If we don't, well then there's nothing really we can do.
There is often a path forward when a bad metadata checksum is detected, e.g. in e2fsck:
    scan_extent_node()
    {
    ...
            /* Failed csum but passes checks?  Ask to fix checksum. */
            if (failed_csum &&
                fix_problem(ctx, PR_1_EXTENT_ONLY_CSUM_INVALID, pctx)) {
                    pb->inode_modified = 1;
                    pctx->errcode = ext2fs_extent_replace(ehandle, 0, &extent);
                    if (pctx->errcode)
                            return;
            }

it does similarly for many types of metadata:

    /* inode passes checks, but checksum does not match inode */
    #define PR_1_INODE_ONLY_CSUM_INVALID    0x010068
    /* Inode extent block passes checks, but checksum does not match extent */
    #define PR_1_EXTENT_ONLY_CSUM_INVALID   0x01006A
    /* Inode extended attribute block passes checks, but checksum does not
     * match block. */
    #define PR_1_EA_BLOCK_ONLY_CSUM_INVALID 0x01006C
    /* dir leaf node passes checks, but fails checksum */
    #define PR_2_LEAF_NODE_ONLY_CSUM_INVALID        0x02004D
Does btrfsck really never attempt to salvage a metadata block with a bad CRC by validating its fields?
No, I suppose we could, I'll add it to the list. Generally speaking if there's a bad checksum detected we just attempt to recover based on what we couldn't get access to. However that's difficult if it's a node. If it's a leaf then usually you just lose some metadata that can be inferred from other data. For example if you lose a leaf in the extent tree, well we can add all that information back once we've scanned the rest of the file system and know what extents are missing in the extent tree.
Same goes for directory items, we detect that we are missing directory items, but we have references for them and so we add the missing directory items that were lost from that corrupt block.
But again, if you lose a node you lose access to many leaves, which makes it more likely we'll lose something, because we'll lose the other information we could use to recover what was lost. The extent tree and checksum trees are exceptions to this, since they can be rebuilt from scratch, provided everything else is fine.
And then if we did decide to validate nodes, we _might_ be ok, but we might end up with old versions of leaves because the node happens to point at something that appears to be correct, but isn't really. Our metadata changes all the time, so it's not outside the realm of possibility that the corruption points at a seemingly valid piece of metadata that isn't valid, and thus makes us do something _really_ wrong. Thanks,
Josef
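The difference between losing a leaf and losing a node can be sketched with a toy tree in Python. The layout below (one root, four interior nodes, four leaves each) and the function `reachable_leaves` are purely illustrative assumptions; real Btrfs trees have far higher fan-out, so an interior node covers many more leaves.

```python
def reachable_leaves(tree, block, bad_blocks):
    """Count leaves reachable from block when blocks in bad_blocks are discarded."""
    if block in bad_blocks:
        return 0                      # failed checksum: whole block thrown away
    children = tree.get(block)
    if children is None:              # no child list: this block is a leaf
        return 1
    return sum(reachable_leaves(tree, c, bad_blocks) for c in children)


# Hypothetical tree: root -> n0..n3, each interior node -> 4 leaves.
tree = {"root": ["n0", "n1", "n2", "n3"]}
for i in range(4):
    tree[f"n{i}"] = [f"leaf{i}.{j}" for j in range(4)]

all_leaves = reachable_leaves(tree, "root", set())            # 16
lose_one_leaf = reachable_leaves(tree, "root", {"leaf2.1"})   # 15
lose_one_node = reachable_leaves(tree, "root", {"n2"})        # 12
```

One bad leaf costs one leaf, but one bad interior node costs every leaf beneath it, along with the cross-references a repair tool would have used to rebuild them.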
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
I ran 50 loops, and got:
46 btrfsck failures
20 mount failures
So it ran btrfs restore 20 times; of those, 11 runs lost all or substantially all of the files; 17 runs lost at least 1/3 of the files.
-Eric
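For readers who want to reason about the numbers in this test, here is a minimal Python sketch. The actual fuzzer is not shown in the thread, so `fuzz_image` is an assumed reconstruction (overwrite randomly chosen bytes with random values), and the 4 KiB block size used for the "bit less than 1%" check is likewise an assumption. The outcome counts are the ones reported above.

```python
import random


def fuzz_image(image: bytearray, n_bytes: int, seed: int) -> list:
    """Overwrite n_bytes distinct, randomly chosen offsets with random values.

    Returns the sorted offsets touched. A random value can equal the original
    byte, so slightly fewer than n_bytes bytes may actually change.
    """
    rng = random.Random(seed)
    offsets = sorted(rng.sample(range(len(image)), n_bytes))
    for off in offsets:
        image[off] = rng.randrange(256)
    return offsets


def max_block_fraction(image_size: int, n_bytes: int, block_size: int) -> float:
    """Upper bound on the fraction of blocks hit (every byte in a different block)."""
    return n_bytes / (image_size // block_size)


GIB = 1024 ** 3
# 2048 bytes over a 1G image, assuming 4 KiB blocks: under 1% of blocks.
print(max_block_fraction(GIB, 2048, 4096))   # 0.0078125

# Small in-memory demonstration instead of a real 1G image.
image = bytearray(64 * 4096)
touched = fuzz_image(image, 8, seed=1)

# Fractions from the reported 50-loop run.
loops, mount_failures = 50, 20
lost_all, lost_third = 11, 17
print(mount_failures / loops)        # 0.4
print(lost_all / mount_failures)     # 0.55
print(lost_third / mount_failures)   # 0.85
```

Read this way, 40% of the fuzzed images failed to mount even after repair, and in 55% of those cases restore recovered little or nothing.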
On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
What's the probability of this kind of corruption occurring in the real world? If the probability is so low it can't practically be computed, how do we assess the risk? And if we can't assess risk, what's the basis of concern?
I ran 50 loops, and got:
46 btrfsck failures
20 mount failures
So it ran btrfs restore 20 times; of those, 11 runs lost all or substantially all of the files; 17 runs lost at least 1/3 of the files.
Josef states reliability of ext4, xfs, and Btrfs are in the same ballpark. He also reports one case in 10 years in which he failed to recover anything. How do you square that with 11 complete failures, trivially produced? Is there even a reason to suspect there's residual risk?
When metadata is single profile, Btrfs is basically an early warning system. The available research on uncorrectable errors, errors that drive ECC does not catch, suggests that users are decently likely to experience at least one block of corruption in the life of the drive. And that it tends to get worse up until drive failure. But there is much less chance to detect this, if the file system isn't also checksumming the vastly larger payload on a drive: the data.
-- Chris Murphy
On Mon, 6 Jul 2020 at 01:19, Chris Murphy lists@colorremedies.com wrote:
On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
What's the probability of this kind of corruption occurring in the real world? If the probability is so low it can't practically be computed, how do we assess the risk? And if we can't assess risk, what's the basis of concern?
Aren't most disk failure tests 'huh it somehow happened at least once and I think this explains all these other failures too?' I know that with giant clusters you can do more testing but you also have a lot of things like
* What is the chance that a disk will die over time? 100%
* What is the chance that a disk died from this particular scenario? 0.00000<maybe put a digit here>%
Reword the question slightly differently:
* What is the chance this disk died from that scenario? 100%.
For the HPC computers we had a score of PhD statisticians coming up with all kinds of papers on disk failure modes which, if asked in one way, would come up with practically 0% odds it would happen. However all of the disk failures had happened at least once over a time frame... sometimes a short one, sometimes a long one, sometimes so often that someone had to retract a paper because it was clear that while the maths said it shouldn't happen .. it did in real life. <welcome to HPC at high altitudes.. cosmic rays, low air pressure, and dry air need to be factored in>
On Mon, Jul 6, 2020 at 9:52 AM Stephen John Smoogen smooge@gmail.com wrote:
On Mon, 6 Jul 2020 at 01:19, Chris Murphy lists@colorremedies.com wrote:
On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
What's the probability of this kind of corruption occurring in the real world? If the probability is so low it can't practically be computed, how do we assess the risk? And if we can't assess risk, what's the basis of concern?
Aren't most disk failure tests 'huh it somehow happened at least once and I think this explains all these other failures too?' I know that with giant clusters you can do more testing but you also have a lot of things like
* What is the chance that a disk will die over time? 100%
* What is the chance that a disk died from this particular scenario? 0.00000<maybe put a digit here>%
Reword the question slightly differently:
* What is the chance this disk died from that scenario? 100%.
Yes. Also in fuzzing there is the concept of "when to stop fuzzing" because it's a rabbit hole, you have to come up for air at some point, and work on other things. But you raise a good and subtle point which is also that ext4 has a very good fsck built up over decades, they succeed today from past failures. It's no different with Btrfs.
But also there is a bias. ext4 needs fsck to succeed in the worst cases in order to mount the file system. Btrfs doesn't need that. Often it can tolerate a read-only mount without any other mount option; and optionally can be made more tolerant to errors while still mounting read-only. This is a significant difference in recovery strategy. An fsck is something of a risk because it is writing changes to the file system. It is irreversible. Btrfs takes a different view, which is to increase the chance of recovery without needing a risky repair as the first step. Once your important data is out, now try the repair. Good chance it works, but maybe not as good as ext4's.
-- Chris Murphy
On 7/6/20 8:21 PM, Chris Murphy wrote:
...
Yes. Also in fuzzing there is the concept of "when to stop fuzzing" because it's a rabbit hole, you have to come up for air at some point, and work on other things. But you raise a good and subtle point which is also that ext4 has a very good fsck built up over decades, they succeed today from past failures. It's no different with Btrfs.
But also there is a bias. ext4 needs fsck to succeed in the worst cases in order to mount the file system.
Really?
Btrfs doesn't need that. Often it can tolerate a read-only mount without any other mount option;
Well, this assertion can be tested, so let's do that as well; I'll do 50 runs of:
* mkfs w/ -m single as would happen on SSD
* fuzz 2048 bytes of that 1G image at random
* mount -o ro, tally mount failures
* count missing/unreachable files if mount -o ro succeeds
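The thread does not show the fuzzing tool itself, so the following is only a sketch of what the "fuzz 2048 bytes at random" step could look like. The function name `fuzz_image` and its parameters are hypothetical; it overwrites bytes at distinct random offsets of an existing image file.

```python
import os
import random


def fuzz_image(path, n_bytes, seed=None):
    """Overwrite n_bytes distinct, randomly chosen bytes of the file at path.

    Hypothetical helper, not the tool used in the thread. Returns the sorted
    list of corrupted offsets so a run can be reproduced or inspected.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    # sample() picks distinct offsets without materializing the whole range.
    offsets = rng.sample(range(size), n_bytes)
    with open(path, "r+b") as image:
        for offset in offsets:
            image.seek(offset)
            # The random value may equal the original byte, so slightly
            # fewer than n_bytes bytes may actually change.
            image.write(bytes([rng.randrange(256)]))
    return sorted(offsets)
```

A run matching the list above would call something like `fuzz_image("test.img", 2048)` on a freshly created image (the file name is illustrative) before attempting the read-only mount.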
<50 runs later on btrfs>
* 16 readonly mounts failed (32% failure rate)
* Within the successful mounts, 1 or more files were unreachable in 30 attempts.
* Across all 50 attempts, 7720 files were lost.
Is that better than ext4, and will ext4 need fsck just to be able to mount?
<50 runs later on ext4, same strategy>
* zero mount failures for ext4.
* Within the successful mounts, 1 or more files were unreachable in 2 attempts.
* Across all 50 attempts, 48 files were lost.
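To put the two sets of read-only mount results side by side, here is a small tally sketch. It uses only the counts reported above; the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoMountResult:
    """Counts from one 50-run "mount -o ro" fuzz experiment (names are illustrative)."""
    runs: int
    mount_failures: int
    runs_with_unreachable_files: int
    files_lost: int

    @property
    def mount_failure_rate(self):
        return self.mount_failures / self.runs

    @property
    def successful_mounts(self):
        return self.runs - self.mount_failures

    @property
    def files_lost_per_run(self):
        return self.files_lost / self.runs


btrfs = RoMountResult(runs=50, mount_failures=16,
                      runs_with_unreachable_files=30, files_lost=7720)
ext4 = RoMountResult(runs=50, mount_failures=0,
                     runs_with_unreachable_files=2, files_lost=48)

for name, result in (("btrfs", btrfs), ("ext4", ext4)):
    print(f"{name}: {result.mount_failure_rate:.0%} mount failures, "
          f"{result.runs_with_unreachable_files}/{result.successful_mounts} "
          f"successful mounts with unreachable files, "
          f"{result.files_lost_per_run:.2f} files lost per run")
```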
It does not seem that btrfs has any unique or superior mount -o ro recovery capabilities, either.
and optionally can be made more tolerant to errors while still mounting read-only. This is a significant difference in recovery strategy. An fsck is something of a risk because it is writing changes to the file system. It is irreversible. Btrfs takes a different view, which is to increase the chance of recovery without needing a risky repair as the first step. Once your important data is out, now try the repair. Good chance it works, but maybe not as good as ext4's.
That's not supported by any of these test results.
-Eric
On 7/9/20 2:24 PM, Eric Sandeen wrote:
<50 runs later on btrfs>
* 16 readonly mounts failed (32% failure rate)
* Within the successful mounts, 1 or more files were unreachable in 30 attempts.
* Across all 50 attempts, 7720 files were lost.
Is that better than ext4, and will ext4 need fsck just to be able to mount?
<50 runs later on ext4, same strategy>
* zero mount failures for ext4.
* Within the successful mounts, 1 or more files were unreachable in 2 attempts.
* Across all 50 attempts, 48 files were lost.
But for that test to be meaningful, you need to check that the files that ext4 recovers are actually what you expect---after all, if the metadata is damaged and repaired incorrectly, it could point to some random blocks and we'd never know. This is not just theoretical concern---I have seen this type of damage in fsck'ed systems, although I admit it has been long ago. The type of damage might be tricky---for instance part of the file would be correct, but other parts would be wrong, or the file would be truncated.
Btrfs will just give up if it screws up. You could see it as good or bad---after all, if a disk holding your pictures went bad, maybe it is useful to see partially damaged pictures, rather than having the filesystem throw up its hands. On the other hand, btrfs being harsh like that basically sends the message to 'backup or else', which may be the right thing in the end.
On 7/6/20 12:07 AM, Chris Murphy wrote:
On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
What's the probability of this kind of corruption occurring in the real world? If the probability is so low it can't practically be computed, how do we assess the risk? And if we can't assess risk, what's the basis of concern?
From 20 years of filesystem development experience, I know that people run filesystem repair tools. It's just a fact. For a wide variety of reasons - from bugs, to hardware errors, to admin errors, you name it, filesystems experience corruption and inconsistencies. At that point the administrator needs a path forward.
"people won't need to repair btrfs" is, IMHO, the position that needs to be supported, not "filesystem repair tools should be robust."
I ran 50 loops, and got:
* 46 btrfsck failures
* 20 mount failures
So it ran btrfs restore 20 times; of those, 11 runs lost all or substantially all of the files; 17 runs lost at least 1/3 of the files.
Josef states reliability of ext4, xfs, and Btrfs are in the same ballpark. He also reports one case in 10 years in which he failed to recover anything. How do you square that with 11 complete failures, trivially produced? Is there even a reason to suspect there's residual risk?
Extrapolating from Facebook's usecases to the fedora desktop should be approached with caution, IMHO.
I've provided evidence that if/when damage happens for whatever reason, btrfs is unable to recover in place far more often than other filesystems.
When metadata is single profile, Btrfs is basically an early warning system. The available research on uncorrectable errors, errors that drive ECC does not catch, suggests that users are decently likely to experience at least one block of corruption in the life of the drive. And that it tends to get worse up until drive failure. But there is much less chance to detect this, if the file system isn't also checksumming the vastly larger payload on a drive: the data.
One of the problems in this whole discussion is the assumption that filesystem inconsistencies only arise from disk bitflips etc; that's just not the case.
Look, I'm just providing evidence of what I've found when re-evaluating the btrfs administration/repair tools. I've found them to be quite weak.
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
-Eric
On 7/9/20 1:51 PM, Eric Sandeen wrote:
On 7/6/20 12:07 AM, Chris Murphy wrote:
On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
What's the probability of this kind of corruption occurring in the real world? If the probability is so low it can't practically be computed, how do we assess the risk? And if we can't assess risk, what's the basis of concern?
From 20 years of filesystem development experience, I know that people run filesystem repair tools. It's just a fact. For a wide variety of reasons - from bugs, to hardware errors, to admin errors, you name it, filesystems experience corruption and inconsistencies. At that point the administrator needs a path forward.
"people won't need to repair btrfs" is, IMHO, the position that needs to be supported, not "filesystem repair tools should be robust."
I ran 50 loops, and got:
* 46 btrfsck failures
* 20 mount failures
So it ran btrfs restore 20 times; of those, 11 runs lost all or substantially all of the files; 17 runs lost at least 1/3 of the files.
Josef states reliability of ext4, xfs, and Btrfs are in the same ballpark. He also reports one case in 10 years in which he failed to recover anything. How do you square that with 11 complete failures, trivially produced? Is there even a reason to suspect there's residual risk?
Extrapolating from Facebook's usecases to the fedora desktop should be approached with caution, IMHO.
I've provided evidence that if/when damage happens for whatever reason, btrfs is unable to recover in place far more often than other filesystems.
When metadata is single profile, Btrfs is basically an early warning system. The available research on uncorrectable errors, errors that drive ECC does not catch, suggests that users are decently likely to experience at least one block of corruption in the life of the drive. And that it tends to get worse up until drive failure. But there is much less chance to detect this, if the file system isn't also checksumming the vastly larger payload on a drive: the data.
One of the problems in this whole discussion is the assumption that filesystem inconsistencies only arise from disk bitflips etc; that's just not the case.
Look, I'm just providing evidence of what I've found when re-evaluating the btrfs administration/repair tools. I've found them to be quite weak.
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
Facebook does however have that data, and it's a microscopically small percentage. I agree that Facebook is vastly different from Fedora from a recovery standpoint, but our workloads and hardware I think extrapolate to the normal Fedora user quite well. We drive the disks harder than the normal Fedora user does of course, but in the end we're updating packages, taking snapshots, and building code. We're just doing it at 1000x what a normal Fedora user does. Thanks,
Josef
On 7/9/20 2:11 PM, Josef Bacik wrote:
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
But again, let me reiterate that disk failures are far from the only reason that admins need capable filesystem repair tools, in general.
We see users running fsck all the time, for various reasons. I can't back it up, but my hunch is that bugs and misconfigurations (e.g. write cache) are more often the root cause for filesystem inconsistencies.
IMHO, focusing on physical disk failure rates is focusing too narrowly, but I suppose I'm just joining the chorus of hunches and anecdotes now.
-Eric
Facebook does however have that data, and it's a microscopically small percentage. I agree that Facebook is vastly different from Fedora from a recovery standpoint, but our workloads and hardware I think extrapolate to the normal Fedora user quite well. We drive the disks harder than the normal Fedora user does of course, but in the end we're updating packages, taking snapshots, and building code. We're just doing it at 1000x what a normal Fedora user does. Thanks,
Josef
On Thu, 2020-07-09 at 12:56 -0700, Eric Sandeen wrote:
On 7/9/20 2:11 PM, Josef Bacik wrote:
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
But again, let me reiterate that disk failures are far from the only reason that admins need capable filesystem repair tools, in general.
We see users running fsck all the time, for various reasons. I can't back it up, but my hunch is that bugs and misconfigurations (e.g. write cache) are more often the root cause for filesystem inconsistencies.
IMHO, focusing on physical disk failure rates is focusing too narrowly, but I suppose I'm just joining the chorus of hunches and anecdotes now.
Anecdata, but I use raid-1 on all my disks (since a catastrophic failure 20 years ago) and that shielded me from all disk failures since then (although I may have had silent corruption over the years, I never lost any really important data that way; some pictures may have been lost, but that has been inconsequential for me).
However I have had bad kernels, power outages, loss of battery power (laptops left suspended too long) and other random reasons to force reboot a system. That has been the primary cause of file system checks through my Fedora usage. And luckily so far I never had a loss of filesystem or data that way; fsck always ended up solving most of the issues, and whenever I lost files they ended up being temporary files I did not care for.
I do not think those failures are common in Facebook fleets, so I am quite skeptical FB data and failure modes are representative of Fedora usage as a desktop/laptop OS and therefore of the behavior of btrfs in those cases.
Note, not saying btrfs should be avoided or anything, just that we need more data about those failure modes and how they affect btrfs before a change of defaults.
My 2c, Simo.
On Thu, 2020-07-09 at 16:15 -0400, Simo Sorce wrote:
However I have had bad kernels, power outages, loss of battery power (laptops left suspended too long) and other random reasons to force reboot a system. That has been the primary cause of file system checks through my Fedora usage. And luckily so far I never had a loss of filesystem or data that way; fsck always ended up solving most of the issues, and whenever I lost files they ended up being temporary files I did not care for.
I do not think those failures are common in Facebook fleets, so I am quite skeptical FB data and failure modes are representative of Fedora usage as a desktop/laptop OS and therefore of the behavior of btrfs in those cases.
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
Cheers Davide
On Thu, 2020-07-09 at 13:32 -0700, Davide Cavalca via devel wrote:
On Thu, 2020-07-09 at 16:15 -0400, Simo Sorce wrote:
However I have had bad kernels, power outages, loss of battery power (laptops left suspended too long) and other random reasons to force reboot a system. That has been the primary cause of file system checks through my Fedora usage. And luckily so far I never had a loss of filesystem or data that way; fsck always ended up solving most of the issues, and whenever I lost files they ended up being temporary files I did not care for.
I do not think those failures are common in Facebook fleets, so I am quite skeptical FB data and failure modes are representative of Fedora usage as a desktop/laptop OS and therefore of the behavior of btrfs in those cases.
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
Oh this is really good to know, it is more reassuring!
Simo.
On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
On Thu, 2020-07-09 at 16:15 -0400, Simo Sorce wrote:
However I have had bad kernels, power outages, loss of battery power (laptops left suspended too long) and other random reasons to force reboot a system. That has been the primary cause of file system checks through my Fedora usage. And luckily so far I never had a loss of filesystem or data that way; fsck always ended up solving most of the issues, and whenever I lost files they ended up being temporary files I did not care for.
I do not think those failures are common in Facebook fleets, so I am quite skeptical FB data and failure modes are representative of Fedora usage as a desktop/laptop OS and therefore of the behavior of btrfs in those cases.
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs do not suffer filesystem corruptions and inconsistencies due to reboots and power losses.
So for the record I am in no way insinuating that btrfs is less crash-safe than other filesystems (though I have not tested that, so if I have time I'll throw that into the mix as well.)
We do at times see corrupted filesystems when something has a writeback cache w/o a battery backup, though, because then the hardware violates its guarantees to the filesystem.... this is the sort of thing I'd put in the "misconfiguration" bucket. Which happens from time to time, and from which it is nice to be able to recover w/o heroics.
-Eric
On 7/9/20 4:27 PM, Eric Sandeen wrote:
On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
...
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs do not suffer filesystem corruptions and inconsistencies due to reboots and power losses.
So for the record I am in no way insinuating that btrfs is less crash-safe than other filesystems (though I have not tested that, so if I have time I'll throw that into the mix as well.)
So, we already have those tests in xfstests, and I put btrfs through a few loops. This is generic/475:
 # Copyright (c) 2017 Oracle, Inc. All Rights Reserved.
 #
 # FS QA Test No. 475
 #
 # Test log recovery with repeated (simulated) disk failures. We kick
 # off fsstress on the scratch fs, then switch out the underlying device
 # with dm-error to see what happens when the disk goes down. Having
 # taken down the fs in this manner, remount it and repeat. This test
 # is a Good Enough (tm) simulation of our internal multipath failure
 # testing efforts.
It fails within 2 loops. Is it a critical failure? I don't know; the test looks for unexpected things in dmesg, and perhaps the filter is wrong. But I see stack traces during the run, and message like:
[689284.484258] BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
so I can't say for sure.
Are btrfs devs using these tests to assess crash/powerloss resiliency on a regular basis? TBH I honestly did not expect to see any test failures here, whether or not they are test artifacts; any filesystem using xfstests as a benchmark needs to be keeping things up to date.
As a further test, I skipped the dmesg check, which may or may not be finding false positives, and replaced it with a mount/umount/check cycle. That seems to pass, so if fsck validation is complete and correct, perhaps all is well in this regard.
-Eric
On 7/9/20 7:23 PM, Eric Sandeen wrote:
On 7/9/20 4:27 PM, Eric Sandeen wrote:
On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
...
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs do not suffer filesystem corruptions and inconsistencies due to reboots and power losses.
So for the record I am in no way insinuating that btrfs is less crash-safe than other filesystems (though I have not tested that, so if I have time I'll throw that into the mix as well.)
So, we already have those tests in xfstests, and I put btrfs through a few loops. This is generic/475:
 # Copyright (c) 2017 Oracle, Inc. All Rights Reserved.
 #
 # FS QA Test No. 475
 #
 # Test log recovery with repeated (simulated) disk failures. We kick
 # off fsstress on the scratch fs, then switch out the underlying device
 # with dm-error to see what happens when the disk goes down. Having
 # taken down the fs in this manner, remount it and repeat. This test
 # is a Good Enough (tm) simulation of our internal multipath failure
 # testing efforts.
It fails within 2 loops. Is it a critical failure? I don't know; the test looks for unexpected things in dmesg, and perhaps the filter is wrong. But I see stack traces during the run, and message like:
[689284.484258] BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Yeah, because dm-error throws EIO, and thus we abort the transaction, which results in an EUCLEAN if you run fsync. This is a scary sounding message, but it's _exactly_ what's expected from generic/475. I've been running this in a loop for an hour and the thing hasn't failed yet. There's all sorts of scary messages:
[17929.939871] BTRFS warning (device dm-13): direct IO failed ino 261 rw 1,34817 sector 0xb8ce0 len 24576 err no 10
[17929.943099] BTRFS: error (device dm-13) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
again, totally expected because we're forcing EIO's at random times.
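For readers unfamiliar with the numbers in those log lines, the errno values can be decoded with a few lines of Python. This is only an illustrative sketch: the helper name `decode_btrfs_errno` and its regular expression are assumptions made for this example, and the symbolic names come from the host's errno table (EUCLEAN is Linux-specific, so the name may differ elsewhere).

```python
import errno
import os
import re

# Hypothetical helper for this example: find "errno=-N" in a kernel log
# line and translate N using the errno table of the machine running it.
_ERRNO_RE = re.compile(r"errno=-(\d+)")


def decode_btrfs_errno(line):
    """Return (number, symbolic name, description), or None if no errno is present."""
    match = _ERRNO_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    name = errno.errorcode.get(number, "UNKNOWN")
    return number, name, os.strerror(number)


if __name__ == "__main__":
    samples = [
        "BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted",
        "BTRFS: error (device dm-13) in btrfs_commit_transaction:2323: errno=-5 IO failure",
    ]
    for sample in samples:
        print(decode_btrfs_errno(sample))
```

On a Linux host this should report EUCLEAN for 117 and EIO for 5, which matches the explanation above: the injected I/O error aborts the transaction, and the later fsync reports EUCLEAN.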
so I can't say for sure.
Are btrfs devs using these tests to assess crash/powerloss resiliency on a regular basis? TBH I honestly did not expect to see any test failures here, whether or not they are test artifacts; any filesystem using xfstests as a benchmark needs to be keeping things up to date.
It depends on the config options. Some of our transaction abort sites dump stack, and that trips the dmesg filter, and thus it fails. Generally when I run this test I turn those options off.
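To make the "dmesg filter" idea concrete, here is a minimal sketch of how a harness could flag a run as failed because of kernel log content. It is not the real xfstests filter: the pattern list, the function name `unexpected_dmesg_lines`, and the `allowed` parameter are all assumptions for illustration.

```python
import re

# Illustrative patterns only; the actual xfstests filter is not reproduced here.
SUSPICIOUS = [re.compile(p) for p in (r"Call Trace:", r"WARNING:", r"kernel BUG at")]


def unexpected_dmesg_lines(lines, allowed=()):
    """Return the log lines that would make a (hypothetical) harness fail the test.

    `allowed` is a sequence of regular expressions for messages the tester
    has decided are expected, for example stack dumps from known abort sites.
    """
    allowed_res = [re.compile(p) for p in allowed]
    hits = []
    for line in lines:
        if any(a.search(line) for a in allowed_res):
            continue
        if any(s.search(line) for s in SUSPICIOUS):
            hits.append(line)
    return hits
```

With a filter like this, plain I/O error messages pass, while a stack dump printed at a transaction abort site trips the check unless it is explicitly allowed. That is one way the same test can pass on one configuration and fail on another.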
This test is run constantly by us, specifically because it's the error cases that get you. But not for crash consistency reasons, because we're solid there. I run them to make sure I don't have stupid things like reference leaks or whatever in the error path. Thanks,
Josef
On 7/9/20 8:22 PM, Josef Bacik wrote:
On 7/9/20 7:23 PM, Eric Sandeen wrote:
On 7/9/20 4:27 PM, Eric Sandeen wrote:
On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
...
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs do not suffer filesystem corruptions and inconsistencies due to reboots and power losses.
So for the record I am in no way insinuating that btrfs is less crash-safe than other filesystems (though I have not tested that, so if I have time I'll throw that into the mix as well.)
So, we already have those tests in xfstests, and I put btrfs through a few loops. This is generic/475:
 # Copyright (c) 2017 Oracle, Inc. All Rights Reserved.
 #
 # FS QA Test No. 475
 #
 # Test log recovery with repeated (simulated) disk failures. We kick
 # off fsstress on the scratch fs, then switch out the underlying device
 # with dm-error to see what happens when the disk goes down. Having
 # taken down the fs in this manner, remount it and repeat. This test
 # is a Good Enough (tm) simulation of our internal multipath failure
 # testing efforts.
It fails within 2 loops. Is it a critical failure? I don't know; the test looks for unexpected things in dmesg, and perhaps the filter is wrong. But I see stack traces during the run, and messages like:
[689284.484258] BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
You might want to change that message, then. If it's not corrupted, I'd suggest not doing printk("corrupted!") because that will make people think that it's corrupted, because it says "Filesystem corrupted..." ;)
Yeah, because dm-error throws EIO, and thus we abort the transaction, which results in an EUCLEAN if you run fsync. This is a scary sounding message, but it's _exactly_ what's expected from generic/475. I've been running this in a loop for an hour and the thing hasn't failed yet. There's all sorts of scary messages:
That's weird. The test fails very quickly for me - again, AFAICT it fails due to things in dmesg that aren't recognized as safe by the test harness, but a variety of things - not just stack dumps - seem to trigger the failure.
[17929.939871] BTRFS warning (device dm-13): direct IO failed ino 261 rw 1,34817 sector 0xb8ce0 len 24576 err no 10
[17929.943099] BTRFS: error (device dm-13) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
again, totally expected because we're forcing EIO's at random times.
Right, of course it will get IO errors, that's why I didn't highlight those in my email.
so I can't say for sure.
Are btrfs devs using these tests to assess crash/powerloss resiliency on a regular basis? TBH I honestly did not expect to see any test failures here, whether or not they are test artifacts; any filesystem using xfstests as a benchmark needs to be keeping things up to date.
It depends on the config options. Some of our transaction abort sites dump stack, and that trips the dmesg filter, and thus it fails. Generally when I run this test I turn those options off.
It would be good, in general, to fix up the test for btrfs so that it does not yield false positives, if that's what this is. Otherwise it trains people to ignore it or not run it....
This test is run constantly by us, specifically because it's the error cases that get you. But not for crash consistency reasons, because we're solid there. I run them to make sure I don't have stupid things like reference leaks or whatever in the error path. Thanks,
or "corrupted!" printk()s that terrify the hapless user? ;)
-Eric
On 7/9/20 9:30 PM, Eric Sandeen wrote:
On 7/9/20 8:22 PM, Josef Bacik wrote:
On 7/9/20 7:23 PM, Eric Sandeen wrote:
On 7/9/20 4:27 PM, Eric Sandeen wrote:
On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
...
As someone on one of the teams at FB that has to deal with that, I can assure you all the scenarios you listed can and do happen, and they happen a lot. While we don't have the "laptop's out of battery" issue on the production side, we have plenty of power events and unplanned maintenances that can and will hit live machines and cut power off. Force reboots (triggered by either humans or automation) are also not at all uncommon. Rebuilding machines from scratch isn't free, even with all the automation and stuff we have, so if power loss or reboot events on machines using btrfs caused widespread corruption or other issues I'm confident we'd have found that out pretty early on.
It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs do not suffer filesystem corruptions and inconsistencies due to reboots and power losses.
So for the record I am in no way insinuating that btrfs is less crash-safe than other filesystems (though I have not tested that, so if I have time I'll throw that into the mix as well.)
So, we already have those tests in xfstests, and I put btrfs through a few loops. This is generic/475:
 # Copyright (c) 2017 Oracle, Inc. All Rights Reserved.
 #
 # FS QA Test No. 475
 #
 # Test log recovery with repeated (simulated) disk failures. We kick
 # off fsstress on the scratch fs, then switch out the underlying device
 # with dm-error to see what happens when the disk goes down. Having
 # taken down the fs in this manner, remount it and repeat. This test
 # is a Good Enough (tm) simulation of our internal multipath failure
 # testing efforts.
It fails within 2 loops. Is it a critical failure? I don't know; the test looks for unexpected things in dmesg, and perhaps the filter is wrong. But I see stack traces during the run, and messages like:
[689284.484258] BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
You might want to change that message, then. If it's not corrupted, I'd suggest not doing printk("corrupted!") because that will make people think that it's corrupted, because it says "Filesystem corrupted..." ;)
Yeah probably not the best, but again not something a user will generally see.
Yeah, because dm-error throws EIO, and thus we abort the transaction, which results in an EUCLEAN if you run fsync. This is a scary sounding message, but it's _exactly_ what's expected from generic/475. I've been running this in a loop for an hour and the thing hasn't failed yet. There's all sorts of scary messages:
That's weird. The test fails very quickly for me - again, AFAICT it fails due to things in dmesg that aren't recognized as safe by the test harness, but a variety of things - not just stack dumps - seem to trigger the failure.
Do you know what's tripping it? Because my loop is still running happily along.
[17929.939871] BTRFS warning (device dm-13): direct IO failed ino 261 rw 1,34817 sector 0xb8ce0 len 24576 err no 10
[17929.943099] BTRFS: error (device dm-13) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
again, totally expected because we're forcing EIO's at random times.
Right, of course it will get IO errors, that's why I didn't highlight those in my email.
so I can't say for sure.
Are btrfs devs using these tests to assess crash/powerloss resiliency on a regular basis? TBH I honestly did not expect to see any test failures here, whether or not they are test artifacts; any filesystem using xfstests as a benchmark needs to be keeping things up to date.
It depends on the config options. Some of our transaction abort sites dump stack, and that trips the dmesg filter, and thus it fails. Generally when I run this test I turn those options off.
It would be good, in general, to fix up the test for btrfs so that it does not yield false positives, if that's what this is. Otherwise it trains people to ignore it or not run it....
Except it doesn't, it's not failing for me now. Like I said we pay particularly close attention to this test because it has a habit of finding memory leaks or reference accounting bugs.
This test is run constantly by us, specifically because it's the error cases that get you. But not for crash consistency reasons, because we're solid there. I run them to make sure I don't have stupid things like reference leaks or whatever in the error path. Thanks,
or "corrupted!" printk()s that terrify the hapless user? ;)
I'd love to know what hapless user is running xfstests. Thanks,
Josef
On 7/9/20 9:15 PM, Josef Bacik wrote:
On 7/9/20 9:30 PM, Eric Sandeen wrote:
...
This test is run constantly by us, specifically because it's the error cases that get you. But not for crash consistency reasons, because we're solid there. I run them to make sure I don't have stupid things like reference leaks or whatever in the error path. Thanks,
or "corrupted!" printk()s that terrify the hapless user? ;)
I'd love to know what hapless user is running xfstests. Thanks,
*sigh*
the point is, telling the user "your filesystem is corrupted" if it's not actually corrupted is bad news. Discovering that communication problem via xfstests does not make the concern less valid. I was trying to gently tease you that the test not only discovers leaks, but also discovers terrifyingly inaccurate messages in response to IO errors, but I guess that didn't come through.
Thanks.
-Eric
On Thu, Jul 9, 2020 at 4:16 PM Simo Sorce simo@redhat.com wrote:
On Thu, 2020-07-09 at 12:56 -0700, Eric Sandeen wrote:
On 7/9/20 2:11 PM, Josef Bacik wrote:
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
But again, let me reiterate that disk failures are far from the only reason that admins need capable filesystem repair tools, in general.
We see users running fsck all the time, for various reasons. I can't back it up, but my hunch is that bugs and misconfigurations (i.e. write cache) are more often the root cause for filesystem inconsistencies.
IMHO, focusing on physical disk failure rates is focusing too narrowly, but I suppose I'm just joining the chorus of hunches and anecdotes now.
Anecdata, but I have used RAID-1 on all my disks since a catastrophic failure 20 years ago, and that has shielded me from every disk failure since. I may have had silent corruption over the years, and a picture or two may have been lost that way, but I never lost any really important data, so it has been inconsequential for me.
However I have had bad kernels, power outages, loss of battery power (laptops on too long suspend) and other random reasons to force reboot a system. That has been the primary cause of file system checks through my Fedora usage. And luckily so far I never had a loss of filesystem or data that way; fsck always ended up solving most of the issues, and whenever I lost files they ended up being temporary files I did not care for.
I do not think those failures are common in Facebook fleets, so I am quite skeptical FB data and failure modes are representative of Fedora usage as a desktop/laptop OS and therefore of the behavior of btrfs in those cases.
Note, not saying btrfs should be avoided or anything, just that we need more data about those failure modes and how they affect btrfs before a change of defaults.
Maybe it's not the most helpful anecdotal data, but one of my computers has been suffering through random CPU lockups to the point where everything freezes and I need to reboot. I'm pretty sure there's a fault in RAM or CPU (but with everything soldered on computers these days...), but it's been nice to see that with these issues happening to me fairly frequently (as in now basically weekly) forcing me to power cycle, Btrfs has withstood that perfectly. No data loss, no corruption, no inconsistencies. Everything just works. :)
The last time I had something like this with an ext4 system, it got torched within a month, forcing me to spend money I couldn't really afford to spend to replace the machine. Btrfs is letting me use a bad computer longer. :)
On Thu, Jul 9, 2020 at 1:57 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/9/20 2:11 PM, Josef Bacik wrote:
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
But again, let me reiterate that disk failures are far from the only reason that admins need capable filesystem repair tools, in general.
We see users running fsck all the time, for various reasons. I can't back it up, but my hunch is that bugs and misconfigurations (i.e. write cache) are more often the root cause for filesystem inconsistencies.
IMHO, focusing on physical disk failure rates is focusing too narrowly, but I suppose I'm just joining the chorus of hunches and anecdotes now.
Actually there's quite a lot of evidence of this, even though there's no precise estimate - not least of which these populations are constantly dying and reemerging, and can be batch (firmware version) specific. This is only the most recent such story on linux-btrfs@ (and warning, this reads like an alien autopsy):
https://lore.kernel.org/linux-btrfs/20200708034407.GE10769@hungrycats.org/
fsck.btrfs is a no op, same as fsck.xfs. And recently the actual repair utility dissuades users from running it casually.
COW file systems are different. ZFS has no fsck to speak of, it can be harassed badly by hardware/firmware bugs too, and yet there aren't many people who consider ZFS a problematic file system. How would the story of Btrfs be different either without dm-log-writes to this day, or had it already arrived in 2010?
On Thu, 9 Jul 2020 at 16:49, Chris Murphy lists@colorremedies.com wrote:
On Thu, Jul 9, 2020 at 1:57 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/9/20 2:11 PM, Josef Bacik wrote:
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
But again, let me reiterate that disk failures are far from the only reason that admins need capable filesystem repair tools, in general.
We see users running fsck all the time, for various reasons. I can't back it up, but my hunch is that bugs and misconfigurations (i.e. write cache) are more often the root cause for filesystem inconsistencies.
IMHO, focusing on physical disk failure rates is focusing too narrowly, but I suppose I'm just joining the chorus of hunches and anecdotes now.
Actually there's quite a lot of evidence of this, even though there's no precise estimate - not least of which these populations are constantly dying and reemerging, and can be batch (firmware version) specific. This is only the most recent such story on linux-btrfs@ (and warning, this reads like an alien autopsy):
https://lore.kernel.org/linux-btrfs/20200708034407.GE10769@hungrycats.org/
fsck.btrfs is a no op, same as fsck.xfs. And recently the actual repair utility dissuades users from running it casually.
COW file systems are different. ZFS has no fsck to speak of, it can be harassed badly by hardware/firmware bugs too, and yet there aren't many people who consider ZFS a problematic file system. How would the story of Btrfs be different either without dm-log-writes to this day, or had it already arrived in 2010?
That is because anyone who questions the perfection of ZFS is quickly burned at a stake.
I don't know what it is about filesystems turning into religions that do not brook questioning but what I am seeing in these emails is what turns me off of btrfs every time it is brought up in the same way I couldn't stand reiser, ZFS, or various other filesystems.. I realize filesystems take a lot of faith as people have to put something they value into a leap of faith it will be there the next day.. but it seems to morph quickly into some sort of fanatical evangelical movement.
So a good reason why no one brings it up.. you learn quickly that questioning the perfection of any filesystem will fill your inbox with tirades from people.
On Thu, Jul 9, 2020 at 3:06 PM Stephen John Smoogen smooge@gmail.com wrote:
That is because anyone who questions the perfection of ZFS is quickly burned at a stake.
I think Neal also has a good take on why, which is that it was mostly a closed door development early on, wasn't used on heterogeneous hardware out in the wild, upon release wasn't commonly available for years - and just never really got the same kind of scrutiny and rumor that Btrfs did.
I don't know what it is about filesystems turning into religions that do not brook questioning but what I am seeing in these emails is what turns me off of btrfs every time it is brought up in the same way I couldn't stand reiser, ZFS, or various other filesystems.. I realize filesystems take a lot of faith as people have to put something they value into a leap of faith it will be there the next day.. but it seems to morph quickly into some sort of fanatical evangelical movement.
I've said this same thing in recent weeks. I don't understand it. I don't know if you think I've done this. Certainly my experience over 10 years has been Btrfs developers have been among the least defensive, and the first to say it doesn't meet every use case and of course folks should use the file system that fits their requirements the best.
So a good reason why no one brings it up.. you learn quickly that questioning the perfection of any filesystem will fill your inbox with tirades from people.
Yeah that's kind of an obnoxious pig pen.
-- Chris Murphy
On 7/9/20 3:38 PM, Chris Murphy wrote:
On Thu, Jul 9, 2020 at 1:57 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/9/20 2:11 PM, Josef Bacik wrote:
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
But again, let me reiterate that disk failures are far from the only reason that admins need capable filesystem repair tools, in general.
We see users running fsck all the time, for various reasons. I can't back it up, but my hunch is that bugs and misconfigurations (i.e. write cache) are more often the root cause for filesystem inconsistencies.
IMHO, focusing on physical disk failure rates is focusing too narrowly, but I suppose I'm just joining the chorus of hunches and anecdotes now.
Actually there's quite a lot of evidence of this, even though there's no precise estimate - not least of which these populations are constantly dying and reemerging, and can be batch (firmware version) specific. This is only the most recent such story on linux-btrfs@ (and warning, this reads like an alien autopsy):
https://lore.kernel.org/linux-btrfs/20200708034407.GE10769@hungrycats.org/
fsck.btrfs is a no op, same as fsck.xfs. And recently the actual repair utility dissuades users from running it casually.
Honestly, that's not relevant. They are no-ops because they do not need to be run at boot time after an unclean shutdown, because the filesystems are explicitly designed to handle that. This is clearly stated in the man page, the script itself, and the commit log. In fact fsck.btrfs was copied from fsck.xfs.
(Honestly fsck.ext[34] could be a no-op too, but for $REASONS it chooses to do journal replay in userspace instead, via fsck.)
They are no-ops for this reason, and /not/ because fsck isn't /ever/ expected to be needed.
-Eric
----- Original Message -----
From: "Josef Bacik" josef@toxicpanda.com
To: devel@lists.fedoraproject.org
Sent: Thursday, July 9, 2020 9:11:07 PM
Subject: Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants
On 7/9/20 1:51 PM, Eric Sandeen wrote:
On 7/6/20 12:07 AM, Chris Murphy wrote:
On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen sandeen@redhat.com wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
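The fuzzing step of that procedure can be sketched as follows. The fuzzer actually used in the test is not shown in this thread, so `fuzz_image`, its parameters, and the choice to corrupt distinct offsets are assumptions for illustration only; the repair, mount, and restore steps are left out because they need a real Btrfs image and root privileges.

```python
import os
import random
import tempfile


def fuzz_image(path, n_bytes, seed=None):
    """Corrupt n_bytes distinct, randomly chosen bytes of the file at `path`.

    Returns the sorted list of corrupted offsets. The file size is unchanged.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    offsets = sorted(rng.sample(range(size), n_bytes))
    with open(path, "r+b") as img:
        for off in offsets:
            img.seek(off)
            old = img.read(1)[0]
            img.seek(off)
            # XOR with a non-zero value so the byte is guaranteed to change.
            img.write(bytes([old ^ rng.randrange(1, 256)]))
    return offsets


if __name__ == "__main__":
    # Small stand-in for the 1G image described above.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(1 << 20))
    print(len(fuzz_image(tmp.name, 16, seed=0)), "bytes corrupted")
    os.unlink(tmp.name)
```

Passing a fixed `seed` makes a given corruption pattern reproducible, which helps when one particular run needs to be examined again.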
What's the probability of this kind of corruption occurring in the real world? If the probability is so low it can't practically be computed, how do we assess the risk? And if we can't assess risk, what's the basis of concern?
From 20 years of filesystem development experience, I know that people run filesystem repair tools. It's just a fact. For a wide variety of reasons - from bugs, to hardware errors, to admin errors, you name it, filesystems experience corruption and inconsistencies. At that point the administrator needs a path forward.
"people won't need to repair btrfs" is, IMHO, the position that needs to be supported, not "filesystem repair tools should be robust."
I ran 50 loops, and got:
46 btrfsck failures
20 mount failures
So it ran btrfs restore 20 times; of those, 11 runs lost all or substantially all of the files; 17 runs lost at least 1/3 of the files.
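Expressed as rates, the counts above work out as in this short sketch. The variable names are invented for the example, and it assumes the 17 runs that lost at least 1/3 of the files include the 11 that lost nearly everything; the message does not say so explicitly.

```python
from fractions import Fraction

# Counts quoted in the message above.
loops = 50
fsck_failures = 46
mount_failures = 20       # each of these led to a `btrfs restore` run
restore_lost_most = 11    # lost all or substantially all files
restore_lost_third = 17   # lost at least 1/3 of the files (assumed to include the 11)


def pct(part, whole):
    """Percentage of `whole` represented by `part`, computed exactly."""
    return float(Fraction(part, whole) * 100)


if __name__ == "__main__":
    print("fsck failed:", pct(fsck_failures, loops), "% of loops")
    print("mount failed:", pct(mount_failures, loops), "% of loops")
    print("lost nearly everything:", pct(restore_lost_most, mount_failures),
          "% of restore runs,", pct(restore_lost_most, loops), "% of all loops")
    print("lost at least 1/3:", pct(restore_lost_third, mount_failures),
          "% of restore runs,", pct(restore_lost_third, loops), "% of all loops")
```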
Josef states reliability of ext4, xfs, and Btrfs are in the same ballpark. He also reports one case in 10 years in which he failed to recover anything. How do you square that with 11 complete failures, trivially produced? Is there even a reason to suspect there's residual risk?
Extrapolating from Facebook's usecases to the fedora desktop should be approached with caution, IMHO.
I've provided evidence that if/when damage happens for whatever reason, btrfs is unable to recover in place far more often than other filesystems.
When metadata is single profile, Btrfs is basically an early warning system. The available research on uncorrectable errors, that is, errors that drive ECC does not catch, suggests that users are decently likely to experience at least one block of corruption in the life of the drive, and that it tends to get worse up until drive failure. But there is much less chance to detect this if the file system isn't also checksumming the vastly larger payload on a drive: the data.
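The "early warning" idea, checksumming every block so that silent corruption is noticed on read, can be illustrated with a toy model. This is not Btrfs code: the 4096-byte block size is an assumption, the function names are invented, and `zlib.crc32` (plain CRC-32) stands in for whatever checksum the file system actually uses.

```python
import zlib

BLOCK = 4096  # assumed block size for the example


def checksum_blocks(data, block=BLOCK):
    """Return one CRC-32 per block of `data`."""
    return [zlib.crc32(data[i:i + block]) for i in range(0, len(data), block)]


def corrupted_blocks(data, stored, block=BLOCK):
    """Return indices of blocks whose current checksum differs from the stored one."""
    current = checksum_blocks(data, block)
    return [idx for idx, crc in enumerate(current) if crc != stored[idx]]


if __name__ == "__main__":
    original = bytes(range(256)) * 64          # four 4 KiB blocks
    stored = checksum_blocks(original)
    damaged = bytearray(original)
    damaged[5000] ^= 0x01                      # a single flipped bit in block 1
    print(corrupted_blocks(bytes(damaged), stored))
```

Without the stored checksums, the flipped bit in the example would be returned to the application as if it were valid data.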
One of the problems in this whole discussion is the assumption that filesystem inconsistencies only arise from disk bitflips etc; that's just not the case.
Look, I'm just providing evidence of what I've found when re-evaluating the btrfs administration/repair tools. I've found them to be quite weak.
From what I've gathered from these responses, btrfs is unique in that it is /expected/ that if anything goes wrong, the administrator should be prepared to scrape out remaining data, re-mkfs, and start over. If that's acceptable for the Fedora desktop, that's fine, but I consider it a risk that should not be ignored when evaluating this proposal.
Agreed, it's the very first thing I said when I was asked what are the downsides. There's clearly more work to be done in the recovery arena. How often do disks fail for Fedora? Do we have that data? Is this a real risk? Nobody can say because Fedora doesn't have data.
We see installer bugs semi regularly, that turn out to be storage hardware failures semi regularly (attached journal full of IO errors), so these do happen. Unfortunately I'm afraid there is not an easy way to get a count of these specific bugs over time...
Facebook does however have that data, and it's a microscopically small percentage. I agree that Facebook is vastly different from Fedora from a recovery standpoint, but our workloads and hardware I think extrapolate to the normal Fedora user quite well. We drive the disks harder than the normal Fedora user does of course, but in the end we're updating packages, taking snapshots, and building code. We're just doing it at 1000x what a normal Fedora user does. Thanks,
Josef
On 7/3/20 10:39 PM, Eric Sandeen wrote:
On 7/3/20 1:41 PM, Chris Murphy wrote:
SSDs can fail in weird ways. Some spew garbage as they're failing, some go read-only. I've seen both. I don't have stats on how common it is for an SSD to go read-only as it fails, but once it happens you cannot fsck it. It won't accept writes. If it won't mount, your only chance to recover data is some kind of offline scrape tool. And Btrfs does have a very very good scrape tool, in terms of its success rate - UX is scary. But that can and will improve.
Ok, you and Josef have both recommended the btrfs restore ("scrape") tool as a next recovery step after fsck fails, and I figured we should check that out, to see if that alleviates the concerns about recoverability of user data in the face of corruption.
I also realized that mkfs of an image isn't representative of an SSD system typical of Fedora laptops, so I added "-m single" to mkfs, because this will be the mkfs.btrfs default on SSDs (right?). Based on Josef's description of fsck's algorithm of throwing away any block with a bad CRC this seemed worth testing.
I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G image, or a bit less than 1% of the filesystem blocks, at random. This is 1/4 the fuzzing rate from the original test.
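The "a bit less than 1%" figure can be checked with a quick calculation. The sketch below reads "1G" as 1 GiB and assumes 4 KiB blocks, neither of which is stated in the message; it also shows the same arithmetic for 16 KiB units, the default metadata node size used by mkfs.btrfs. The expected-value formula assumes each fuzzed byte lands independently and uniformly at random.

```python
IMAGE_BYTES = 1 << 30   # "1G" read as 1 GiB (assumption)
FUZZED_BYTES = 2048


def fraction_hit(block_size, fuzzed=FUZZED_BYTES, image=IMAGE_BYTES):
    """Upper bound on the fraction of blocks touched (every byte in a distinct block)."""
    blocks = image // block_size
    return fuzzed / blocks


def expected_blocks_hit(block_size, fuzzed=FUZZED_BYTES, image=IMAGE_BYTES):
    """Expected number of distinct blocks hit by independent uniform random bytes."""
    blocks = image // block_size
    return blocks * (1 - (1 - 1 / blocks) ** fuzzed)


if __name__ == "__main__":
    for size in (4096, 16384):
        print(size, fraction_hit(size), round(expected_blocks_hit(size), 1))
```

With 4 KiB blocks the upper bound is 2048 / 262144, or about 0.78%, consistent with the statement above. Counted in 16 KiB units the same 2048 bytes touch up to about 3.1% of them.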
So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair, mount, mount w/ recovery, and then restore ("scrape") if all that fails, see what we get.
I ran 50 loops, and got:
46 btrfsck failures
20 mount failures
So it ran btrfs restore 20 times; of those, 11 runs lost all or substantially all of the files; 17 runs lost at least 1/3 of the files.
Hmm I wonder if some of my "ignore X failures" stuff got lost over the years, we should be able to recover far more than that. I'll add it to the list of things to dig into this week. Thanks,
Josef
On Fri, Jul 3, 2020 at 9:14 AM Josef Bacik josef@toxicpanda.com wrote:
On 7/3/20 9:37 AM, Eric Sandeen wrote:
Does btrfsck really never attempt to salvage a metadata block with a bad CRC by validating its fields?
No, I suppose we could, I'll add it to the list. Generally speaking if there's a bad checksum detected we just attempt to recover based on what we couldn't get access to. However that's difficult if it's a node. If it's a leaf then usually you just lose some metadata that can be inferred from other data. For example if you lose a leaf in the extent tree, well we can add all that information back once we've scanned the rest of the file system and know what extents are missing in the extent tree.
Same goes for directory items, we detect that we are missing directory items, but we have references for them and so we add the missing directory items that were lost from that corrupt block.
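The directory-item case can be pictured with a toy model in which every file carries a back reference to its parent directory and name. The data layout and the function `rebuild_dir_items` below are invented for illustration; they are not the Btrfs on-disk format or the actual repair code.

```python
def rebuild_dir_items(dir_items, inode_refs):
    """Re-add directory entries that are implied by back references.

    dir_items:  dict mapping (parent_dir, name) -> inode number
    inode_refs: dict mapping inode number -> list of (parent_dir, name)
    Returns the sorted list of (parent_dir, name, inode) entries re-added.
    """
    readded = []
    for inode, refs in inode_refs.items():
        for parent, name in refs:
            if (parent, name) not in dir_items:
                dir_items[(parent, name)] = inode
                readded.append((parent, name, inode))
    return sorted(readded)


if __name__ == "__main__":
    # The entry for "b" was in the corrupt block and is gone,
    # but inode 258 still records that it lives in directory 256 as "b".
    surviving = {(256, "a"): 257}
    refs = {257: [(256, "a")], 258: [(256, "b")]}
    print(rebuild_dir_items(surviving, refs))
```

The same redundancy argument explains the limitation raised next in the thread: the more leaves become unreachable at once, the more likely it is that both copies of a piece of information are gone.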
But again, if you lose a node you lose access to many leaves, which makes it more likely we'll lose something because we'll lose the other information we can use to recover what was lost. The extent tree and checksum trees are exceptions to this, since they can be rebuilt from scratch, provided everything else is fine.
And then if we did decide to validate nodes, we _might_ be ok, but we might end up with old versions of leaves because it happens to point at something that appears to be correct, but isn't really. Our metadata changes all the time, so it's not outside the realm of possibilities that the corruption points at a seemingly valid piece of metadata that isn't, and thus makes us do something _really_ wrong. Thanks,
Maybe it's reasonable to expect 'btrfs check --repair' to look for plausible alternatives when using non-crypto checksums that mismatch. But I'm not certain it's OK when using cryptographic checksums - how do you distinguish between incidental corruption and a malicious attack? The repair might be the attack vector.
On Wed, Jul 01, 2020 at 03:50:37PM -0400, Josef Bacik wrote:
I've stated this many times before, btrfs is more vulnerable to things going wrong. It's also more likely to notice things going wrong. There's things we can do to make it easier in the face of these issues, they're patches I've written and submitted in the last few days. There's bigger, more complex things that I can do to make us more resilient in the face of these corruptions. But even with all of the things I have in my head, I could still go do one or two things and render the file system unusable. Would these things happen in practice? Unlikely. Is it impossible? Unfortunately no.
Thanks Josef. I definitely appreciate your responsiveness here, and your explanation helps me understand things better.
On Wed, 1 Jul 2020 14:24:37 -0400, you wrote:
On Wed, Jul 01, 2020 at 06:54:02AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
Making btrfs opt-in for F33 and (assuming the results go well) opt-out for F34 could be a good option. I know technically it is already opt-in, but it's not very visible or popular. We could make the btrfs option more prominent and ask people to pick it if they are ready to handle potential fallout.
I'm leaning towards recommending this as well. I feel like we don't have good data to make a decision on -- the work that Red Hat did previously when making a decision was 1) years ago and 2) server-focused, and the Facebook production usage is encouraging but also not the same use case. I'm particularly concerned about metadata corruption fragility as noted in the Usenix paper. (It'd be nice if we could do something about that!)
So if one has a spare partition to play with btrfs, is there an easy way to install a second copy of Fedora without having the /boot/efi/ entries overwrite the existing Fedora installation? Or fix it to have 2 separate entries after the fact?
On Mon, 2020-07-06 at 18:48 -0400, Gerald Henriksen wrote:
So if one has a spare partition to play with btrfs, is there an easy way to install a second copy of Fedora without having the /boot/efi/ entries overwrite the existing Fedora installation? Or fix it to have 2 separate entries after the fact?
If you mean the EFI boot manager entry, just renaming the existing one something other than "Fedora" ought to do the trick, I think. So far as /boot/efi goes...well, you have two choices. You can have the two installs share one, or have two separate ones. I *think* both options at least in theory ought to work, I'm not sure if anyone's tested...
On Mon, 2020-07-06 at 16:24 -0700, Adam Williamson wrote:
If you mean the EFI boot manager entry, just renaming the existing one something other than "Fedora" ought to do the trick, I think. So far as /boot/efi goes...well, you have two choices. You can have the two installs share one, or have two separate ones. I *think* both options at least in theory ought to work, I'm not sure if anyone's tested...
actually, no, thinking about it harder, I think sharing one wouldn't work right. I think you have to have a new /boot/efi for the second install, as well as a separate /boot.
On 7/6/20 4:24 PM, Adam Williamson wrote:
If you mean the EFI boot manager entry, just renaming the existing one something other than "Fedora" ought to do the trick, I think. So far as /boot/efi goes...well, you have two choices. You can have the two installs share one, or have two separate ones. I *think* both options at least in theory ought to work, I'm not sure if anyone's tested...
Adding another EFI partition should work, but I don't see how they could share one. There's a single Fedora directory in there, so each install would overwrite the files of the other. The grub.cfg can only point to one set of boot loader entries.
On Mon, Jul 6, 2020 at 5:30 PM Samuel Sieb samuel@sieb.net wrote:
Adding another EFI partition should work, but I don't see how they could share one. There's a single Fedora directory in there, so each install would overwrite the files of the other. The grub.cfg can only point to one set of boot loader entries.
And that means sharing one ESP means sharing /boot - yeah. Hmm. I'll have to test it but I'm pretty sure it's a fairly simple post-install fix to get that to work. I'm not totally certain how blscfg.mod parses /boot/loader/entries containing bls snippets with two machine IDs.
Ohhhh I just thought of something. F32 and older depend on grubenv to store part of the kernel command line variables. So I think it will be necessary to do an F32 F33 side by side install for it to work. Two F32's side by side will compete over the single grubenv.
On Mon, Jul 6, 2020 at 4:48 PM Gerald Henriksen ghenriks@gmail.com wrote:
So if one has a spare partition to play with btrfs, is there an easy way to install a second copy of Fedora without having the /boot/efi/ entries overwrite the existing Fedora installation? Or fix it to have 2 separate entries after the fact?
It's possible but has challenges. Separate ESP's you'll need to either (a) use the firmware's built-in boot manager to choose what will probably appear to be identically named Fedora's (b) add new NVRAM entries, and names, and switch between them before reboot by using efibootmgr --bootorder or --bootnext.
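A minimal shell sketch of option (b), assuming both installs already have their own NVRAM entries. The helper name promote_entry and the entry numbers are placeholders for illustration; --bootnext and --bootorder are the efibootmgr options named above.

```shell
# promote_entry ENTRY ORDER
# Print a comma-separated boot order with ENTRY moved to the front.
# Hypothetical helper: pure string handling, it does not touch NVRAM.
promote_entry() {
    rest=$(printf '%s\n' "$2" | tr ',' '\n' | grep -vix -- "$1" | paste -sd, -)
    if [ -n "$rest" ]; then
        printf '%s,%s\n' "$1" "$rest"
    else
        printf '%s\n' "$1"
    fi
}

# Usage (as root; 0001/0002/0003 are placeholder entry numbers):
#   efibootmgr                     # list entries and the current BootOrder
#   efibootmgr --bootnext 0002     # boot entry 0002 once, on the next reboot only
#   efibootmgr --bootorder "$(promote_entry 0002 0001,0002,0003)"
```

--bootnext is the less invasive choice for occasional switching, since the permanent order is left alone.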
Another option is shared ESP and /boot but my vague recollection is some things go away. For sure /boot/efi/EFI/fedora is replaced, and then possibly /boot/loader/entries are replaced. But that might be easier to deal with than the above, and more efficient.
On Mon, Jul 06, 2020 at 08:06:05PM -0600, Chris Murphy wrote:
It's possible but has challenges. Separate ESP's you'll need to either (a) use the firmware's built-in boot manager to choose what will probably appear to be identically named Fedora's (b) add new NVRAM entries, and names, and switch between them before reboot by using efibootmgr --bootorder or --bootnext.
Another option is shared ESP and /boot but my vague recollection is some things go away. For sure /boot/efi/EFI/fedora is replaced, and then possibly /boot/loader/entries are replaced. But that might be easier to deal with than the above, and more efficient.
This is so sad. Boot Loader Specification was explicitly designed to support parallel installations on a single ESP. (The case of different systems was the goal, but the general logic works for different installations of the same system as well.) BLS entries are stored underneath $ESP/<machine-id>, so different Fedora installations which have different machine-id numbers simply don't conflict. sd-boot just displays the combined list. If two entries happen to be *exactly* the same — same os name, same os version, same kernel version — it'll use the machine-id in the entry title to disambiguate them to the user (*).
There is really no reason for this not to work. If we are considering separate ESPs and efibootmgr to switch between them then something went rather wrong somewhere.
(*) If there are two installations with overlapping kernel versions, the UI is not going to be great, because there will be entries like

Fedora 33 (Workstation) 5.11.21-23.fc33.amd64 08a5690a2eed47cf92ac0a5d2e3cf6b0
Fedora 33 (Workstation) 5.11.21-23.fc33.amd64 949499494994999393939ad2ad99ffff
Fedora 33 (Workstation) 5.10.11-18.fc33.amd64 08a5690a2eed47cf92ac0a5d2e3cf6b0
Fedora 33 (Workstation) 5.10.11-18.fc33.amd64 949499494994999393939ad2ad99ffff

i.e. the entries for the two installations will be interleaved. So the user needs to remember that e.g. 08a5690a2eed47cf92ac0a5d2e3cf6b0 is the installation with ext4 and 949499494994999393939ad2ad99ffff the one with btrfs.
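To avoid having to memorize which machine-id belongs to which installation, the ids can be compared against /etc/machine-id of the running system. The sketch below assumes entry files are named <machine-id>-<kernel-version>.conf and live in a directory such as /boot/loader/entries; both function names are made up for illustration.

```shell
# entry_machine_id PATH
# Print the machine-id prefix of a BLS entry file name.
# Assumes the name has the form <machine-id>-<kernel-version>.conf.
entry_machine_id() {
    name=${1##*/}                 # drop the directory part
    printf '%s\n' "${name%%-*}"   # keep everything before the first "-"
}

# label_entries DIR
# Mark each entry in DIR as belonging to the running install or another one.
label_entries() {
    current=$(cat /etc/machine-id)
    for f in "$1"/*.conf; do
        [ -e "$f" ] || continue
        if [ "$(entry_machine_id "$f")" = "$current" ]; then
            printf 'this install   %s\n' "${f##*/}"
        else
            printf 'other install  %s\n' "${f##*/}"
        fi
    done
}

# Usage: label_entries /boot/loader/entries
```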
Zbyszek
On Tue, 2020-07-07 at 06:02 +0000, Zbigniew Jędrzejewski-Szmek wrote:
This is so sad. Boot Loader Specification was explicitly designed to support parallel installations on a single ESP. (The case of different systems was the goal, but the general logic works for different installations of the same system as well.) BLS entries are stored underneath $ESP/<machine-id>, so different Fedora installations which have different machine-id numbers simply don't conflict. sd-boot just displays the combined list. If two entries happen to be *exactly* the same — same os name, same os version, same kernel version — it'll use the machine-id in the entry title to disambiguate them to the user (*).
There is really no reason for this not to work. If we are considering separate ESPs and efibootmgr to switch between them then something went rather wrong somewhere.
I can't speak for Chris, but I was honestly just gaming it out in my head, trying to think how I'd try it if I was going to do it. I've never actually tried it myself.
On 7 July 2020 18:31:32 CEST, Adam Williamson adamwill@fedoraproject.org wrote:
I can't speak for Chris, but I was honestly just gaming it out in my head, trying to think how I'd try it if I was going to do it. I've never actually tried it myself.
The easy way to do it is to keep the same ESP and solve it with a nice little GRUB config. It works well even between distributions. You can of course break it by having one of the distributions overwrite it wrongly but that's easily fixed and prevented.
M
On Mo, 06.07.20 20:06, Chris Murphy (lists@colorremedies.com) wrote:
It's possible but has challenges. Separate ESP's you'll need to either
Thou shalt not have multiple ESPs per disk. See:
https://news.ycombinator.com/item?id=16261695
The EFI spec is kinda vague about it, but it breaks everywhere, in particular with Windows.
Lennart
-- Lennart Poettering, Berlin
On Tue, Jul 7, 2020 at 9:25 AM Lennart Poettering mzerqung@0pointer.de wrote:
Thou shalt not have multiple ESPs per disk. See:
https://news.ycombinator.com/item?id=16261695
The EFI spec is kinda vague about it, but it breaks everywhere, in particular with Windows.
The Windows *installer* doesn't like it. I'm not aware of Windows itself having difficulty with it. I have tested this layout. But it could be that UEFI bugs abound.
In any case, this is not general advocacy of two ESPs, so thanks for the criticism. To be clear, I offer it only for advanced users who are somewhat prepared for confusion of unknown manifestation. :)
On Mon, 2020-07-06 at 20:06 -0600, Chris Murphy wrote:
It's possible but has challenges. Separate ESP's you'll need to either (a) use the firmware's built-in boot manager to choose what will probably appear to be identically named Fedora's
No, you have to rename the first one before doing the second install. anaconda explicitly deletes any existing efibootmgr entry named "Fedora" before creating a new one.
On Tue, Jul 7, 2020, at 12:30 PM, Adam Williamson wrote:
No, you have to rename the first one before doing the second install. anaconda explicitly deletes any existing efibootmgr entry named "Fedora" before creating a new one.
Any idea if this process is documented?
I typically install on a laptop, with the "encrypt my data" option.
I can confirm that the only way to successfully have 2 side-by-side Fedora installs with UEFI, using only Anaconda to set it up, is to have 2 separate physical disks, and choose which physical disk to boot by hitting F12 at machine power on.
Any attempts to share /boot result in at least one of the installs being broken.
Any attempts to share /boot/efi breaks at least fedora-by-fedora installs.
Adding a separate /boot/efi partition for the second Fedora install makes the resulting system usable on the new Fedora install, but there is no obvious way to boot into the older Fedora install.
If you unlock the disks within Anaconda for the existing Fedora install, grub gets boot entries for that install, but they are non-functional. (No password is prompted for unlocking the disk, indefinite hang.)
What /does/ seem to work is having RHEL and Fedora side-by-side on the same disk, as long as each has its own /boot and /boot/efi partitions.
Generally, I'd like the fedora-by-fedora parallel installs to work better because that's how I'm best able to participate in the Test Matrix.
V/r, James Cassell
On Wed, 2020-07-08 at 17:23 -0400, James Cassell wrote:
Any idea if this process is documented?
I think it's `efibootmgr -b XXXX -L DefinitelyNotFedora`, where XXXX is the number of the entry called 'Fedora', which you could find by just running `efibootmgr` to get a list of entries. -b selects the entry to operate on and -L changes the 'label', which I think is what we're dealing with here.
If you do that before doing the second install, you *should* be able to choose between them by using whatever mechanism your firmware offers to select an EFI boot manager entry at boot time. The one called DefinitelyNotFedora would be the first install, the one called Fedora would be the second install.
On 08.07.20 23:47, Adam Williamson wrote:
I think it's `efibootmgr -b XXXX -L DefinitelyNotFedora`, where XXXX is the number of the entry called 'Fedora', which you could find by just running `efibootmgr` to get a list of entries. -b selects the entry to operate on and -L changes the 'label', which I think is what we're dealing with here.
AFAIK efibootmgr can't change the label of an existing entry, but you can delete it and then recreate it with the new name.
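If relabeling in place is not available, the delete-and-recreate route could look roughly like the sketch below. The helper find_boot_num is hypothetical and only parses efibootmgr's listing. The disk, partition number, loader path and new label in the usage comments are placeholders that must be adapted to the actual system.

```shell
# find_boot_num LABEL
# Read efibootmgr-style output on stdin and print the four-digit number
# of every entry whose label is exactly LABEL.
find_boot_num() {
    awk -v label="$1" '
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            num  = substr($0, 5, 4)
            rest = substr($0, 9)
            sub(/^[* ] /, "", rest)   # drop the active marker and separator
            sub(/\t.*/, "", rest)     # drop a trailing device path, if printed
            if (rest == label) print num
        }'
}

# Usage (as root, BEFORE starting the second install; values are placeholders):
#   num=$(efibootmgr | find_boot_num Fedora)
#   efibootmgr -b "$num" -B                    # delete the old "Fedora" entry
#   efibootmgr -c -d /dev/sda -p 1 \
#       -L DefinitelyNotFedora -l '\EFI\fedora\shimx64.efi'   # recreate it under a new label
```

Note that a newly created entry may be placed first in the boot order, so it is worth checking the listing again afterwards.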
On Monday 29 June 2020 at 10:26 -0600, Chris Murphy wrote:
Come on. It's cleanly unmounted and doesn't mount?
I guess you missed the other emails about dm-log-writes and xfstests, but they directly relate here. Josef relayed that all of his deep dives into Btrfs failures since the dm-log-writes work, have all been traced back to hardware doing the wrong thing.
However, software does not work without hardware below it, and having storage software that is intolerant of real-world hardware deficiencies, or pushes this hardware in modes the hardware vendor did not test fully, does not bode well for the reliability of the integrated software+hardware system.
On Tue, Jun 30, 2020 at 7:45 AM Nicolas Mailhot via devel devel@lists.fedoraproject.org wrote:
However, software does not work without hardware below it, and having storage software that is intolerant of real-world hardware deficiencies, or pushes this hardware in modes the hardware vendor did not test fully, does not bode well for the reliability of the integrated software+hardware system.
Josef has already said multiple times in the thread that Btrfs has been used on tons of hardware as well, so I do not think this is a problem. It is more sensitive to hardware faults, which I believe is something we should have, so that we don't suffer silent data loss.
Maybe not a desktop question, but do you know btrfs's change attribute/i_version status? Does it default to bumping i_version on each change, or does that still need to be opted in? And has anyone measured the performance delta (i_version vs. noi_version) recently?
--b.
On 6/29/20 12:23 PM, J. Bruce Fields wrote:
Maybe not a desktop question, but do you know btrfs's change attribute/i_version status? Does it default to bumping i_version on each change, or does that still need to be opted in? And has anyone measured the performance delta (i_version vs. noi_version) recently?
Yeah it defaults to bumping it all the time, we just use the normal inode changing infrastructure so it gets updated the same way everybody else does. AFAIK there's no way to opt out of it, unless there's a -o noiversion that exists? Thanks,
Josef
On Mon, Jun 29, 2020 at 01:33:37PM -0400, Josef Bacik wrote:
Yeah it defaults to bumping it all the time, we just use the normal inode changing infrastructure so it gets updated the same way everybody else does. AFAIK there's no way to opt out of it, unless there's a -o noiversion that exists?
Yeah, there's a -noiversion.
I decided I should actually go check, and a btrfs filesystem created and mounted with defaults did look like it was doing this right. Good!
--b.
On Mon, Jun 29, 2020 at 01:33:37PM -0400, Josef Bacik wrote:
Yeah it defaults to bumping it all the time, we just use the normal inode changing infrastructure so it gets updated the same way everybody else does. AFAIK there's no way to opt out of it, unless there's a -o noiversion that exists?
There's both an iversion and noiversion option to mount: https://man7.org/linux/man-pages/man8/mount.8.html#FILESYSTEM-INDEPENDENT_MO...
It appears that noiversion is the default (though the man page doesn't say so) on ext4 & btrfs, unless my experimentation below is completely off the mark (or something changed recently that my system hasn't picked up yet):
$ touch file
$ lsattr -v file
628580 -------------------- file
## metadata-only change:
$ touch file
$ lsattr -v file
628580 -------------------- file
## ^ no change...
## data change:
$ echo test > file
$ lsattr -v file
628580 -------------------- file
## ^ still no change
$ rm file
$ touch file
$ lsattr -v file
628582 -------------------- file
## ^ now different
On Fri, Jun 26, 2020 at 11:30 AM Chris Adams linux@cmadams.net wrote:
Once upon a time, Ben Cotton bcotton@redhat.com said:
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
So... I freely admit I have not looked closely at btrfs in some time, so I could be out of date (and my apologies if so). One issue that I have seen mentioned as an issue within the last week is still the problem of running out of space when it still looks like there's space free. I didn't read the responses, so not sure of the resolution, but I remember that being a "thing" with btrfs. Is that still the case? What are the causes, and if so, how can we keep from getting a lot of the same question on mailing lists/forums/etc.?
There's different kinds of enospc. Most have been bugs that are long since fixed (practically the entire enospc infrastructure was rewritten circa 2016, which itself then exposed some prior fixes that became new bugs). But like any bug we'd need to see more info: kernel version and the conditions. That's just sorta how a bug hunt goes, and they're all tedious.
I haven't experienced enospc on btrfs in years. I also do not do any special incantations or scheduled maintenance to avoid it. I just use the file system normally, and if I were to hit enospc I'd collect a bunch of info and file a good bug report. And I know some folks aren't into that, and that's fine.
In the cases I've seen elsewhere, the typical manifestation is a one-off transient enospc: you try again and things are fine. The file system stays read-write, and there is no other side effect. Since Btrfs is copy-on-write it takes free space to delete files. A file deletion *consumes* free space in the form of metadata being read, modified to indicate the delete, and then written. Only after that succeeds is space freed up as a result of the delete.
I have a few file systems I intentionally keep around at 99% full just because I'm a file system saboteur but also if I hit enospc, it's still a bug. I'm still gonna report it. I'm not going to make excuses like "oh it's 99% full, no wonder!" We'll probably hit some enospc edge cases somewhere in Fedora. It does even happen on ext4 and xfs sometimes, though less common because they don't need free space to update metadata.
My inclination is always to not fix problems right away, but to collect data and report them. Still, it's valid to say "no thanks, what is the incantation to fix it, please? i have work to do, not bug reports" should you run into it. And yes, there are such incantations, and they can be done online without repairs or reboots.
I remember when btrfs was going to be the one FS to rule them all, but then had issues, and specific weird cases (like with VM images IIRC at one point),
nodatacow on VM images, whether raw or qcow2, is still recommended. This can be done with 'chattr +C' either on the file while it is still zero length, or on the directory, in which case it's inherited (by new files as they are created or copied in). So the details of how to do this, and who does it, are very much on-topic and an open question.
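As a rough illustration of the directory variant, here is a minimal Python sketch. The helper names (`nocow_command`, `prepare_image_dir`) and the `apply` switch are made up for this example; the only real tool involved is `chattr +C`, exactly as mentioned above. It assumes the directory is prepared before any images are placed in it, and it only runs `chattr` when explicitly asked to.

```python
from __future__ import annotations

import os
import subprocess
import tempfile


def nocow_command(directory: str) -> list[str]:
    """Build the chattr invocation that sets the +C attribute on a directory."""
    return ["chattr", "+C", directory]


def prepare_image_dir(directory: str, apply: bool = False) -> list[str]:
    """Create the directory if needed and optionally set +C on it.

    New files created or copied into the directory afterwards inherit
    the attribute, so this should happen before images are stored there.
    """
    os.makedirs(directory, exist_ok=True)
    cmd = nocow_command(directory)
    if apply:
        # Assumption: the directory lives on a filesystem that supports +C
        # (such as Btrfs). Elsewhere chattr exits non-zero and check=True
        # raises CalledProcessError.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run in a scratch directory: only print the command.
    with tempfile.TemporaryDirectory() as tmp:
        print(" ".join(prepare_image_dir(os.path.join(tmp, "images"))))
```

Whether something like this belongs in the installer, in libvirtd, or nowhere at all is the open question; the sketch only shows how small the operation is.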
Should the installer set +C on /var/lib/libvirt/images? Or should libvirtd?
What happens if you don't? Well depending on the guest, the image can get pretty fragmented. Curiously the *least* problematic combination is btrfs guest on raw on btrfs. And the worst is NTFS guest on qcow2 on btrfs (in a week, 51000+ extents for that one file). I can't say I notice the performance hit anecdotally but would a benchmark prove it's slower? Maybe, probably. That's a lot of extent tracking and it's only one week of aging.
Technically it's in the category of a recommended optimization. The community can escalate this and say, it's a must have. Same for the compression option in the proposal. Same for anything really, it's a discussion.
Well, I have used btrfs successfully on Fedora for quite a while, and I think it would make much sense for Fedora to start using btrfs by default.
I see no one has mentioned it yet: BTRFS is slow on HDDs. It trivially comes from BTRFS being COW. So if you change a bit in a file, BTRFS will copy a block (or maybe a number of them, not sure this detail matters) to another place, and now your data is fragmented. SSDs may not care; HDDs, on the other hand, do.
There's a defrag option, but this means that at random times BTRFS will hog your IO. And HDDs don't really have much IO headroom to give up.
Another reason worth mentioning: BTRFS per se is slow. If you look at benchmarks on Phoronix comparing BTRFS with others, BTRFS is rarely even on par with them.
As a matter of fact, I have two Archlinux laptops on BTRFS with compression, both of which only have an HDD. I've been using BTRFS there for 3-4 years, I think, maybe more. I chose BTRFS because I was hoping that using ZSTD would result in less IO. Well, now my overall experience is that it is not rare for the systems to start lagging terribly; then I execute `grep "" /proc/pressure/*` and see that something is hogging IO. Then I pop up `iotop -a` and see, among various processes, `[btrfs-cleaner]` and `[btrfs-transacti]`. It may be because of the defrag option, I'm not sure…
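For anyone wanting to capture the same signal in a script rather than by eye, here is a small Python sketch that parses the pressure-stall format printed by the `grep` above. The sample numbers are invented for illustration, and the 10% threshold in `io_is_stalled` is an arbitrary assumption, not a recommended value.

```python
import os

# Invented sample in the format of /proc/pressure/io (not real measurements).
SAMPLE = (
    "some avg10=12.50 avg60=4.20 avg300=1.05 total=123456\n"
    "full avg10=8.00 avg60=2.10 avg300=0.50 total=65432\n"
)


def parse_pressure(text):
    """Parse a /proc/pressure/* file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=12.50"; store every value as a float.
        result[kind] = {key: float(value)
                        for key, value in (f.split("=", 1) for f in fields)}
    return result


def io_is_stalled(pressure, threshold=10.0):
    """True if tasks were stalled on IO for at least `threshold` percent
    of the last 10 seconds (the threshold is a hypothetical choice)."""
    return pressure["some"]["avg10"] >= threshold


if __name__ == "__main__":
    path = "/proc/pressure/io"
    if os.path.exists(path):
        with open(path) as f:
            text = f.read()
    else:
        text = SAMPLE  # fall back to the invented sample
    pressure = parse_pressure(text)
    print(pressure, "stalled:", io_is_stalled(pressure))
```

This only says that something is waiting on IO, not who is causing it; attributing it to a particular process still needs a tool like `iotop -a`.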
On Sat, 2020-06-27 at 17:00 +0300, Konstantin Kharlamov wrote:
Another reason worth mentioning: BTRFS per se is slow. If you look at benchmarks on Phoronix comparing BTRFS with others, BTRFS is rarely even on par with them.
Btw, I should also add here: it may be clear that in an ideal situation BTRFS will always be slower than non-COW file systems. The problem, however, is that it is not even on par with the other open-source COW file system, which is ZFS.
Some months ago at my dayjob I was performing benchmarks, and out of curiosity I also compared latest released (as of then, it was 5.6 kernel) BTRFS with latest master of ZFS (which was of a commit b29e31d80 and a kernel 5.4).
The setup was a RAID5 on 10 SSDs, and a benchmark was three 20-minutes long runs of vdbench with random 70% reads and random 30% writes. For BTRFS I also used `space_cache=v2` mount option. Results were:
FS | run 1, IOPS | run 2, IOPS | run 3, IOPS
BTRFS | 65723.9 | 56474.5 | 55090.2
ZFS | 96846.1 | 79797.9 | 76249.4
---------
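To put the three runs side by side, here is a short Python sketch that computes the per-run ZFS/BTRFS ratio and the mean IOPS from the figures quoted above. Nothing here is new data; the function and variable names are just for this example.

```python
from statistics import mean

# IOPS figures copied from the results above (runs 1, 2, 3).
RESULTS = {
    "BTRFS": [65723.9, 56474.5, 55090.2],
    "ZFS": [96846.1, 79797.9, 76249.4],
}


def per_run_ratio(numerator, denominator):
    """Ratio of two result series, run by run."""
    return [n / d for n, d in zip(numerator, denominator)]


if __name__ == "__main__":
    ratios = per_run_ratio(RESULTS["ZFS"], RESULTS["BTRFS"])
    for run, ratio in enumerate(ratios, start=1):
        print(f"run {run}: ZFS/BTRFS = {ratio:.2f}")
    for fs, runs in RESULTS.items():
        print(f"{fs}: mean {mean(runs):.1f} IOPS over {len(runs)} runs")
```

The arithmetic says nothing about whether a RAID5 random-IO workload is representative of a desktop; it only summarizes the numbers as given.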
So, summing up this and my previous mail overall, I do not think that for ordinary desktop BTRFS is currently any good, compared to EXT4 or XFS.
On Sat, Jun 27, 2020 at 11:59 AM Konstantin Kharlamov hi-angel@yandex.ru wrote:
On Sat, 2020-06-27 at 17:00 +0300, Konstantin Kharlamov wrote:
Another reason worth mentioning: BTRFS per se is slow. If you look at benchmarks on Phoronix comparing BTRFS with others, BTRFS is rarely even on par with them.
Btw, I should also add here: it may be clear that in an ideal situation BTRFS will always be slower than non-COW file systems. The problem, however, is that it is not even on par with the other open-source COW file system, which is ZFS.
Some months ago at my dayjob I was performing benchmarks, and out of curiosity I also compared latest released (as of then, it was 5.6 kernel) BTRFS with latest master of ZFS (which was of a commit b29e31d80 and a kernel 5.4).
The setup was a RAID5 on 10 SSDs, and a benchmark was three 20-minutes long runs of vdbench with random 70% reads and random 30% writes. For BTRFS I also used `space_cache=v2` mount option. Results were:
FS | run 1, IOPS | run 2, IOPS | run 3, IOPS
BTRFS | 65723.9 | 56474.5 | 55090.2
ZFS | 96846.1 | 79797.9 | 76249.4
For the sake of the argument I will accept the above as facts.
But raid5 correlates to the desktop how? Does my desktop workload need 96K IOPS? Do I notice the difference even if it can be measured? And the vdbench command used is? And this particular vdbench command produces a benchmark that mimics what workload found on the desktop? There is no hand waving away the relevance of these questions if you're going to propose performance benchmarks are relevant.
So, summing up this and my previous mail overall, I do not think that for ordinary desktop BTRFS is currently any good, compared to EXT4 or XFS.
Not persuasive.
I think your argument is improved if you say we need more and better benchmarking that mimics the actual workloads we care about; or times a set of test cases that people can try themselves and reproduce.
I use a variety of OS's on a variety of hardware, including the laptop I'm using now which is Fedora and Windows dual boot and I'm not suspicious of performance issues of any kind: Fedora's faster, period. Might it seem even faster if it were ext4? I've done it, and I don't think so. So how do you produce a benchmark that accounts for the user's perception of performance, rather than just raw performance? Because maybe btrfs is faster. Maybe it's slower. But does it matter?
If it took 5 seconds for GNOME Terminal to launch I'd be mad. 21 seconds is just nonsense. So yes it does matter, but what's the threshold at which it matters? These benchmarks aren't capturing either the reality or the feeling. So we need better (more relevant) benchmarks to have a proper discussion.
For the vast majority of things I'm doing, even if it were the case that some things are slower, *I* am still much slower than whatever extra latency there may be. And same goes for the reverse, if btrfs compression makes some things slightly faster, will I notice? I don't know. But is that the only metric? I know for sure my hardware is doing far fewer writes (and reads for that matter), so there is less wear and tear on the hardware. Is that saving me 50 cents or $50 over the life of the hardware? I don't know but I do know it's better than no compression, overall.
On Sat, 2020-06-27 at 13:34 -0600, Chris Murphy wrote:
On Sat, Jun 27, 2020 at 11:59 AM Konstantin Kharlamov hi-angel@yandex.ru wrote:
On Sat, 2020-06-27 at 17:00 +0300, Konstantin Kharlamov wrote:
Another reason worth mentioning: BTRFS per se is slow. If you look at benchmarks on Phoronix comparing BTRFS with others, BTRFS is rarely even on par with them.
Btw, I should also add here: it may be clear that in an ideal situation BTRFS will always be slower than non-COW file systems. The problem, however, is that it is not even on par with the other open-source COW file system, which is ZFS.
Some months ago at my dayjob I was performing benchmarks, and out of curiosity I also compared latest released (as of then, it was 5.6 kernel) BTRFS with latest master of ZFS (which was of a commit b29e31d80 and a kernel 5.4).
The setup was a RAID5 on 10 SSDs, and a benchmark was three 20-minutes long runs of vdbench with random 70% reads and random 30% writes. For BTRFS I also used `space_cache=v2` mount option. Results were:
FS | run 1, IOPS | run 2, IOPS | run 3, IOPS
BTRFS | 65723.9 | 56474.5 | 55090.2
ZFS | 96846.1 | 79797.9 | 76249.4
For the sake of the argument I will accept the above as facts.
But raid5 correlates to the desktop how? Does my desktop workload need 96K IOPS? Do I notice the difference even if it can be measured? And the vdbench command used is? And this particular vdbench command produces a benchmark that mimics what workload found on the desktop? There is no hand waving away the relevance of these questions if you're going to propose performance benchmarks are relevant.
Fair enough.
So, summing up this and my previous mail overall, I do not think that for ordinary desktop BTRFS is currently any good, compared to EXT4 or XFS.
Not persuasive.
I think your argument is improved if you say we need more and better benchmarking that mimics the actual workloads we care about; or times a set of test cases that people can try themselves and reproduce.
I use a variety of OS's on a variety of hardware, including the laptop I'm using now which is Fedora and Windows dual boot and I'm not suspicious of performance issues of any kind: Fedora's faster, period. Might it seem even faster if it were ext4? I've done it, and I don't think so. So how do you produce a benchmark that accounts for the user's perception of performance, rather than just raw performance? Because maybe btrfs is faster. Maybe it's slower. But does it matter?
If it took 5 seconds for GNOME Terminal to launch I'd be mad. 21 seconds is just nonsense. So yes it does matter, but what's the threshold at which it matters? These benchmarks aren't capturing either the reality or the feeling. So we need better (more relevant) benchmarks to have a proper discussion.
For the vast majority of things I'm doing, even if it were the case that some things are slower, *I* am still much slower than whatever extra latency there may be. And same goes for the reverse, if btrfs compression makes some things slightly faster, will I notice? I don't know. But is that the only metric? I know for sure my hardware is doing far fewer writes (and reads for that matter), so there is less wear and tear on the hardware. Is that saving me 50 cents or $50 over the life of the hardware? I don't know but I do know it's better than no compression, overall.
Thank you, I see your point. Since it seems to also be present in the reply to my other mail, let's continue there.
On Sat, Jun 27, 2020 at 8:58 pm, Konstantin Kharlamov hi-angel@yandex.ru wrote:
Btw, I should also add here: it may be clear that in an ideal situation BTRFS will always be slower than non-COW file systems. The problem, however, is that it is not even on par with the other open-source COW file system, which is ZFS.
Setting aside concerns about how RAID tests are not relevant to desktop usage (nobody is proposing a change to Fedora Server)... ZFS is not an option for Fedora at all due to its license, so there was not really much point in this comparison, was there? For the sake of keeping this discussion reasonable, let's limit our comparisons to filesystems that are within the realm of possibility (ext4, xfs, btrfs).
On Sat, Jun 27, 2020 at 8:01 AM Konstantin Kharlamov hi-angel@yandex.ru wrote:
I see no one has mentioned it yet: BTRFS is slow on HDDs. It trivially comes from BTRFS being COW. So if you change a bit in a file, BTRFS will copy a block (or maybe a number of them, not sure this detail matters) to another place, and now your data is fragmented. SSDs may not care; HDDs, on the other hand, do.
It's faster on some workloads, slower on others. There are optimizations to help make up for COW: inline extents for small files, and random writes that commit together (i.e. in the same 30s window) will be written as sequential writes. It is true btrfs does not have nearly as many locality optimizations as ext4 and xfs, but at least xfs developers have recently proposed removing those HDD optimizations in favor of optimizations that are more relevant to today's hardware and workloads.
Another reason worth mentioning: BTRFS per se is slow. If you look at benchmarks on Phoronix comparing BTRFS with others, BTRFS is rarely even on par with them.
It wins some. It loses others. Head over to the xfs list and enjoy the benchmark commentary from people who actually understand benchmarking. A recurring theme is that a benchmark is only relevant to the degree that it actually mimics the workload you care about. And most benchmark tools don't do that very well.
Here's a benchmark that's apples to apples because I'm merely timing the time to compile the exact same thing each time, twice. https://docs.google.com/spreadsheets/d/1b-y2WVrQK4ijo1TS5aRe0QROSf8CU3ckTiPQ...
They're all in the same ballpark, except there's a write time hit for the one with zstd:1 on this particular setup (and the compression hit isn't consistent across all hardware or setups, it's case by case - and hence the proposal option for compression indicates applying it selectively to locations we know there's a benefit across the board). But also you can tell there's no read time (decompression) hit from this same data set.
Meanwhile, this is somewhere between embarrassing and comedy: https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems...
Hmmm, 21 seconds to launch GNOME Terminal with an NVMe and you aren't curious about what went wrong? Because obviously something is wrong. The measurement is wrong or the method is wrong or something in the setup is fouling things up. How do you get a fast result with SSD but then such a slow result with NVMe?
It makes no sense, but meh, we'll just publish that shit anyway! LOLZ! And that is how you light your credibility on fire, because you just don't give a crap about it.
On my 9 year old laptop with a mere Samsung 840 EVO, barely under 1 second for GNOME Terminal to launch, following a reboot and login so this is not the result of caching. On my much newer HP Spectre with NVMe, under 0.5s to launch.
My methodology and metrology? I'm using the "one mississippi" method from finger click of the actual app icon to the time I see a cursor in the launched app. Not rocket science.
As a matter of fact, I have two Archlinux laptops on BTRFS with compression, both of which only have an HDD. I've been using BTRFS there for 3-4 years, I think, maybe more. I chose BTRFS because I was hoping that using ZSTD would result in less IO. Well, now my overall experience is that it is not rare for the systems to start lagging terribly; then I execute `grep "" /proc/pressure/*` and see that something is hogging IO. Then I pop up `iotop -a` and see, among various processes, `[btrfs-cleaner]` and `[btrfs-transacti]`. It may be because of the defrag option, I'm not sure…
There are many btrfs threads. Those actually make it more performant. If you look at their total cpu time though, e.g. ps aux, you'll see it's really small compared to most anything else you might think is idle.
root 366 0.0 0.0 0 0 ? S Jun25 1:22 [btrfs-transacti]
root 500 0.0 0.0 0 0 ? S Jun25 1:45 [irq/135-iwlwifi]
dbus 538 0.0 0.0 271548 6968 ? S Jun25 1:13 dbus-broker --log 4 --controller 9 --machine-id ce3f1eade82d42bd891a8c15714b13cf --max-bytes 536870912 --m
root 1328 0.0 0.1 1273476 10116 ? Sl Jun25 3:00 /opt/teamviewer/tv_bin/teamviewerd -d
There is in fact a WTF moment as a result of this partial listing and it's not btrfs.
BTW this is 2 days of uptime.
On Sat, 2020-06-27 at 12:42 -0600, Chris Murphy wrote:
On Sat, Jun 27, 2020 at 8:01 AM Konstantin Kharlamov hi-angel@yandex.ru wrote:
I see no one has mentioned it yet: BTRFS is slow on HDDs. It trivially comes from BTRFS being COW. So if you change a bit in a file, BTRFS will copy a block (or maybe a number of them, not sure this detail matters) to another place, and now your data is fragmented. SSDs may not care; HDDs, on the other hand, do.
It's faster on some workloads, slower on others. There are optimizations to help make up for COW: inline extents for small files, and random writes that commit together (i.e. in the same 30s window) will be written as sequential writes. It is true btrfs does not have nearly as many locality optimizations as ext4 and xfs, but at least xfs developers have recently proposed removing those HDD optimizations in favor of optimizations that are more relevant to today's hardware and workloads.
Another reason worth mentioning: BTRFS per se is slow. If you look at benchmarks on Phoronix comparing BTRFS with others, BTRFS is rarely even on par with them.
It wins some. It loses others.
This sounds very wrong. This deludes readers into thinking BTRFS is on par with other FSes. If you head over to the Phoronix article you linked below and try to count how many times BTRFS was winning/on par/lost, you'll see the ratio is not even close on the BTRFS side.
To save you the effort, it is:
type | win | on par | lose
NVMe: | 3 | 4 | 8
SATA SSD: | 0 | 5 | 10
USB SSD: | 0 | 1 | 4
FYI, in this calculation I sided with BTRFS a few times and counted a result as either "winning" or "on par" even when it was ahead of only some of the other FSes. (Idk why "USB SSD" has many tests missing.)
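For a quick overall picture, the tally can be summed across device types. The Python sketch below uses only the counts listed above; the names are illustrative.

```python
# (win, on par, lose) counts per device type, copied from the tally above.
TALLY = {
    "NVMe": (3, 4, 8),
    "SATA SSD": (0, 5, 10),
    "USB SSD": (0, 1, 4),
}


def totals(tally):
    """Column-wise sums: total wins, total on-par results, total losses."""
    return tuple(sum(column) for column in zip(*tally.values()))


def shares(counts):
    """Each count as a fraction of all counted tests."""
    total = sum(counts)
    return tuple(count / total for count in counts)


if __name__ == "__main__":
    overall = totals(TALLY)
    for label, count, share in zip(("win", "on par", "lose"),
                                   overall, shares(overall)):
        print(f"{label}: {count} ({share:.0%})")
```

The counts inherit whatever judgment calls went into classifying each test, so the percentages are only as good as that classification.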
Head over to the xfs list and enjoy the benchmark commentary from people who actually understand benchmarking. A recurring theme is that a benchmark is only relevant to the degree that it actually mimics the workload you care about. And most benchmark tools don't do that very well.
Here's a benchmark that's apples to apples because I'm merely timing the time to compile the exact same thing each time, twice.
https://docs.google.com/spreadsheets/d/1b-y2WVrQK4ijo1TS5aRe0QROSf8CU3ckTiPQ...
What point are you trying to make here? If you're implying that the "application startup time" the article measured is more of a "synthetic test" than the kernel compilation time you are measuring, then this sounds odd, because people start apps more often than they compile the kernel. In fact, the compilation process includes starting up apps.
They're all in the same ballpark, except there's a write time hit for the one with zstd:1 on this particular setup (and the compression hit isn't consistent across all hardware or setups, it's case by case - and hence the proposal option for compression indicates applying it selectively to locations we know there's a benefit across the board). But also you can tell there's no read time (decompression) hit from this same data set.
It is nice to see, although I'm pretty surprised they all have the same performance, except the one with compression. Could it be because all files got cached in RAM? If you did test by doing `git clone` and then running the build, then I'm pretty sure it did. I don't know how it works when files are cached, but I wouldn't be surprised if a number of filesystem-specific paths would be skipped in this case.
Meanwhile, this is somewhere between embarrassing and comedy: https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems...
Hmmm, 21 seconds to launch GNOME Terminal with an NVMe and you aren't curious about what went wrong? Because obviously something is wrong. The measurement is wrong or the method is wrong or something in the setup is fouling things up. How do you get a fast result with SSD but then such a slow result with NVMe?
It makes no sense, but meh, we'll just publish that shit anyway! LOLZ! And that is how you light your credibility on fire, because you just don't give a crap about it.
You misread it, the NVMe startup time is 1.03sec. The 21.01sec. time is SATA 3.0 SSD. No need to swear.
Not to say it is not odd compared to other results, but we can only guess.
On my 9 year old laptop with a mere Samsung 840 EVO, barely under 1 second for GNOME Terminal to launch, following a reboot and login so this is not the result of caching. On my much newer HP Spectre with NVMe, under 0.5s to launch.
My methodology and metrology? I'm using the "one mississippi" method from finger click of the actual app icon to the time I see a cursor in the launched app. Not rocket science.
Good for you. But you're trying to make a decision for everyone else, so you need to take into account that not everyone has NVMe or SSD. HDDs, which many people still use, are much slower. This means your "1 second vs 0.5 second" can easily turn into "5 seconds vs 10 seconds" (and not necessarily linearly).
As a matter of fact, I have two Archlinux laptops on BTRFS with compression, both of which only have an HDD. I've been using BTRFS there for 3-4 years, I think, maybe more. I chose BTRFS because I was hoping that using ZSTD would result in less IO. Well, now my overall experience is that it is not rare for the systems to start lagging terribly; then I execute `grep "" /proc/pressure/*` and see that something is hogging IO. Then I pop up `iotop -a` and see, among various processes, `[btrfs-cleaner]` and `[btrfs-transacti]`. It may be because of the defrag option, I'm not sure…
There are many btrfs threads. Those actually make it more performant. If you look at their total cpu time though, e.g. ps aux, you'll see it's really small compared to most anything else you might think is idle.
root 366 0.0 0.0 0 0 ? S Jun25 1:22 [btrfs-transacti]
root 500 0.0 0.0 0 0 ? S Jun25 1:45 [irq/135-iwlwifi]
dbus 538 0.0 0.0 271548 6968 ? S Jun25 1:13 dbus-broker --log 4 --controller 9 --machine-id ce3f1eade82d42bd891a8c15714b13cf --max-bytes 536870912 --m
root 1328 0.0 0.1 1273476 10116 ? Sl Jun25 3:00 /opt/teamviewer/tv_bin/teamviewerd -d
There is in fact a WTF moment as a result of this partial listing and it's not btrfs.
BTW this is 2 days of uptime.
You misread me, I wasn't talking about CPU time, I was talking about IO.
On Sat, Jun 27, 2020 at 4:30 PM Konstantin Kharlamov hi-angel@yandex.ru wrote:
On Sat, 2020-06-27 at 12:42 -0600, Chris Murphy wrote:
https://docs.google.com/spreadsheets/d/1b-y2WVrQK4ijo1TS5aRe0QROSf8CU3ckTiPQ...
What point are you trying to make here? If you're implying that the "application startup time" the article measured is more of a "synthetic test" than the kernel compilation time you are measuring, then this sounds odd, because people start apps more often than they compile the kernel. In fact, the compilation process includes starting up apps.
Developers are a target market for Fedora on the desktop. Developers compile locally. The point is compiling the same thing on different file systems shows, reproducibly, they're all the same ballpark.
They're all in the same ballpark, except there's a write time hit for the one with zstd:1 on this particular setup (and the compression hit isn't consistent across all hardware or setups, it's case by case - and hence the proposal option for compression indicates applying it selectively to locations we know there's a benefit across the board). But also you can tell there's no read time (decompression) hit from this same data set.
It is nice to see, although I'm pretty surprised they all have the same performance, except the one with compression. Could it be because all files got cached in RAM?
Reboot between each test. Each test gets a clean copy of the source to compile, setup prior to the reboot.
Meanwhile, this is somewhere between embarrassing and comedy: https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems...
Hmmm, 21 seconds to launch GNOME Terminal with an NVMe and you aren't curious about what went wrong? Because obviously something is wrong. The measurement is wrong or the method is wrong or something in the setup is fouling things up. How do you get a fast result with SSD but then such a slow result with NVMe?
It makes no sense, but meh, we'll just publish that shit anyway! LOLZ! And that is how you light your credibility on fire, because you just don't give a crap about it.
You misread it, the NVMe startup time is 1.03sec. The 21.01sec. time is SATA 3.0 SSD. No need to swear.
I'm sorry, I'm referring to the article as being not credible. The "you" is not directed at YOU. That's sloppy writing on my part.
I don't have an explanation for the enormous difference, just that I can't reproduce these numbers, and they don't make sense. The measurement method is wrong, or there's something pathological with the set up that's causing just btrfs to be really slow on SSDs. Could it be scheduler related? Maybe, no idea. At least on Fedora we are using different schedulers for NVMe and SSDs, even though these tests aren't on Fedora.
Not to say it is not odd compared to other results, but we can only guess.
My complaint is that it doesn't make sense, and that it's not explored.
Random guy on a list (me) complains about Phoronix benchmarks = not news.
Good for you. But you're trying to make a decision for everyone else, so you need to take into account that not everyone has NVMe or SSD. HDDs, which many people still use, are much slower. This means your "1 second vs 0.5 second" can easily turn into "5 seconds vs 10 seconds" (and not necessarily linearly).
I'm not making any claims about sysroot on HDD.
You misread me, I wasn't talking about CPU time, I was talking about IO.
:P Clearly I should NOT get into discussions about benchmarks. I find them annoying, and mostly unhelpful, obviously.
On Sat, 2020-06-27 at 18:41 -0600, Chris Murphy wrote:
On Sat, Jun 27, 2020 at 4:30 PM Konstantin Kharlamov hi-angel@yandex.ru wrote:
Good for you. But you're trying to make a decision for everyone else, so you need to take into account that not everyone has NVMe or SSD. HDDs, which many people still use, are much slower. This means your "1 second vs 0.5 second" can easily turn into "5 seconds vs 10 seconds" (and not necessarily linearly).
I'm not making any claims about sysroot on HDD.
Okay, in this case, unless benchmarks proving BTRFS to be performant enough on HDD are provided, I suggest the proposal be modified to exclude HDDs from being considered as a BTRFS target.
FWIW, I was just thinking about it, and I came up with an example you may like which shows exactly why BTRFS is bad for HDD. Consider the development process. It includes rewriting source files over and over: you do `git checkout foo` and files are overwritten, you change a file in a text editor, and it gets overwritten. And since BTRFS is CoW, it will always write files to a new place. As a result, after some time, if you try to build the project, it is going to take much longer just because BTRFS has to read files from a bunch of different places, and HDDs are really bad at this.
If you take a non-CoW FS, this problem doesn't exist by design.
* Konstantin Kharlamov:
FWIW, I was just thinking about it, and I came up with example you may like which shows exactly why BTRFS is bad for HDD. Consider development process. It includes rewriting source files over and over: you do `git checkout foo` and files are overwritten, you change a file in text editor, and it gets overwritten. And since BTRFS is CoW, it will always write files to a new place.
Editors that make a backup copy typically do not overwrite files in place. They rename the file to the backup location and then write the new file.
git checkout unlinks changed files first, before writing them anew from scratch.
A COW file system does not make a difference for these use cases because there is already COW at the application level.
The GNU assembler truncates the output object file first. On XFS, that triggers relocation to a new file system location as well, even if the output file size (or contents) does not change. So that scenario is essentially COW as well today.
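The editor case described above can be observed from userspace. The following Python sketch mimics an editor that renames the old file to a backup and then writes a new one (the `~` backup suffix and the function names are assumptions for illustration, not any particular editor's behaviour). Comparing inode numbers shows that the saved file is a brand-new file on any filesystem, not an in-place overwrite.

```python
import os
import tempfile


def inode(path):
    """Inode number of the file currently at `path`."""
    return os.stat(path).st_ino


def save_with_backup(path, data):
    """Save like a backup-making editor: move the old file aside, write anew."""
    backup = path + "~"  # hypothetical backup naming convention
    os.replace(path, backup)      # old contents keep their inode under a new name
    with open(path, "wb") as f:   # a new file (new inode) is created for the new contents
        f.write(data)
    return backup


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "main.c")
        with open(source, "wb") as f:
            f.write(b"int main(void) { return 0; }\n")
        before = inode(source)
        backup = save_with_backup(source, b"int main(void) { return 1; }\n")
        print("backup kept the old inode:", inode(backup) == before)
        print("saved file has a new inode:", inode(source) != before)
```

Where the filesystem places the blocks of the new file is its own decision; the sketch only demonstrates that the application never asked for an in-place overwrite to begin with.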
On Thu, 2020-07-02 at 09:44 +0200, Florian Weimer wrote:
- Konstantin Kharlamov:
FWIW, I was just thinking about it, and I came up with example you may like which shows exactly why BTRFS is bad for HDD. Consider development process. It includes rewriting source files over and over: you do `git checkout foo` and files are overwritten, you change a file in text editor, and it gets overwritten. And since BTRFS is CoW, it will always write files to a new place.
Editors that make a backup copy typically do not overwrite files in place. They rename the file to the backup location and then write the new file.
git checkout unlinks changed files first, before writing them anew from scratch.
A COW file system does not make a difference for these use cases because there is already COW at the application level.
The GNU assembler truncates the output object file first. On XFS, that triggers relocation to a new file system location as well, even if the output file size (or contents) does not change. So that scenario is essentially COW as well today.
Per my understanding what happens when you write a new file and delete an old one is that a block that old file was taking gets freed.
Then, if you copy the file again, file system should find a free block to write this copy into. And this block likely would be the one that got freed previously.
So, well, it is indeed COW, but not the one BTRFS does. It's a COW that copies a file back and forth between two blocks :) This is kinda HDD-friendly COW :)
BTRFS on the other hand will not rewrite older block unless it's out of new ones.
On Thu, 2020-07-02 at 21:37 +0300, Konstantin Kharlamov wrote:
On Thu, 2020-07-02 at 09:44 +0200, Florian Weimer wrote:
- Konstantin Kharlamov:
FWIW, I was just thinking about it, and I came up with example you may like which shows exactly why BTRFS is bad for HDD. Consider development process. It includes rewriting source files over and over: you do `git checkout foo` and files are overwritten, you change a file in text editor, and it gets overwritten. And since BTRFS is CoW, it will always write files to a new place.
Editors that make a backup copy typically do not overwrite files in place. They rename the file to the backup location and then write the new file.
git checkout unlinks changed files first, before writing them anew from scratch.
A COW file system does not make a difference for these use cases because there is already COW at the application level.
The GNU assembler truncates the output object file first. On XFS, that triggers relocation to a new file system location as well, even if the output file size (or contents) does not change. So that scenario is essentially COW as well today.
Per my understanding what happens when you write a new file and delete an old one is that a block that old file was taking gets freed.
Then, if you copy the file again, file system should find a free block to write this copy into. And this block likely would be the one that got freed previously.
So, well, it is indeed COW, but not the one BTRFS does. It's a COW that copies a file back and forth between two blocks :) This is kinda HDD-friendly COW :)
BTRFS on the other hand will not rewrite older block unless it's out of new ones.
Just to clarify: I do not claim this is how ext4 or xfs works. This simplistic explanation is just something obvious regarding how a non-COW fs would work, but of course there can be reasons for them to behave differently. If someone knows better, they're welcome. What I do know though, is how a COW FS works, because I did work a little with ZFS at dayjob.
On Friday, June 26, 2020 7:42:25 AM MST Ben Cotton wrote:
https://fedoraproject.org/wiki/Changes/BtrfsByDefault
== Summary ==
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
== Owners ==
- Names: [[User:Chrismurphy|Chris Murphy]], [[User:Ngompa|Neal Gompa]], [[User:Josef|Josef Bacik]], [[User:Salimma|Michel Alexandre Salim]], [[User:Dcavalca|Davide Cavalca]], [[User:eeickmeyer|Erich Eickmeyer]], [[User:ignatenkobrain|Igor Raits]], [[User:Raveit65|Wolfgang Ulbrich]], [[User:Zsun|Zamir SUN]], [[User:rdieter|Rex Dieter]], [[User:grinnz|Dan Book]], [[User:nonamedotc|Mukundan Ragavan]]
- Emails: chrismurphy@fedoraproject.org, ngompa13@gmail.com,
josef@toxicpanda.com, michel@michel-slm.name, dcavalca@fb.com, erich@ericheickmeyer.com, ignatenkobrain@fedoraproject.org, fedora@raveit.de, zsun@fedoraproject.org, rdieter@gmail.com, grinnz@gmail.com, nonamedotc@gmail.com
- Products: All desktop editions, spins, and labs
- Responsible WGs: Workstation Working Group, KDE Special Interest Group
== Detailed Description ==
Fedora desktop edition/spin variants will switch to using Btrfs as the filesystem by default for new installs. Labs derived from these variants inherit this change, and other editions may opt into this change.
The change is based on the installer's custom partitioning Btrfs preset. It's been well tested for 7 years.
'''''Current partitioning'''''<br /> <span style="color: tomato">vg/root</span> LV mounted at <span style="color: tomato">/</span> and a <span style="color: tomato">vg/home</span> LV mounted at <span style="color: tomato">/home</span>. These are separate file system volumes, with separate free/used space.
'''''Proposed partitioning'''''<br /> <span style="color: tomato">root</span> subvolume mounted at <span style="color: tomato">/</span> and <span style="color: tomato">home</span> subvolume mounted at <span style="color: tomato">/home</span>. Subvolumes don't have size, they act mostly like directories, space is shared.
'''''Unchanged'''''<br /> <span style="color: tomato">/boot</span> will be a small ext4 volume. A separate boot is needed to boot dm-crypt sysroot installations; it's less complicated to keep the layout the same, regardless of whether sysroot is encrypted. There will be no automatic snapshots/rollbacks.
If you select to encrypt your data, LUKS (dm-crypt) will be still used as it is today (with the small difference that Btrfs is used instead of LVM+Ext4). There is upstream work on getting native encryption for Btrfs that will be considered once ready and is subject of a different change proposal in a future Fedora release.
=== Optimizations (Optional) ===
The detailed description above is the proposal. It's intended to be a minimalist and transparent switch. It's also the same as was [[Features/F16BtrfsDefaultFs|proposed]] (and [https://lwn.net/Articles/446925/ accepted]) for Fedora 16. The following optimizations improve on the proposal, but are not critical. They are also transparent to most users. The general idea is to agree to the base proposal first, and then consider these as enhancements.
==== Boot on Btrfs ====
- Instead of a 1G ext4 boot, create a 1G Btrfs boot.
- Advantage: Makes it possible to include in a snapshot and rollback regime. GRUB has stable support for Btrfs for 10+ years.
- Scope: Contingent on bootloader and installer team review and approval. blivet should use <code>mkfs.btrfs --mixed</code>.
==== Compression ====
- Enable transparent compression using zstd on select directories: <span style="color: tomato">/usr</span> <span style="color: tomato">/var/lib/flatpak</span> <span style="color: tomato">~/.local/share/flatpak</span>
- Advantage: Saves space and significantly increases the lifespan of flash-based media by reducing write amplification. It may improve performance in some instances.
- Scope: Contingent on installer team review and approval to enhance anaconda to perform the installation using <code>mount -o compress=zstd</code>, then set the proper XATTR for each directory. The XATTR can't be set until after the directories are created via: rsync, rpm, or unsquashfs based installation.
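The per-directory step described above could be sketched as follows. This is only an illustration: the proposal says "the proper XATTR" without naming a tool, so using <code>btrfs property set</code> is an assumption, and <code>compression_commands</code> is a name invented here. The sketch builds the commands and prints them; it deliberately runs nothing.

```python
# Directories named in the proposal. "~" is left unexpanded on purpose,
# because the real path depends on which user's home is meant.
COMPRESSED_DIRS = ["/usr", "/var/lib/flatpak", "~/.local/share/flatpak"]


def compression_commands(directories, algorithm="zstd"):
    """Return one `btrfs property set` command per directory.

    Nothing is executed here; the caller decides whether and how to run
    the commands (for example with subprocess.run, as root).
    """
    return [
        ["btrfs", "property", "set", directory, "compression", algorithm]
        for directory in directories
    ]


if __name__ == "__main__":
    for command in compression_commands(COMPRESSED_DIRS):
        print(" ".join(command))
```

Keeping the commands as data makes the ordering constraint from the proposal easy to respect: the list can be generated up front and applied only once the directories exist.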
==== Additional subvolumes ====
- <span style="color: tomato">/var/log/</span> <span style="color: tomato">/var/lib/libvirt/images</span> and <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> will use separate subvolumes.
- Advantage: Makes it easier to exclude them from snapshots, rollbacks, and send/receive. (Btrfs snapshotting is not recursive, it stops at a nested subvolume.)
- Scope: Anaconda knows how to do this already, just change the kickstart to add additional subvolumes (minus the subvolume in <span style="color: tomato">~/</span>). GNOME Boxes will need enhancement to detect that the user home is on Btrfs and create <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> as a subvolume.
== Feedback ==
==== Red Hat doesn't support Btrfs? Can Fedora do this? ====
Red Hat supports Fedora well, in many ways. But Fedora already works closely with, and depends on, upstreams. And this will be one of them. That's an important consideration for this proposal. The community has a stake in ensuring it is supported. Red Hat will never support Btrfs if Fedora rejects it. Fedora necessarily needs to be first, and make the persuasive case that it solves more problems than alternatives. Feature owners believe it does, hands down.
The Btrfs community has users that have been using it for most of the past decade at scale. It's been the default on openSUSE (and SUSE Linux Enterprise) since 2014, and Facebook has been using it for all their OS and data volumes, in their data centers, for almost as long. Btrfs is a mature, well-understood, and battle-tested file system, used on both desktop/container and server/cloud use-cases. We do have developers of the Btrfs filesystem maintaining and supporting the code in Fedora, one is a Change owner, so issues that are pinned to Btrfs can be addressed quickly.
==== What about device-mapper alternatives? ====
dm-thin (thin provisioning): [https://pagure.io/fedora-workstation/issue/152 Issue #152] still happens, because the installer won't over provision by default. It still requires manual intervention by the user to identify and resolve the problem. Upon growing a file system on dm-thin, the pool is over committed, and file system sizes become a fantasy: they don't add up to the total physical storage available. The truth of used and free space is only known by the thin pool, and CLI and GUI programs are unprepared for this. Integration points like rpm free space checks or GNOME disk-space warnings would have to be adapted as well.
dm-vdo: is not yet merged, and isn't as straightforward to enable selectively per directory and per file as it is on Btrfs, for example by using <code>chattr +c</code> on <span style="color: tomato">/var/lib/flatpak/</span>.
Btrfs solves the problems that need solving, with few side effects or pitfalls for users. It has more features we can take advantage of immediately and transparently: compression, integrity, and IO isolation. Many Btrfs features and optimizations can be opted into selectively per directory or file, such as compression and nodatacow, rather than as a layer that's either on or off.
==== What about UI/UX and integration in the desktop? ====
If Btrfs isn't the default file system, there's no commitment, nor reason to work on any UI/UX integration. There are ideas to make certain features discoverable: selective compression; systemd-homed may take advantage of either Btrfs online resize, or near-term planned native encryption, which could make it possible to live convert non-encrypted homes to encrypted; and system snapshot and rollbacks.
Anaconda already has sophisticated Btrfs integration.
==== What Btrfs features are recommended and supported? ====
The primary goal of this feature is to be largely transparent to the user. It does not require or expect users to learn new commands, or to engage in peculiar maintenance rituals.
The full set of Btrfs features that is considered stable and enabled by default upstream will be enabled in Fedora. Fedora is a community project. What is supported within Fedora depends on what the community decides to put forward in terms of resources.
The upstream [https://btrfs.wiki.kernel.org/index.php/Status Btrfs feature status page].
==== Are subvolumes really mostly like directories? ====
Subvolumes behave like directories in terms of navigation in both the GUI and CLI, e.g. <code>cp</code>, <code>mv</code>, <code>du</code>, owner/permissions, and SELinux labels. They also share space, just like a directory.
But it is an incomplete answer.
A subvolume is an independent file tree, with its own POSIX namespace, and has its own pool of inodes. This means inode numbers repeat themselves on a Btrfs volume. Inodes are only unique within a given subvolume. A subvolume has its own st_dev, so if you use <code>stat FILE</code> it reports a device value referring to the subvolume the file is in. And it also means hard links can't be created between subvolumes. From this perspective, subvolumes start looking more like a separate file system. But subvolumes share most of the other trees, so they're not truly independent file systems. They're also not block devices.
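The <code>st_dev</code> point can be checked from a program as well as from <code>stat FILE</code>. The following is a minimal Python sketch using only the standard library; <code>same_device</code> and <code>file_identity</code> are names made up for this illustration. On Btrfs, two paths in different subvolumes would report different <code>st_dev</code> values, which is also why an inode number alone does not identify a file there.

```python
import os
import tempfile


def same_device(path_a, path_b):
    """True when both paths report the same st_dev.

    Hard links are only possible between paths for which this is True.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def file_identity(path):
    """Return (st_dev, st_ino); the pair, not st_ino alone, identifies a file."""
    status = os.stat(path)
    return (status.st_dev, status.st_ino)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        link = os.path.join(tmp, "link")
        with open(original, "w") as handle:
            handle.write("data")
        # Both names live in the same directory, so a hard link is allowed.
        os.link(original, link)
        print(same_device(original, link))
        print(file_identity(original) == file_identity(link))
```

Tools that deduplicate or detect hard links by inode number alone are the ones most likely to be surprised by subvolumes; comparing the pair avoids that.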
== Benefit to Fedora ==
Problems Btrfs helps solve:
- Users running out of free space on either <span style="color: tomato">/</span> or <span style="color: tomato">/home</span> [https://pagure.io/fedora-workstation/issue/152 Workstation issue #152]
** "one big file system": no hard barriers like partitions or logical volumes
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
** reflinks and snapshots are more efficient for use cases like containers (Podman supports both)
- Storage devices can be flaky, resulting in data corruption
** Everything is checksummed and verified on every read
** Corrupt data results in EIO (input/output error), instead of resulting in application confusion, and isn't replicated into backups and archives
- Poor desktop responsiveness when under pressure [https://pagure.io/fedora-workstation/issue/154 Workstation issue #154]
** Currently only Btrfs has proper IO isolation capability via cgroups2
** Completes the resource control picture: memory, cpu, IO isolation
- File system resize
** Online shrink and grow are fundamental to the design
- Complex storage setups are... complicated
** Simple and comprehensive command interface. One master command
** Simpler to boot, all code is in the kernel, no initramfs complexities
** Simple and efficient file system replication, including incremental backups, with <code>btrfs send</code> and <code>btrfs receive</code>
== Scope ==
- Proposal owners:
** Submit PRs for Anaconda to change <code>default_scheme = BTRFS</code> to the proper product files.
** Multiple test days: build community support network
** Aid with documentation
- Other developers:
** Anaconda, review PRs and merge
** Bootloader team, review PRs and merge
** Recommended optimization <code>chattr +C</code> set on the containing directory for virt-manager and GNOME Boxes.
Release engineering: [https://pagure.io/releng/issue/9545 #9545]
Policies and guidelines: N/A
Trademark approval: N/A
== Upgrade/compatibility impact ==
Change will not affect upgrades.
Documentation will be provided for existing Btrfs users to "retrofit" their setups to that of a default Btrfs installation (base plus any approved options).
== How To Test ==
'''''Today'''''<br /> Do a custom partitioning installation; change the scheme drop-down menu to Btrfs; click the blue "automatically create partitions"; and install.<br /> Fedora 31, 32, Rawhide, on x86_64 and ARM.
'''''Once change lands'''''<br /> It should be simple enough to test, just do a normal install.
== User Experience ==
==== Pros ====
- Mostly transparent
- Space savings from compression
- Longer lifespan of hardware, also from compression.
- Utilities for used and free space, CLI and GUI, are expected to
behave the same. No special commands are required.
- More detailed information can be revealed by <code>btrfs</code>
specific commands.
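As a rough check of the point that ordinary used and free space interfaces keep working, here is a small Python sketch that relies only on the standard library's <code>shutil.disk_usage</code>, the same kind of information <code>df</code> reports. <code>usage_summary</code> is a hypothetical helper name. Because subvolumes share space, it is an expectation from the proposal (not something this sketch verifies) that <code>/</code> and <code>/home</code> would report the same totals on the proposed layout.

```python
import shutil


def usage_summary(path="/"):
    """Return total, used and free bytes for the file system holding path."""
    usage = shutil.disk_usage(path)
    percent_used = (usage.used / usage.total * 100) if usage.total else 0.0
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "percent_used": percent_used,
    }


if __name__ == "__main__":
    summary = usage_summary("/")
    print(f"{summary['used']} of {summary['total']} bytes used "
          f"({summary['percent_used']:.1f}%)")
```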
==== Enhancement opportunities ====
[https://bugzilla.redhat.com/show_bug.cgi?id=906591 updatedb does not index /home when /home is a bind mount] This can also affect rpm-ostree installations, including Silverblue.
[https://gitlab.gnome.org/GNOME/gnome-usage/-/issues/49 GNOME Usage: Incorrect numbers when using multiple btrfs subvolumes] This isn't Btrfs specific, happens with "one big ext4" volume as well.
[https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/88 GNOME Boxes, RFE: create qcow2 with 'nocow' option when on btrfs /home] This is Btrfs specific, and is a recommended optimization for both GNOME Boxes and virt-manager.
[https://github.com/containers/libpod/issues/6563 containers/libpod: automatically use btrfs driver if on btrfs]
== Dependencies ==
None.
== Contingency Plan ==
Contingency mechanism: Owner will revert changes back to LVM+ext4
Contingency deadline: Beta freeze
Blocks release? Yes
Blocks product? Workstation and KDE
== Documentation ==
Strictly speaking no documentation is required reading for users. But there will be some Fedora documentation to help get the ball rolling.
For those who want to know more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page btrfs wiki main page and full feature list.]
<code>man 5 btrfs</code> contains: mount options, features, swapfile support, checksum algorithms, and more<br /> <code>man btrfs</code> contains an overview of the btrfs subcommands<br /> <code>man btrfs <nowiki><subcommand></nowiki></code> will show the man page for that subcommand
NOTE: The btrfs command will accept partial subcommands, as long as it's not ambiguous. These are equivalent commands:<br /> <code>btrfs subvolume snapshot</code><br /> <code>btrfs sub snap</code><br /> <code>btrfs su sn</code>
You'll discover your own convention. It might be preferable to write out the full command on forums and lists, but then maybe some folks don't learn about this useful shortcut?
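The "partial subcommands, as long as it's not ambiguous" rule is ordinary unique-prefix matching. The Python sketch below illustrates the idea only; it is not how btrfs-progs implements it, and the command table is a small hand-picked subset chosen for the example.

```python
# Illustrative subset of command groups and their subcommands (an assumption
# for this example, not the full btrfs command set).
COMMAND_TREE = {
    "subvolume": ["create", "delete", "list", "snapshot", "show"],
    "filesystem": ["df", "show", "usage", "resize"],
    "scrub": ["start", "status", "cancel"],
}


def resolve(prefix, candidates):
    """Expand prefix to the single candidate it matches, or raise ValueError."""
    if prefix in candidates:
        return prefix  # an exact match always wins
    matches = [name for name in candidates if name.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {prefix!r}")
    raise ValueError(
        f"ambiguous command {prefix!r}: could be {', '.join(sorted(matches))}"
    )


def expand(words):
    """Expand a two-word abbreviated command such as ["su", "sn"]."""
    group = resolve(words[0], COMMAND_TREE)
    action = resolve(words[1], COMMAND_TREE[group])
    return [group, action]


if __name__ == "__main__":
    print(expand(["sub", "snap"]))
    print(expand(["su", "sn"]))
```

With this table a bare <code>s</code> is rejected at either level, because it could mean <code>subvolume</code> or <code>scrub</code> at the top and <code>snapshot</code> or <code>show</code> below.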
For those who want to know a lot more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation Btrfs developer documentation]<br /> [https://github.com/btrfs/btrfs-dev-docs/blob/master/trees.txt Btrfs trees]
== Release Notes ==
The default file system on the desktop is Btrfs.
Since this only changes the default, used when you haven't picked your own partitioning scheme, it is an okay default, even though I disagree with moving to btrfs given the instability and other issues associated with it. This doesn't affect anyone who just clicks "Custom partitioning". Besides, it's better than XFS.
I strongly oppose this change. From the start, Btrfs was an empty promise to be the future of Linux filesystems, and it never got there; it's slower than traditional filesystems and I have found it buggy.
I'm betting on bcachefs as the next-gen fs; no one seems to have mentioned it here. I think once it gets to the point of being mainlined it will have the features of btrfs and the speed of traditional filesystems. To quote a recent comment from Kent Overstreet: "the phoronix numbers don't seem to align at all with how bcachefs feels in actual use; I and plenty of of other users find common operations tend to run quite a bit faster on bcachefs than other filesystems. Which is not to say the phoronix numbers are wrong - but they don't seem to be very representative in a way that hits bcachefs more than other filesystems."
Instead of adopting Btrfs, I would wait for bcachefs.
On Fri, Jun 26, 2020 at 2:46 PM Ben Cotton bcotton@redhat.com wrote:
A few claims (without justification):
There is no "average" Fedora user.
There is no "average" Fedora system.
There is also no "average" workload on the (non) average system (some people only use Firefox to access a few web sites, some run computational fluid dynamics-Monte Carlo simulations on $20K workstations with petabytes of disk storage).
The FAANGs are valuable for having both real world experience at the extremes, and also the resources to dig deep when necessary, but while their experience is indicative, they are also not "average" use cases.
I, personally, have within 10 feet of me in my residence systems using vfat, ext3/4, xfs, f2fs, jfs, btrfs, each of which were historically chosen/tuned for their *specific* use cases and requirements (although some of the systems, and/or the specific filesystems, are on their way out when I have a bit of time, but ENOTIME).
However, those that are (self identified) advanced users (mostly those commenting here) should be sufficiently competent to choose what their created filesystems should be for a new install, based on their *specific* needs, so they, too, are not average use cases.
What I think is important is what a new, initial, non-expert Fedora user will find the best experience.
Conclusion (again, without substantial justification):
And I accept that it is hard to challenge the reality that that is currently btrfs.
I am in favor of the proposal.
Thank you for asking.
On 6/27/2020 7:32 PM, Gary Buhrmaster wrote:
On Fri, Jun 26, 2020 at 2:46 PM Ben Cotton bcotton@redhat.com wrote:
A few claims (without justification):
There is no "average" Fedora user.
There is no "average" Fedora system.
Indeed not.
For what it's worth, I started with ZFS on Fedora, and got a little irritated that it wasn't ready for 32 when 32 was released. So on the advice of a friend, I migrated my ZFS volumes to BTRFS. They've been stable and reliable for me. I'm not doing anything crazy - just some subvolumes on single disks. ZFS was pretty heavy for that, and I didn't enjoy the complexity of managing dkms drivers and the odd edge cases it brings with it. Meanwhile, BTRFS does the things that I wanted to be able to do with ZFS. Nothing I couldn't do with LVM, but I'd heard good things about BTRFS and wanted to try it. It has worked well for me.
Maybe in a perfect world, Oracle would change licensing for ZFS and it could be included in the kernel proper, but that seems unlikely at this point.
We've got a bunch of people to support this proposal; it's straightforward to reverse if things go badly. It won't affect upgrades. It doesn't sound like the default layout is going to expose users to the known risks of BTRFS, and there will definitely be advantages.
I'm in favor of making this change as well.
Thanks,
Marty
As a Fedora user I would like to throw my 2 cents in from my personal experience and say I support this proposal.
I've used btrfs on a home NAS since Fedora 25 (now running F32) without any issues. The ability to resize the filesystem and change RAID levels has been super beneficial but the ability to detect bit rot was really the deciding factor especially when storing valuable files like family photos.
In addition to the home NAS I also run btrfs on my laptops and desktops without any noticeable performance impact compared to ext4 and XFS. The ability to take snapshots has been really nice to recover from the accidental mishap like deleting the wrong file.
--- Kevin Anderson
Hello,
As a long-time Fedora user, I'm in total opposition to this proposal. Btrfs is unstable and not ready for production. Most of what I'm about to write is admittedly anecdotal, but it's the only file system in Linux which has actively and regularly caused me to lose data on desktops, laptops, servers and even mobile phones when I haven't taken precautions and made regular backups, something I don't have to actively do when using ext4 on my workstations and notebooks.
This has happened to me because OpenSUSE and Jolla's Sailfish OS use btrfs as their default file system. I've tried using btrfs from time to time in various environments to see how it's progressing. However, long-standing issues in btrfs on desktops and laptops have gone unfixed for years. For example, btrfs can still run out of its automatically managed "metadata space", which it cannot recover from. Even the relatively recent improvements in kernel 5.8 have already been proven not to improve the situation much, although at least subvolume deletion failing for lack of disk space is now handled slightly better.
You could probably just ask for the issue statistics from OpenSUSE and SUSE to see how unreliable btrfs is in reality. I hypothesise that a large majority of OpenSUSE users don't actually use the supposed default file system of their distro and instead opt to use zfs, xfs and ext4.
I'm honestly in shock that this is even up for discussion again. If there is a legitimate urgent need to switch the default file system for desktop and laptop users (and I understand why there is pressure to do so, since ext4 has a number of shortcomings), then whatever legal obstacles are blocking the use of zfs should be cleared and zfs should be used instead. Canonical is already trying to do this in Ubuntu through the use of OpenZFS. Xfs has started to have issues as of late, but even it would be a legitimate choice.
The absolute first issue with btrfs on desktops and laptops is that it requires active, conscious maintenance from end users to avoid a large number of potentially disastrous situations, as well as regular automatic maintenance in the background, which taxes the disks and eats resources. Based on my experience, btrfs works best when you don't use the features you supposedly installed it for. Its snapshots are a great example of that. This is why I suspect that most btrfs "success stories" are ones where the users don't take advantage of btrfs' features, or have actively turned them off, conscious of the issues they bring up later on. Using btrfs doesn't make using a PC easier; it does the opposite by adding more work. Meanwhile zfs has a reliable, working snapshot feature which is in actual use.
With btrfs the following situation is very common. It's not too uncommon for users to have their disks full or nearly full. Okay, users will then delete some files, maybe a few large applications, but on btrfs that is often not enough. The user then has to manually run a btrfs-balance operation with filters. That usually resolves the situation, but it will start happening more frequently until it's completely unsolvable for the end user without major external assistance or a reformat.
And what inevitably happens with a btrfs root volume is that the system can and will stop booting after a period of "strange behaviour". Sometimes it can be resolved in maintenance mode, but usually the end user then has to boot a live environment, chroot into their system, clear out large (hopefully backed-up) files if the file system is not read-only (or clear that obstacle first), clear (most) snapshots, and run a btrfs-balance operation, doing it very carefully or the entire file system might be lost. This will take a very long time (ranging from 30 minutes to some hours, and up to 3-4 days in my experience) even on relatively small SSDs (not to mention HDDs), and it will also shorten SSD lifespan.
If a laptop is put into sleep mode without the user noticing that btrfs is running maintenance operations in the background (and it often is), the likelihood that the file system will get corrupted goes through the roof. Something users can do as first aid is use TLP and set SATA_LINKPWR_ON_BAT=max_performance, which then shortens the time the laptop can be used without recharging. This has been a standing issue at least since 2015, with no real fix in sight other than "lol, stop using btrfs", as one commenter on Reddit wrote.
btrfs-check is also a massive can of worms, and it cannot be run safely, at least not without reading pages upon pages of the manual and becoming an expert in how btrfs works. Expecting every Fedora end user to do this is unrealistic in many different ways.
Btrfs has no native encryption to my knowledge. However, alternatives such as zfs already have trusted and reliable encryption, used in numerous FreeNAS installations around the world.
And many of these issues, and many more, are mentioned outright in btrfs' own wiki pages at kernel.org, where one of the most shocking admissions is: "So, in general, it is impossible to give an accurate estimate of the amount of free space on any btrfs filesystem. Yes, this sucks."
Source: https://btrfs.wiki.kernel.org/index.php/FAQ#Why_is_free_space_so_complicated...
And these are the brains behind btrfs admitting that there is no solution for this. No amount of userspace tools development and UX/DE integration is going to solve this for end users.
Please, don't switch to btrfs. It is not mature. It is not well-understood. It is not properly "battle-tested". It can still die on its own. It's just a ridiculous meme file system. At this point it would take a decade or so of smooth sailing on the OpenSUSE side for me to start believing that btrfs is ready for prime time on my own personal Fedora systems. Even 5 years of smooth sailing would give me more faith in it. But as it stands I have to strongly oppose btrfs. It's too much of a headache with no relief in sight.
Yes, Btrfs was very unstable, but that was in the past. All software goes through this process. I have talked to one of the maintainers of Btrfs, and she thinks that Btrfs is ready for production use. (A few years ago, she was strongly against using Btrfs for production purposes.)
But after all, this is an open topic we should talk about: is Btrfs stable enough for users?
Yes, Btrfs was very unstable, but that was in the past. All software goes through this process. I have talked to one of the maintainers of Btrfs, and she thinks that Btrfs is ready for production use. (A few years ago, she was strongly against using Btrfs for production purposes.)
It's true that every piece of software has bugs and has to go through a period of testing. This is especially true of file systems, which the rest of the OS has to trust to work. However, for btrfs this has already lasted 10 years with no end in sight.
We only have an ideologically driven push for taking btrfs everywhere, even to places where it makes absolutely no sense at all, such as mobile phones. Also, every time this conversation happens btrfs is supposedly "very stable" and "production-ready", except when things go wrong, and they will go wrong, at which point btrfs proponents tell us "it's an experimental file system and not production-ready yet, but will soon be (for your use case)" or "you weren't following the proper use policy for btrfs". It's also an undeniable fact that btrfs has numerous bugs which can result in data loss. Even this very month one of the developers of btrfs, Zygo Blaxell, wrote that:
"We have far too many real data loss bugs in btrfs already."
Source: https://lore.kernel.org/linux-btrfs/20200619050402.GN10769@hungrycats.org/
...and how they don't want to deal with problems which aren't actual proven issues. Furthermore, since we have this whole debate going on, it is a little amazing that few people, if anyone at all, have mentioned NILFS2. Even I only remembered it afterwards.
https://en.wikipedia.org/wiki/NILFS
This is yet another B-tree file system, but I have far more trust in it than in btrfs because Nippon Telegraph and Telephone Corporation (NTT) is behind NILFS and there is actual real-world data backing its speed and reliability. It is demonstrably more performant on average than Ext4, on desktops at least, since one of its objectives is to be a low-latency file system.
But after all, this is an open topic we should talk about: is Btrfs stable enough for users?
Yes, we can and we should have a discussion on this as a community. But I just have to politely disagree about btrfs being stable enough for most users. I honestly cannot recommend btrfs for desktop and laptop users, and that is what this proposal is about. For servers there are some benefits to using btrfs, but even there zfs or nilfs2 would serve them better.
On Sun, Jun 28, 2020 at 5:56 AM Antti antti.aspinen@gmail.com wrote:
"We have far too many real data loss bugs in btrfs already."
Source: https://lore.kernel.org/linux-btrfs/20200619050402.GN10769@hungrycats.org/
Zygo Blaxell is a linux-btrfs@ list and #btrfs channel regular, and a contributor to the kernel. And you are taking his quote out of context, which is in the message you quote: the false / unsupported claim of data loss.
That context includes that he trusts Btrfs more than other file systems, and he's been using Btrfs in production all day long for a long time now. By all means go ask him yourself what he thinks.
-- Chris Murphy
And you are taking his quote out of context, which is in the message you quote: the false / unsupported claim of data loss.
That context includes that he trusts Btrfs more than other file systems, and he's been using Btrfs in production all day long for a long time now. By all means go ask him yourself what he thinks.
I'm not sure if you understood me correctly, since you seem to misunderstand the full context yourself. The context is explained in the message, but allow me to summarize it here. They're talking about a fictional scenario where an inexperienced openSUSE user, "Joe Bloggs", wants to make more room for himself and finds a utility on a wiki which enables him to do this. The claim is that the use of the btrfs-dedupe utility could potentially lead to data loss for this imagined inexperienced user "Joe". And Blaxell quite correctly writes that:
Joe Bloggs will not lose any data from btrfs-dedupe. He'll waste his time and run out of disk space, and maybe switch filesystems due to frustration, but Joe will not lose any of his data.
Which I agree with. After that Blaxell also correctly states that:
btrfs-dedupe has not had new commits in years and no longer builds on today's Rust. Those facts alone would have been sufficient to justify removing it from the wiki. We have far too many real data loss bugs in btrfs already. There is no need to spread rumors about new ones just to push changes through.
If you carefully read through the whole message, it is obvious that Blaxell indeed does trust btrfs and probably does use it in production. And I think this is a very good thing. Developers of product x must have faith in what they're doing and use their own software.
However, the most commendable thing he wrote here is the part where he honestly admits that they also do have many real data loss bugs in btrfs and wishes that people would not spread rumors of non-existent ones. I can testify that this is true, even if it's anecdotal, and as a developer I also share his wish that people would not spread rumours of non-existent issues. And I feel that way no matter what project it is.
My point being that there are unresolved major issues in btrfs which should be fixed before Fedora can even consider making btrfs the default file system. Issues which aren't present in alternatives to btrfs. I don't "hate" btrfs. I merely "dislike" it, and this stems from my own negative experiences with it, having used it every now and then since around 2009–2010, when it was first introduced to the kernel, and actively every day since 2014.
The best file systems are invisible to end users. That's also where I've set the bar for when a file system is "ready for production" in desktop environments. It is my lowest expectation for a file system. And btrfs still falls short of this after many years of full corporate support. Yes, it has made huge progress since the early days, I must say. I'm very impressed by some of the things btrfs can do when things work as intended. Furthermore, nowadays I don't have to do a full reformat of my Sailfish OS device once a year because of btrfs, which is an absolute relief and a concrete example that progress has been made.
But btrfs is still not invisible. Meaning that when I do use it I actively have to think about using btrfs-check, btrfs-balance, btrfs-filesystem, etc. every now and then and cannot just use the system without a worry of something breaking up. And as a person who likes Fedora, who wants more people using Fedora, I also worry about the user experience and how btrfs is going to change the life for users if it becomes the default choice.
-- Antti (Hopeakoski)
On Sun, Jun 28, 2020 at 1:25 PM Antti antti.aspinen@gmail.com wrote:
However, the most commendable thing he wrote here is the part where he honestly admits that they also do have many real data loss bugs in btrfs and wishes that people would not spread rumors of non-existent ones.
I asked him about the "have" comment in that email today on #btrfs and he said the intent was "have had".
My point being that there are unresolved major issues in btrfs which should be fixed before Fedora can even consider making btrfs the default file system.
Can you be more specific about what major issues you think are unresolved?
But btrfs is still not invisible. Meaning that when I do use it I actively have to think about using btrfs-check, btrfs-balance, btrfs-filesystem, etc.
Why? Those are used in specific situations, they're not routinely needed. I do not baby sit my file systems with any of these. I don't ever balance my file systems. Running check isn't needed as some sort of precaution, if the file system mounts without complaint. For used and free space reporting I use 'df' and 'du' just like any other file system.
every now and then and cannot just use the system without a worry of something breaking up. And as a person who likes Fedora, who wants more people using Fedora, I also worry about the user experience and how btrfs is going to change the life for users if it becomes the default choice.
I expect the vast majority of users will benefit from the change. As stated elsewhere in these threads, I expect the pattern of problems to be different between Btrfs and ext4. But the idea there are no problems at all whatsoever under ext4 ignores all the benefits in the proposal, and runs for cover under merely what we are used to.
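For reference, the used and free figures that `df` prints are the kind of numbers any program can query from the kernel. A minimal sketch in Python, assuming `/` is the mount point of interest; the helper name `usage_summary` is made up for illustration and is not part of any Btrfs tooling:

```python
import shutil


def usage_summary(path="/"):
    """Return (total, used, free) in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # on Linux this is backed by os.statvfs
    gib = 1024 ** 3
    return tuple(round(n / gib, 2) for n in (usage.total, usage.used, usage.free))


if __name__ == "__main__":
    total, used, free = usage_summary("/")
    print(f"total={total} GiB used={used} GiB free={free} GiB")
```

This works the same way on Btrfs, ext4, or xfs; nothing in it is file-system specific.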
Yes, Btrfs was very unstable, but that was before. Every piece of software goes through this process. I have talked to one of the maintainers of Btrfs; she thinks that Btrfs is ready for production usage (a few years ago, she was strongly against using Btrfs for production purposes).
May I ask who was the person you talked to? I'm asking as the active maintainer of btrfs. I'm familiar with who does what in the community and overall status so it would be of my community interest to know who is speaking on behalf of the project, without me having even the slightest idea who that could be.
If you don't want to disclose the name in public, feel free to respond in private.
Thanks.
David Sterba dave@jikos.cz wrote on Tue, Jul 7, 2020 at 6:09 PM:
Yes, Btrfs was very unstable, but that was before. Every piece of software goes through this process. I have talked to one of the maintainers of Btrfs; she thinks that Btrfs is ready for production usage (a few years ago, she was strongly against using Btrfs for production purposes).
May I ask who was the person you talked to? I'm asking as the active maintainer of btrfs. I'm familiar with who does what in the community and overall status so it would be of my community interest to know who is speaking on behalf of the project, without me having even the slightest idea who that could be.
I may have taken it wrongly; she is a developer at SUSE but not a maintainer. Sorry for my mistake, and that is only a personal opinion.
And that is a private talk.
If you don't want to disclose the name in public, feel free to respond in private.
Thanks.
[1] https://twitter.com/mawei_spoiler/status/1275692573999407108
And to add: since zfs can't be supported by Fedora for now, btrfs is the only filesystem that can identify file corruption and bit flips in your memory. So, if btrfs is stable enough, this will absolutely be a benefit to users.
On Sun, 2020-06-28 at 10:21 +0000, Antti wrote:
Hello,
I'm in total opposition to this proposal as a long-time Fedora user. The btrfs is unstable and not ready for production. Most of what I'm about to write is admittedly anecdotal but it's the only file system in Linux which has actively and regularly caused me to lose data on desktops, laptops, servers and even on mobile phones when I haven't taken precautions and done regular backups. Something I don't have to actively do when using ext4 in my workstations and notebooks.
This has happened to me because OpenSUSE and Jolla's Sailfish OS use btrfs as their default file system. I've tried using btrfs from time to time in various environments to see how it's progressing. However there haven't been fixes for long-standing issues in btrfs when it comes to desktops and laptops in years. Btrfs can still for example run out of its automatically managed "metadata space" which it cannot recover from. Even the relatively recent improvements in kernel 5.8 have already been proven to not improve the situation much although at least the subvolume deletion failing over lack of disk space is now handled slightly better.
You could probably just ask the issue statistics from OpenSUSE and SUSE to see how unreliable btrfs is in reality. I hypothesise that a large majority of OpenSUSE users don't actually use the supposed default file system of their distro and instead opt to use zfs, xfs and ext4.
In one of the other messages in this thread, people from openSUSE have said that they started to receive fewer bug reports when they switched from btrfs(root)+xfs(home) to just btrfs(root+home).
I'm honestly in shock that this is even a discussion right now again. If there is a legitimate urgent need to switch the default file system for desktop and laptop users (and I understand why there is pressure to do so since ext4 has a number of shortcomings), then whatever legal obstacles there are blocking the use of zfs should be cleared and zfs should be used instead. Canonical with their Ubuntu is already trying to do this through use of OpenZFS. The xfs has started to have issues as of late but even it would be a legitimate choice.
What benefits does zfs have over btrfs? Especially when it is not part of the kernel, and people in Fedora do not really have much experience with it (nor expertise in its development).
The absolute first issue with btrfs in desktops and laptops is that it requires active conscious maintenance from the end-users to avoid large number of potentially disastrous situations as well as unconscious regular automatic constant maintenance on background which consume the disks and eat resources. Based on my experiences btrfs works best when you don't use the features you supposedly install it for. Its snapshots are a great example of that. Which is why I suspect that most btrfs "success stories" are ones where the users don't take advantage of the btrfs' features or have actively turned them off conscious of issues they bring up later on. Using btrfs doesn't make using PC easier and instead does the opposite by adding more work. Meanwhile zfs has reliable and working snapshots feature which is in actual use.
Snapshots in Btrfs are working quite well; which exact features are you referring to here? Also note that we are not planning to offer any features that are not stable. We are planning to offer what btrfs provides as default.
With btrfs the following is a very common situation: It's not too uncommon for users to have their entire disks full or near full. Okay, users will then delete some files, maybe few large applications, but in btrfs that is often not enough. User has to manually then run btrfs-balance operation with filters and it usually resolves the situation but it will start happening more frequently until it's completely unsolvable for the end-user without major external assistance or them performing a reformat.
And what inevitably happens with btrfs root volume is that the system can and will stop booting after period of "strange behaviour". Sometimes it can be resolved in maintenance mode but usually the end-user then has to boot a live environment, chroot their system, and clear all hopefully backup'd large files if the system is not in read-only (or clear that obstacle first), clear (most) snapshots, run btrfs-balance operation and do it very carefully or the entire file system might be lost. This will take a very long-time (ranging from 30 minutes to some hours and up to 3-4 days based on my experiences) even on a relatively small SSDs (not to mention HDDs) and it also will shorten SSD lifespan.
If a laptop is put into sleep mode without users noticing that btrfs is running maintenance ops in the background (and it often is), the likelihood that the file system will get corrupted goes through the roof. Something users can do is use TLP and as a first aid set SATA_LINKPWR_ON_BAT=max_performance for TLP, which then will shorten the amount of time the laptop can be used without recharging. And this has been a standing issue at least since 2015 with no real fix in sight other than "lol, stop using btrfs" like one commentator at Reddit wrote.
Do you have a link to a bug report about this?
The btrfs-check is also a massive can of worms and it cannot be safely run. At least not without reading pages upon pages of manual and becoming an expert in understanding how btrfs works. Expecting every Fedora end-user to do this is unrealistic in many different ways.
Well, I hope people don't run random commands from the internet? Myself I did not even need to run anything like that.
The btrfs has no native encryption to my knowledge. However alternatives such as zfs already have a trusted and reliable encryption used in numerous FreeNAS installations around the world.
Upstream is working on that.
And much of these issues and many more are straight up mentioned in btrfs' own wiki pages at kernel.org where one of the most shocking admissions is: "So, in general, it is impossible to give an accurate estimate of the amount of free space on any btrfs filesystem. Yes, this sucks."
Source:
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_is_free_space_so_complicated...
And these are the brains behind btrfs admitting this that there is no solution for this. No amount of userspace tools development and UX/DE integration is going to solve this for the end-users.
Please, don't switch to btrfs. It is not mature. It is not well-understood. It is not properly "battle-tested". It can still die on its own. It's just a ridiculous meme file system. At this point it would take me some decade of smooth sailing at OpenSUSE side to start believing that btrfs is ready for prime time in my own personal Fedora systems. Even 5 years of smooth sailing would give more faith in it. But as it stands I have to strongly oppose btrfs. It's too much of a headache with no relief in-sight.
-- Antti (Hopeakoski)
P.S. Sorry for this emotional nature of this message. But I really, really like my Fedora and I really, really dislike btrfs due past highly negative experiences with it (some of them happening to me as recently as last year).
-- Igor Raits ignatenkobrain@fedoraproject.org
This has happened to me because OpenSUSE and Jolla's Sailfish OS use btrfs as their default file system. I've tried using btrfs from time to time in various environments to see how it's progressing. However there haven't been fixes for long-standing issues in btrfs when it comes to desktops and laptops in years. Btrfs can still for example run out of its automatically managed "metadata space" which it cannot recover from. Even the relatively recent improvements in kernel 5.8 have already been proven to not improve the situation much although at least the subvolume deletion failing over lack of disk space is now handled slightly better.
It's `openSUSE` not `OpenSUSE`. The metadata issue was mentioned in this thread before multiple times too. It's rare, but it happens.
I'm honestly in shock that this is even a discussion right now again. If there is a legitimate urgent need to switch the default file system for desktop and laptop users (and I understand why there is pressure to do so since ext4 has a number of shortcomings), then whatever legal obstacles there are blocking the use of zfs should be cleared and zfs should be used instead. Canonical with their Ubuntu is already trying to do this through use of OpenZFS. The xfs has started to have issues as of late but even it would be a legitimate choice.
I know of one brave soul that maintained zfs based Tumbleweed install, and planned to have that included in the repos, but that died after nth kernel update breaking it. I think rolling/semi-rolling distros cannot ship OpenZFS, without it being release-blocking for kernel updates, because people kinda rely on their filesystem working. We already kinda had that dilemma with Nvidia driver on openSUSE side, and decided that kernel might actually be more important than any third party software that doesn't work with the new kernel, so OpenZFS is entirely out of the equation
The absolute first issue with btrfs in desktops and laptops is that it requires active conscious maintenance from the end-users to avoid large number of potentially disastrous situations as well as unconscious regular automatic constant maintenance on background which consume the disks and eat resources. Based on my experiences btrfs works best when you don't use the features you supposedly install it for. Its snapshots are a great example of that. Which is why I suspect that most btrfs "success stories" are ones where the users don't take advantage of the btrfs' features or have actively turned them off conscious of issues they bring up later on. Using btrfs doesn't make using PC easier and instead does the opposite by adding more work. Meanwhile zfs has reliable and working snapshots feature which is in actual use.
Those are edge cases, on a single disk setup, you shouldn't need to run maintenance scripts nor have them run automatically. We have those running on openSUSE distros because as I mentioned, SUSE doesn't care about desktop, so those are inherited from how servers do it
And what inevitably happens with btrfs root volume is that the system can and will stop booting after period of "strange behaviour". Sometimes it can be resolved in maintenance mode but usually the end-user then has to boot a live environment, chroot their system, and clear all hopefully backup'd large files if the system is not in read-only (or clear that obstacle first), clear (most) snapshots, run btrfs-balance operation and do it very carefully or the entire file system might be lost. This will take a very long-time (ranging from 30 minutes to some hours and up to 3-4 days based on my experiences) even on a relatively small SSDs (not to mention HDDs) and it also will shorten SSD lifespan.
Generally you could use the following guide https://en.opensuse.org/SDB:BTRFS#How_to_repair_a_broken.2Funmountable_btrfs... since it's faster than guessing what to do. I assume btrfs-balance is guessing what to do, because it doesn't make sense in the scenario
The btrfs-check is also a massive can of worms and it cannot be safely run. At least not without reading pages upon pages of manual and becoming an expert in understanding how btrfs works. Expecting every Fedora end-user to do this is unrealistic in many different ways.
Linking again to a very short, very linear and very easy to follow set of steps to recover a btrfs volume, where the last step is `btrfs check --repair`, if everything else fails ;) https://en.opensuse.org/SDB:BTRFS#How_to_repair_a_broken.2Funmountable_btrfs... It's certainly not pages upon pages of manual, it's a short subsection of a longer Support DB article.
The btrfs has no native encryption to my knowledge. However alternatives such as zfs already have a trusted and reliable encryption used in numerous FreeNAS installations around the world.
Until then, you can use luks since that's how every other filesystem in Linux has done it. For that matter, FreeNAS's zfs uses geli (luks but bsd not linux) and not zfs' native encryption
And much of these issues and many more are straight up mentioned in btrfs' own wiki pages at kernel.org where one of the most shocking admissions is: "So, in general, it is impossible to give an accurate estimate of the amount of free space on any btrfs filesystem. Yes, this sucks."
Source: https://btrfs.wiki.kernel.org/index.php/FAQ#Why_is_free_space_so_complica...
And these are the brains behind btrfs admitting this that there is no solution for this. No amount of userspace tools development and UX/DE integration is going to solve this for the end-users.
The default partitioning will not be raid. This will be an issue with any filesystem with features like btrfs on raid.
LCP [Stasiek] https://lcp.world
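To illustrate the free-space point above: the same amount of unallocated raw disk space holds a different amount of file data depending on the profile the next chunk is allocated with, which is one reason a single "free" number is hard to give. A minimal sketch, assuming the usual replication factors (single keeps one copy; DUP and RAID1 keep two); the function name and the 100 GiB figure are made up for illustration:

```python
# Copies written per logical byte for a few Btrfs block-group profiles
# (assumed factors for illustration: single = 1, DUP = 2, RAID1 = 2).
COPIES = {"single": 1, "dup": 2, "raid1": 2}


def usable_bytes(unallocated_raw, profile):
    """Bytes of file data that fit if all raw space is allocated with `profile`."""
    return unallocated_raw // COPIES[profile]


raw = 100 * 1024 ** 3  # hypothetical 100 GiB of unallocated raw space
for profile in COPIES:
    print(profile, usable_bytes(raw, profile) // 1024 ** 3, "GiB usable")
```

On a single-disk default install with single-profile data, the estimate is far less ambiguous than on a multi-device setup with mixed profiles.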
On Sunday, June 28, 2020 3:21:17 AM MST Antti wrote:
[snip]
Another way to consider this would be that we can stop arguing against these changes, let the GNOME folks run the ship aground, and hope that the user backlash will act as a wakeup call when it comes to these changes. I agree that btrfs is far too unstable to be made a default, and I also agree that ZFS would be a much better option. However, there is always going to be pushback on ZFS. If you want the best, there's a price to pay, and that's licensing headaches in this case.
In the end, it doesn't really matter what we say. All of the arguments in this thread are likely to be ignored by FESCo, as they have in other recent change "proposals" (more like change announcements, in this case). So, perhaps we should just watch this fail, and use that failure to push a sane default in the next release.
Another way to consider this would be that we can stop arguing against these changes, let the GNOME folks run the ship aground, and hope that the user backlash will act as a wakeup call when it comes to these changes. I agree that btrfs is far too unstable to be made a default, and I also agree that ZFS would be a much better option. However, there is always going to be pushback on ZFS. If you want the best, there's a price to pay, and that's licensing headaches in this case.
I understand you but I'd like to help btrfs guys to get their stuff working. And for two days now I've tried to write a reasonable honest truthful reply for their questions backed by facts and confirmed data unable to come up with concise answers. After following this topic it became clear to me that I'm not sufficiently prepared to give a proper technical presentation of my issues or to have an in-depth discussion of btrfs while they are very well prepared to defend their position. This is happening so suddenly too. I didn't expect Fedora to start considering this at all because Red Hat isn't at least publicly discussing it. I'd also like to avoid writing massive dozen page emails about my personal issues with btrfs when the central question here is if btrfs is good enough for majority of Fedora's user base. It could be even if it isn't ready for my use.
I can only offer descriptions of symptoms of trouble from the web back-end developer / desktop end-user PoV which starts to appear in personal computers where I have used or currently do use btrfs if not full-time. I made a long list of these yesterday and only some of them can be linked to existing known issues which are yet to be fixed so I didn't send that list to Chris Murphy and Stasiek Michalski yet and might not do so. Not publicly at least. Some of the issues have been fixed but not yet present in openSUSE Leap 15.1 where I previously experienced just how broken btrfs can be at its worst and I don't have that particular setup right now to even test if these changes would aid me in upcoming openSUSE Leap 15.2 release. I just have to let my head cool down before trying btrfs full-time again in a year or two.
Furthermore some of the things the proponents of this change have written just throw me back into my chair because after all I've gone through with btrfs and after all the lost time I could have spent better producing code, I know what they're writing is simply not true. Or not true in my case and I have major disbelief regarding for example there being no need to run btrfs balance when on my ThinkPad T430 I know for a fact that btrfs constantly will start running out of disk space and the solutions to it only temporarily solve it through regular use of btrfs balance, disabling snapshots which tend to get corrupted anyway and fine tuning the file system. But then again I don't think they're lying and I don't want to accuse them of that. There are visibly big gaps in how btrfs is experienced by different people in different working environments on different hardware. Based on what I've read lately, btrfs seems to work at really big scales very well. Where it fails to work are smaller individual setups and small businesses. This makes it a controversial file system.
Like I explained in another message, btrfs to me is highly visible file system and a source of stress as I have to eventually babysit it which to me proves it is unstable and not production ready. And this variation in how btrfs is experienced by different people is perhaps just another sign of it not being production ready yet even if huge progress has been made recent times. That is a good reason not to make it the default choice in Fedora.
Yesterday I went through my past emails where I discuss btrfs with my colleagues. It was some years ago but I had similar freezing issues back then as I had in 2019, and when asked about it btrfs supporters explained to me that my particular workload, which involves a collection of hundreds if not up to a thousand small C, Ruby, Python, PHP and C# files, a hundred GBs of image data split into approximately 1MB files, and two large local databases, is "poison to btrfs" to quote a friend of mine. It was recommended that I run JFS instead which I've yet to touch but now that I have time I should try it as well as other available options. I'm highly interested in testing nilfs and bcachefs right now.
Furthermore even if Fedora were to set the btrfs as a default, I wouldn't use it in my main PCs since btrfs doesn't enjoy my trust at all right now. However I would stop recommending Fedora to my friends and family because doing a custom partitioning in Anaconda using another file system is way too complex and difficult task to perform. It's much easier to just recommend Ubuntu or openSUSE where the partitioning using alternative file system is much much easier and clear cut operation to do.
If it was easy to choose e.g. plain lvm+ext4 or Stratis lvm+xfs instead of btrfs during Fedora installation like it is in openSUSE I probably wouldn't be in total opposition to this proposal. I still would be against it but I wouldn't be here writing these messages about this issue and expressing my opposition to this proposal. And it would have to be fixed first before making btrfs the default file system.
The zfs in my opinion isn't a perfect choice either technologically or legally. However it is the best thing out there for people who want a usable production-ready advanced file system right now. The issues with zfs to me present themselves as easier to solve than the problems with btrfs. But I'm not a legal expert and it really is a shame that the licensing is an issue with zfs. It's a very good file system and it didn't need to be forcefully pushed to become a success story. This is opposite to btrfs since its proponents constantly seem to want to forcibly push it onto people who don't want it. That combined with its continued technical issues has turned my initial positive enthusiasm towards btrfs into a very deep skepticism of it and its promised capabilities.
I wouldn't count on there being a backlash. Usually Fedora users are very open to changes and used to living "near the edge". If it has already been decided to make btrfs the default and this is merely a formality, then I'm hoping for the best case: that they know what they're doing and that btrfs' very latest version is usable for long periods of time and works through several upgrade cycles without a reformat, even if last year it wasn't yet on the openSUSE Leap 15.1 release for myself.
In the end, it doesn't really matter what we say. All of the arguments in this thread are likely to be ignored by FESCo, as they have in other recent change "proposals" (more like change announcements, in this case). So, perhaps we should just watch this fail, and use that failure to push a sane default in the next release.
Quite the contrary, I have hopes that my opinion does matter like it did in the past with the questions about systemd and pulseaudio. At least I hope btrfs becoming the default can be delayed enough to ensure that if large number of problems do start to appear, people have at least an easy alternative, or "safety net" if you will, to choose during the Fedora install. Anaconda should be changed to allow an easy alternative to btrfs be chosen even if it is the initial proposed partitioning scheme for users. The btrfs change shouldn't come as a surprise for users who don't read the change logs either. I deeply care about Fedora and don't wish to see any kind of "riot". Especially if it is completely avoidable with some preparations.
I can only offer descriptions of symptoms of trouble from the web back-end developer / desktop end-user PoV which starts to appear in personal computers where I have used or currently do use btrfs if not full-time. I made a long list of these yesterday and only some of them can be linked to existing known issues which are yet to be fixed so I didn't send that list to Chris Murphy and Stasiek Michalski yet and might not do so. Not publicly at least. Some of the issues have been fixed but not yet present in openSUSE Leap 15.1 where I previously experienced just how broken btrfs can be at its worst and I don't have that particular setup right now to even test if these changes would aid me in upcoming openSUSE Leap 15.2 release. I just have to let my head cool down before trying btrfs full-time again in a year or two.
Leap 15.2 might be a good choice in this case, since it will suffer that mid-life kernel rebase of Leap 15. You kinda got me, because despite technically being a Leap developer, I don't use it, because I don't have any use for it anywhere outside of the parts I contribute to. My experience with Leap might therefore be limited. Whatever isn't being posted on Reddit, Bugzilla, Discord or Matrix about btrfs on Leap I am going to miss, because my entire experience of btrfs has been through interacting with Tumbleweed and derivatives (Kubic, MicroOS). Considering the schedule at which Fedora and Tumbleweed upgrade the kernels is closer, this should actually be a more fair comparison though.
Furthermore some of the things the proponents of this change have written just throw me back into my chair because after all I've gone through with btrfs and after all the lost time I could have spent better producing code, I know what they're writing is simply not true. Or not true in my case and I have major disbelief regarding for example there being no need to run btrfs balance when on my ThinkPad T430 I know for a fact that btrfs constantly will start running out of disk space and the solutions to it only temporarily solve it through regular use of btrfs balance, disabling snapshots which tend to get corrupted anyway and fine tuning the file system. But then again I don't think they're lying and I don't want to accuse them of that. There are visibly big gaps in how btrfs is experienced by different people in different working environments on different hardware. Based on what I've read lately, btrfs seems to work at really big scales very well. Where it fails to work are smaller individual setups and small businesses. This makes it a controversial file system.
Snapshots aren't a part of this proposal, and frankly they do require a little bit more UX work, since they tend to cause people to run out of space too, because we don't cap that well enough. You can make snapshots work in a way that won't annoy you, because there are ways to set them up correctly, but for that you shouldn't rely on openSUSE distros defaults in that regard. Also I doubt openSUSE distros are used on big scale very often, I can think of very few examples, but they certainly don't match the sheer amount of users we have on various communication channels otherwise.
If it was easy to choose e.g. plain lvm+ext4 or Stratis lvm+xfs instead of btrfs during Fedora installation like it is in openSUSE I probably wouldn't be in total opposition to this proposal. I still would be against it but I wouldn't be here writing these messages about this issue and expressing my opposition to this proposal. And it would have to be fixed first before making btrfs the default file system.
Which openSUSE do you mean? Our custom partitioning is a nightmare, to the point that even YaST developers recently started wanting to make it easier.
LCP [Stasiek] https://lcp.world
On Tue, Jun 30, 2020 at 10:30 AM Antti antti.aspinen@gmail.com wrote:
Another way to consider this would be that we can stop arguing against these changes, let the GNOME folks run the ship aground, and hope that the user backlash will act as a wakeup call when it comes to these changes. I agree that btrfs is far too unstable to be made a default, and I also agree that ZFS would be a much better option. However, there is always going to be pushback on ZFS. If you want the best, there's a price to pay, and that's licensing headaches in this case.
I understand you but I'd like to help btrfs guys to get their stuff working. And for two days now I've tried to write a reasonable honest truthful reply for their questions backed by facts and confirmed data unable to come up with concise answers. After following this topic it became clear to me that I'm not sufficiently prepared to give a proper technical presentation of my issues or to have an in-depth discussion of btrfs while they are very well prepared to defend their position. This is happening so suddenly too. I didn't expect Fedora to start considering this at all because Red Hat isn't at least publicly discussing it. I'd also like to avoid writing massive dozen page emails about my personal issues with btrfs when the central question here is if btrfs is good enough for majority of Fedora's user base. It could be even if it isn't ready for my use.
I can only offer descriptions of symptoms of trouble from the web back-end developer / desktop end-user PoV, which start to appear on personal computers where I have used or currently use btrfs, if not full-time. I made a long list of these yesterday, and only some of them can be linked to existing known issues which are yet to be fixed, so I haven't sent that list to Chris Murphy and Stasiek Michalski yet and might not do so. Not publicly at least. Some of the issues have been fixed, but the fixes are not yet present in openSUSE Leap 15.1, where I previously experienced just how broken btrfs can be at its worst, and I don't have that particular setup right now to even test whether these changes would help me in the upcoming openSUSE Leap 15.2 release. I just have to let my head cool down before trying btrfs full-time again in a year or two.
There will likely be significant improvement with openSUSE Leap 15.2, as the kernel has been rebased from 4.12 to 5.4. With that rebase, virtually all the stabilization work done upstream that hasn't already been backported to the SUSE Linux Enterprise 15 kernel will be included.
That said, as one of the change owners, I *want* to know about your issues. I want to be able to solve them. We have an upstream Btrfs developer who wants to resolve issues people discover, and the only way we can is if we know about them and get details to pin them down and fix them. It's how this goes with any piece of software, really.
I've been using it for five years on desktops, VMs, and servers with no issues for at least the last three. But I am not so blind as to say that Btrfs is perfect. But there's nothing I can do about things I don't know about, and that's true for anything in open source.
You should feel free to file bug reports so that we can address them.
Furthermore some of the things the proponents of this change have written just throw me back into my chair because after all I've gone through with btrfs and after all the lost time I could have spent better producing code, I know what they're writing is simply not true. Or not true in my case and I have major disbelief regarding for example there being no need to run btrfs balance when on my ThinkPad T430 I know for a fact that btrfs constantly will start running out of disk space and the solutions to it only temporarily solve it through regular use of btrfs balance, disabling snapshots which tend to get corrupted anyway and fine tuning the file system. But then again I don't think they're lying and I don't want to accuse them of that. There are visibly big gaps in how btrfs is experienced by different people in different working environments on different hardware. Based on what I've read lately, btrfs seems to work at really big scales very well. Where it fails to work are smaller individual setups and small businesses. This makes it a controversial file system.
Like I explained in another message, btrfs to me is a highly visible file system and a source of stress, as I eventually have to babysit it, which to me proves it is unstable and not production ready. And this variation in how btrfs is experienced by different people is perhaps just another sign of it not being production ready yet, even if huge progress has been made in recent times. That is a good reason not to make it the default choice in Fedora.
Yesterday I went through my past emails where I discuss btrfs with my colleagues. It was some years ago, but I had similar freezing issues back then to the ones I had in 2019, and when asked about it, btrfs supporters explained to me that my particular workload is "poison to btrfs", to quote a friend of mine. That workload involves a collection of hundreds, if not up to a thousand, small C, Ruby, Python, PHP and C# files, a hundred GB of image data split into files of approximately 1 MB each, and two large local databases.
I'm sorry that you feel that way. For what it's worth, my personal workload is very similar to yours. While it is true that databases are normally "poison", I've worked around it by using "nodatacow" for those portions, and in the further past, I used to split that out as an XFS partition. At least for the last couple of years, I've had a pretty good time with doing things like package builds, compilation, and such.
However, I want it to be that we move towards making these kinds of "correct" decisions being made automatically without you having to think about it. For example, when you run the pgsql or mysql database init script we ship, it should probably set nodatacow for the folder if it's detected to be on btrfs.
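A minimal sketch of the kind of check such an init script could make, assuming a Linux system where `/proc/self/mounts` is readable and the `chattr` tool is installed. The function names and the example path are hypothetical; this is not taken from any script Fedora ships. Note that the No_COW attribute (`chattr +C`) only takes effect for files created after it is set on the directory, so it would have to be applied before the database files exist.

```python
import os
import subprocess


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` uses the /proc/self/mounts layout:
    device mountpoint fstype options dump pass
    (Mount points containing escaped spaces are not handled here.)
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Keep the longest (most specific) matching mount point;
            # on a tie the later entry wins, which covers overmounts.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type


def set_nodatacow_if_btrfs(directory):
    """Set the No_COW attribute on `directory` if it lives on btrfs.

    Returns True if the attribute was set, False if the directory is
    on some other filesystem. Raises CalledProcessError if chattr fails.
    """
    with open("/proc/self/mounts") as f:
        mounts_text = f.read()
    if fs_type_for(directory, mounts_text) != "btrfs":
        return False
    subprocess.run(["chattr", "+C", directory], check=True)
    return True
```

A database init script could call something like `set_nodatacow_if_btrfs("/var/lib/pgsql/data")` (path chosen only for illustration) on the empty data directory before initializing the cluster.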
Nothing about this is me wanting to make it a burden to use Fedora because of Btrfs. I want to use Btrfs to open doors to all kinds of interesting possibilities to make Fedora a first-class integrated desktop Linux experience that competes on the same footing as macOS and Windows.
It was recommended that I run JFS instead, which I've yet to touch, but now that I have time I should try it as well as other available options. I'm highly interested in testing nilfs and bcachefs right now.
Well, huh, I've not heard of a recommendation about JFS in a long time. For heavy I/O database workloads, I suggest XFS, though Btrfs can be made to work quite well for database workloads with stuff like nodatacow as I mentioned earlier.
NILFS (and NILFS2) are not usable on Fedora due to lack of SELinux xattr support, last I checked. BCacheFS is not in the mainline kernel and there doesn't appear to be much progress on a path to the kind of stability you seek, as the project still says that it is unstable and there doesn't seem to be a roadmap to get out of that state right now.
Furthermore, even if Fedora were to set btrfs as the default, I wouldn't use it on my main PCs, since btrfs doesn't enjoy my trust at all right now. However, I would stop recommending Fedora to my friends and family, because doing custom partitioning in Anaconda using another file system is far too complex and difficult a task to perform. It's much easier to just recommend Ubuntu or openSUSE, where partitioning with an alternative file system is a much easier and more clear-cut operation.
If it was easy to choose e.g. plain lvm+ext4 or Stratis lvm+xfs instead of btrfs during Fedora installation like it is in openSUSE I probably wouldn't be in total opposition to this proposal. I still would be against it but I wouldn't be here writing these messages about this issue and expressing my opposition to this proposal. And it would have to be fixed first before making btrfs the default file system.
It is actually quite easy to choose an alternative configuration if you want. When you go through Anaconda installation and go to storage, you can choose "Custom", and from there you have a drop-down list of partitioning schemes: plain, LVM, LVM-thin, and Btrfs. You can select any of those and have Anaconda do a default setup based on that. The current default is "LVM", and we're changing the default to "Btrfs". But it's straightforward to make this change yourself at install time.
In my experience, YaST is actually pretty hard to use to switch to alternative configurations, so I'm surprised you say that it's difficult in Anaconda but not in YaST.
In my opinion zfs isn't a perfect choice either technologically or legally. However, it is the best thing out there for people who want a usable, production-ready advanced file system right now. The issues with zfs present themselves to me as easier to solve than the problems with btrfs. But I'm not a legal expert, and it really is a shame that licensing is an issue with zfs. It's a very good file system and it didn't need to be forcefully pushed to become a success story. This is the opposite of btrfs, since its proponents constantly seem to want to forcibly push it on people who don't want it. That, combined with its continued technical issues, has turned my initial enthusiasm for btrfs into a very deep skepticism of it and its promised capabilities.
I want to address this point specifically from the context of usability as you brought it up for btrfs. Both btrfs and zfs require the same kind of hand-holding in various configurations. When using complex storage configurations (multi-disk, raid, etc.), you do need to regularly do scrubs to verify that everything stays sane. A scrub is what lets you validate that you haven't suffered issues across the disk boundaries. Now, in "simple" setups, you can typically avoid needing to do this. With what we're trying to do here, Btrfs should be transparent to the user.
That being said, documentation is something we are working on as the Change owners. I do not want people to feel helpless with this filesystem.
I wouldn't count on there being a backlash. Fedora users are usually very open to changes and used to living "near the edge". If it has already been decided to make btrfs the default and this is merely a formality, then I'm hoping for the best case: that they know what they're doing, and that the very latest version of btrfs is usable for long periods of time and works through several upgrade cycles without a reformat, even if last year on openSUSE Leap 15.1 it wasn't yet for me.
This Change process exists precisely so that we can learn from the community to make our proposal better, or, if it turns out to be unworkable, postpone or withdraw it.
-- 真実はいつも一つ!/ Always, there's only one truth!
That said, as one of the change owners, I *want* to know about your issues.
Yes, I understand. It's just that I believe the burden of proof is on my shoulders to prove that I have this or that issue before making bug reports. The problem I often face with btrfs is that its behaviour is highly inconsistent, and that makes filing bug reports with concrete evidence of an issue difficult. I should just make videos of the issues or something. Most problems also only start after several months of serious usage and cannot easily be replicated on other systems unless they're exactly the same model with exactly the same kind of disks. This isn't a problem if you constantly reformat your disks when hopping between distros, but if you just want to upgrade continuously it will become an issue, or at least it does on my machines.
For example, btrfs has for a long time had this issue where, after several months and with maybe more than 75% of disk space in use, when run on SSDs, the system can randomly stop reading from the file system, start thinking, and then eventually return. With each freeze the condition gets worse, and eventually the system is stuck for good and a power reset is required.
The way this happens is, for example, that if you open the GNOME Shell application launcher several times in a row, the likelihood that GNOME completely freezes for anywhere from a few seconds up to a minute increases. I don't see this behaviour when using any other file system, so I've attributed it to btrfs, but I have no way of knowing whether it is an actual issue in btrfs other than that it stopped when the disk was formatted to anything else.
And also notice that I wrote "maybe 75% full" because there is no way to know the actual free disk space from just "df -h". There are chapters about this in btrfs FAQ pages that df lies about disk space when using btrfs since evaluating free disk space in btrfs system is a tricky and challenging task with no good solution in sight. This is why e.g. use of "btrfs fs usage /" is required together with other tools to have some idea of available disk space.
Well, huh, I've not heard of a recommendation about JFS in a long time. For heavy I/O database workloads, I suggest XFS, though Btrfs can be made to work quite well for database workloads with stuff like nodatacow as I mentioned earlier.
Yeah, that came out of an email which was written some years ago. I'm not planning on actively using anything other than ext4 or lvm+ext4 at the moment in my daily life. However, as a result of the btrfs push I've started to look for alternatives, in case there is something that could improve my workflow. This includes testing out the current state of JFS, BCacheFS, etc.
I wanted to address some of the things you guys wrote but after several days I found myself writing more and more about just one particular thing. That thing being partitioning setup phase in Anaconda. Especially relating to the user's ability to easily choose an alternative configuration and customise the partitioning during the setup phase. I'll try to condense the most important points here.
It is actually quite easy to choose an alternative configuration if you want. When you go through Anaconda installation and go to storage, you can choose "Custom", and from there you have a drop-down list of partitioning schemes: plain, LVM, LVM-thin, and Btrfs. You can select any of those and have Anaconda do a default setup based on that. The current default is "LVM", and we're changing the default to "Btrfs". But it's straightforward to make this change yourself at install time.
In my experience, YaST is actually pretty hard to use to switch to alternative configurations, so I'm surprised you say that it's difficult in Anaconda but not in YaST.
No offense, but you're really out of touch when it comes to this issue. Unlike Fedora, openSUSE has one of the best partitioning setup phases I know about. In Fedora it is not easy to choose an alternative configuration or to clearly comprehend what it is going to do to the user's disks. This is because of UI design gone bad. It's also due to the low usability of Anaconda. Fedora's partitioning is over-engineered and too clever for its own good, and it is like a hack intended to patch a previous bad design.
Things have gone bad with it ever since around F21, or even slightly before. Thankfully blivet has been a real life-saver since it was introduced to Anaconda, I think in F26 or F27. Especially when you have two or more disks and just want to mount some partitions (e.g. /home) from one disk and reformat existing partitions (e.g. /root & /boot) on another disk.
Issues with Fedora's partitioning include (but are not limited to): 1. too many confusing separate steps, 2. no clear overview of what is actively being done to disks, 3. weird UI element placements with confusing labels, 4. similarly named selections which are totally different in actual functionality.
When analyzing Fedora's partitioning with Jakob Nielsen's usability goals as the benchmark, partitioning in Fedora fails each of the five goals a good (graphical) user interface should aim to meet:
I. For first-time users it isn't easy to accomplish basic tasks and learn through experimenting with it.
II. It is not very efficient to use, with users having to move the mouse round-trip-to-the-moon distances across the screen and click a number of buttons and selections.
III. It fails the memorability goal, since due to its complex nature you have to re-learn it after a while.
IV. It has a high number of error states and often leaves users unable to back out until they've given "correct answers", which is frustrating.
V. Ultimately, it isn't very satisfying to use.
I also want to add that it assumes users know things they might in fact not be aware of. The help button doesn't aid them much either. And the use of screen real estate could be improved in a number of ways.
For example in particular with the issue III.: Why in "installation target" step, once you choose "custom" or "advanced custom" storage configuration, the button to move to custom partitioning step is called "done" and not "next" or "continue" when it is clear that additional steps will follow? And why it's illogically placed at the top left corner instead of bottom right.
This is inconsistent with Anaconda itself, which has just taught the user that to move to the next screen from the "language selection" step they need to click "continue" in the bottom right corner. Most dialogs and wizards in Anaconda have their "accept", "next" and "cancel" actions in the bottom right corner as well. But this is not true in the "installation destination" screens. Furthermore, in Microsoft Windows (which most users would be familiar with) the top left corner button tends to mean "return to previous step without saving changes" in install wizards. Anaconda also fails to convey the difference between "custom" and "advanced custom", leaving users to try them and see what they get.
In openSUSE, YaST partitioning is a fairly straightforward process. It also fails some of Nielsen's usability goals, but in less serious ways. When you come to YaST's partitioning page, it will automatically detect storage devices and then suggest an automatic partitioning scheme to the user. In Fedora you have to specifically choose for it to do that for you.
Also unlike Fedora, openSUSE will show you, as a text list, the steps it will perform if the user chooses to accept the automatic partitioning. It isn't perfect, but at least it's something. The openSUSE people could add a more visual representation of how the disk space is going to be sliced to improve this, but the textual representation is great because it tells users step by step precisely what would happen if they accept the suggested scheme.
First automatic suggestion in openSUSE tends to go wrong (much like it goes in Fedora) but that's ok, YaST gives you two buttons to deal with this. First is "guided partitioning" which is extremely useful. It simply asks you couple questions in fairly consistent guided install wizard such as which disks to use if you have more than one, which file system to use for root, if you want LVM or not, if you want separate /home partition, if you want a separate swap partition, and so on.
This alone makes it better for most new users than Fedora's back and forth way. With just three mouse clicks you've customised automatic partitioning and often with sane values. It's far easier to do plain ext4, lvm+xfs, btrfs or some other combination in YaST than in Anaconda. Because of this it doesn't really matter that openSUSE by default suggests to use btrfs since it's very easy to switch away from it. Furthermore the fact that you can constantly see a list of things it will do if you accept is great help. This is the way Anaconda should work as well.
You can even very easily customise the suggested partitioning by clicking "expert partitioning", which then offers to let you start from scratch or from the suggested partition layout. And not only is YaST's "advanced custom" partitioning remarkably more comprehensive, it is also simpler to use and understand. You will see a proper device tree and a proper list of partitions. Users can sort them and instantly see whether operations are going to be performed, and which operations in particular.
The cherry on top is that YaST can recognise your existing Linux installation, read its /etc/fstab, and use that as the basis of the suggested partitioning, including the mount points. All this can be demonstrated using a few screenshots as well if necessary. Fedora can learn a great deal from openSUSE's partitioning setup phase, I think.
Finally just think about the instruction you wrote about. You inadvertently admit within those instructions that it is more complex process in Fedora to switch to alternative layout than it is in openSUSE often in Fedora requiring more mouse clicks, more UI navigation, better in-depth understanding of what it is going to do to your disks and faith that it will actually do what you asked it to do.
On Sat, Jul 11, 2020 at 5:55 AM Antti antti.aspinen@gmail.com wrote:
For example, btrfs has for a long time had this issue where, after several months and with maybe more than 75% of disk space in use, when run on SSDs, the system can randomly stop reading from the file system, start thinking, and then eventually return. With each freeze the condition gets worse, and eventually the system is stuck for good and a power reset is required.
This is not normal and not acceptable. It is unfortunately true that there is a disproportionate burden placed on those having problems no one else is having. And troubleshooting amounts to either poking it with a stick (try this! no, try this! ok, now try this!) or providing sufficiently detailed reproduction steps. And that's tedious too.
The way this happens is, for example, that if you open the GNOME Shell application launcher several times in a row, the likelihood that GNOME completely freezes for anywhere from a few seconds up to a minute increases. I don't see this behaviour when using any other file system, so I've attributed it to btrfs, but I have no way of knowing whether it is an actual issue in btrfs other than that it stopped when the disk was formatted to anything else.
My suggestion for any such freeze/hang is to issue sysrq+t. This might not be easy to do at exactly the time of the hang, because the hang prevents it from being typed fast enough. (a) remote ssh session with sysrq+t typed out and ready to just hit enter (b) netconsole, same concept. Reproduce the problem and then hit enter. Then file a bug with 'journalctl -k -o short-monotonic > bug#_journal.txt' - likely the default dmesg buffer will be too small to hold everything but the journal will have it. That should expose the nature of the hang.
If kernel messages show there's a blocked task for 2 minutes, in that case it's better to use sysrq+w.
In this case it's not necessary to have extremely detailed reproduction steps, nor wait for someone to have a properly aged system to see what's going on.
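The capture steps above could be scripted roughly as follows. This is only a sketch, assuming a systemd-based system, root privileges, and that writing a key to `/proc/sysrq-trigger` is an acceptable stand-in for typing sysrq+t or sysrq+w on the keyboard. The function names and the output path are hypothetical.

```python
import subprocess

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def journal_command():
    """Build the journalctl call that dumps kernel messages."""
    return ["journalctl", "-k", "-o", "short-monotonic", "--no-pager"]


def capture_hang_report(out_path, key="t"):
    """Trigger a sysrq dump, then save the kernel log to `out_path`.

    key="t" dumps all tasks; key="w" dumps only blocked tasks.
    Must be run as root, ideally from an ssh session that was
    prepared before reproducing the hang.
    """
    if key not in ("t", "w"):
        raise ValueError("expected 't' (all tasks) or 'w' (blocked tasks)")
    with open(SYSRQ_TRIGGER, "w") as trigger:
        trigger.write(key)
    with open(out_path, "w") as out:
        subprocess.run(journal_command(), stdout=out, check=True)
```

With this ready in a remote session, one would reproduce the freeze and then run, for example, `capture_hang_report("journal.txt")`, attaching the resulting file to the bug report.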
And also notice that I wrote "maybe 75% full" because there is no way to know the actual free disk space from just "df -h". There are chapters about this in btrfs FAQ pages that df lies about disk space when using btrfs since evaluating free disk space in btrfs system is a tricky and challenging task with no good solution in sight. This is why e.g. use of "btrfs fs usage /" is required together with other tools to have some idea of available disk space.
In the single device case, 'df' is expected to tell the truth. In the multiple device case, it should still tell the truth, but can be confusing because it can't tell the whole truth. And for that, there is 'btrfs filesystem usage /mnt' which provides quite a lot more information, to the degree it can be confusing at first. But the single device case is really straight forward, I just use 'df' and 'du' most of the time unless for some reason I want more information.
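For the single-device case, a small sketch of reading the same numbers programmatically, assuming Python's `shutil.disk_usage`, which is based on the same `statvfs` data that `df` reads. The helper names are hypothetical, the percentage may differ slightly from the `Use%` column because `df` computes it a little differently, and `btrfs filesystem usage` may need root to show complete information.

```python
import shutil
import subprocess


def percent_used(total, used):
    """Return used space as a percentage of total (0.0 for an empty total)."""
    return 0.0 if total == 0 else 100.0 * used / total


def report(path="/"):
    """Print a df-like one-line summary for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    print(f"{path}: {percent_used(usage.total, usage.used):.1f}% used, "
          f"{usage.free / 2**30:.1f} GiB free")


def btrfs_usage(path="/"):
    """Return the more detailed output of 'btrfs filesystem usage'."""
    result = subprocess.run(
        ["btrfs", "filesystem", "usage", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `report("/")` gives the quick answer; `btrfs_usage("/")` is the fallback when the multi-device picture or the data/metadata split matters.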
Recent example of multiple device confusion: https://bugzilla.redhat.com/show_bug.cgi?id=1855174 https://lore.kernel.org/linux-btrfs/0326afd3-9e14-b682-30e7-1c8ae44813ea@lec...
All,
As a long time user of Fedora I have run into nearly all the same issues as mentioned below by Antti. btrfs is not ready to be default: https://news.ycombinator.com/item?id=14907771
On Sun, Jun 28, 2020 at 5:21 AM Antti antti.aspinen@gmail.com wrote:
Hello,
I'm in total opposition to this proposal as a long-time Fedora user. The btrfs is unstable and not ready for production. Most of what I'm about to write is admittedly anecdotal but it's the only file system in Linux which has actively and regularly caused me to lose data on desktops, laptops, servers and even on mobile phones when I haven't taken precautions and done regular backups. Something I don't have to actively do when using ext4 in my workstations and notebooks.
This has happened to me because OpenSUSE and Jolla's Sailfish OS use btrfs as their default file system. I've tried using btrfs from time to time in various environments to see how it's progressing. However, there haven't been fixes in years for long-standing issues in btrfs when it comes to desktops and laptops. Btrfs can still, for example, run out of its automatically managed "metadata space", which it cannot recover from. Even the relatively recent improvements in kernel 5.8 have already been proven not to improve the situation much, although at least subvolume deletion failing over lack of disk space is now handled slightly better.
You could probably just ask OpenSUSE and SUSE for their issue statistics to see how unreliable btrfs is in reality. I hypothesise that a large majority of OpenSUSE users don't actually use the supposed default file system of their distro and instead opt to use zfs, xfs and ext4.
I'm honestly in shock that this is even a discussion right now again. If there is a legitimate urgent need to switch the default file system for desktop and laptop users (and I understand why there is pressure to do so since ext4 has a number of shortcomings), then whatever legal obstacles there are blocking the use of zfs should be cleared and zfs should be used instead. Canonical with their Ubuntu is already trying to do this through use of OpenZFS. The xfs has started to have issues as of late but even it would be a legitimate choice.
The absolute first issue with btrfs on desktops and laptops is that it requires active, conscious maintenance from end-users to avoid a large number of potentially disastrous situations, as well as constant automatic maintenance in the background which wears the disks and eats resources. Based on my experiences, btrfs works best when you don't use the features you supposedly install it for. Its snapshots are a great example of that. Which is why I suspect that most btrfs "success stories" are ones where the users don't take advantage of btrfs' features or have actively turned them off, conscious of the issues they bring up later on. Using btrfs doesn't make using a PC easier and instead does the opposite by adding more work. Meanwhile zfs has a reliable and working snapshots feature which is in actual use.
With btrfs the following is a very common situation: it's not too uncommon for users to have their entire disks full or nearly full. Okay, users will then delete some files, maybe a few large applications, but in btrfs that is often not enough. The user then has to manually run a btrfs-balance operation with filters, and it usually resolves the situation, but it will start happening more frequently until it's completely unsolvable for the end-user without major external assistance or a reformat.
And what inevitably happens with a btrfs root volume is that the system can and will stop booting after a period of "strange behaviour". Sometimes it can be resolved in maintenance mode, but usually the end-user then has to boot a live environment, chroot into their system, and delete all (hopefully backed-up) large files if the file system is not read-only (or clear that obstacle first), delete (most) snapshots, and run a btrfs-balance operation, doing it very carefully or the entire file system might be lost. This will take a very long time (ranging from 30 minutes to some hours and up to 3-4 days based on my experiences) even on relatively small SSDs (not to mention HDDs), and it will also shorten SSD lifespan.
If a laptop is put into sleep mode without the user noticing that btrfs is running maintenance ops in the background (and it often is), the likelihood that the file system will get corrupted goes through the roof. Something users can do as first aid is use TLP and set SATA_LINKPWR_ON_BAT=max_performance, which will then shorten the amount of time the laptop can be used without recharging. And this has been a standing issue at least since 2015 with no real fix in sight other than "lol, stop using btrfs", like one commentator at Reddit wrote.
The btrfs-check is also a massive can of worms and it cannot be safely run. At least not without reading pages upon pages of manual and becoming an expert in understanding how btrfs works. Expecting every Fedora end-user to do this is unrealistic in many different ways.
Btrfs has no native encryption to my knowledge. However, alternatives such as zfs already have trusted and reliable encryption used in numerous FreeNAS installations around the world.
And much of these issues and many more are straight up mentioned in btrfs' own wiki pages at kernel.org where one of the most shocking admissions is: "So, in general, it is impossible to give an accurate estimate of the amount of free space on any btrfs filesystem. Yes, this sucks."
Source:
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_is_free_space_so_complicated...
And these are the brains behind btrfs admitting that there is no solution for this. No amount of userspace tools development and UX/DE integration is going to solve this for end users.
Please, don't switch to btrfs. It is not mature. It is not well-understood. It is not properly "battle-tested". It can still die on its own. It's just a ridiculous meme file system. At this point it would take about a decade of smooth sailing on the openSUSE side for me to start believing that btrfs is ready for prime time on my own personal Fedora systems. Even 5 years of smooth sailing would give me more faith in it. But as it stands, I have to strongly oppose btrfs. It's too much of a headache with no relief in sight.
-- Antti (Hopeakoski)
P.S. Sorry for the emotional nature of this message. But I really, really like my Fedora and I really, really dislike btrfs due to past highly negative experiences with it (some of them happening to me as recently as last year).
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
On Sun, Jun 28, 2020 at 6:21 AM Antti antti.aspinen@gmail.com wrote:
Hello,
I'm in total opposition to this proposal as a long-time Fedora user. The btrfs is unstable and not ready for production. Most of what I'm about to write is admittedly anecdotal but it's the only file system in Linux which has actively and regularly caused me to lose data on desktops, laptops, servers and even on mobile phones when I haven't taken precautions and done regular backups. Something I don't have to actively do when using ext4 in my workstations and notebooks.
We had similar excitement back when reiserfs was popular. My limited play with btrfs reminds me of reiserfs: feature-filled but fragile and unsuitable for "/" partitions.
On 2020-07-01 04:07, Nico Kadel-Garcia wrote:
We had similar excitement back when reiserfs was popular. My limited play with btrfs reminds me of reiserfs: feature-filled but fragile and unsuitable for "/" partitions.
My experience with reiserfs has been very different. It was a wonderful filesystem, with journaling when nobody else had it (ext3 did not exist, we only had ext2). It increased the effective capacity of the disk (thanks to tail-packing) and the performance was great (I could immediately tell whether a filesystem was ext2 or reiserfs as soon as I deleted a big ISO: on reiserfs it was immediate, thanks to extents vs blocks, I think).
I've never had filesystem corruption on reiserfs, even under very hard workloads. A specific episode remains in my mind: I had rsync hardlink-based backups on a server with software RAID disks. One day I decided to delete about one year of backups; at four per day it was around 1000 directories, each of them with 100,000 files (heavily cross-hardlinked, of course). To try to get good parallelism out of the RAID and the elevator, I submitted 1000 rm commands in the background and then realized that I was deleting 100,000,000 files, with enormous lock and refcount stress on the fs. After a few hours the operation completed with no issue. The same production server was switched years later to ext4, since continuing with a mostly world-forgotten reiserfs had no point. After a few days with ext4... fs corruption and data loss, simply because of an online expansion that was nothing special on reiserfs. It turns out the ext4 kernel/tools combination was buggy. I've been able to find my comments on Bugzilla (year 2012): https://bugzilla.redhat.com/show_bug.cgi?id=852833#c25
including this: (BTW, ...spent the weekend restoring a couple terabytes from backups...) (As a tribute to justice in the world, I want to say, so that search engines index it, that this system has been in production since 2005 on reiserfs and abused and resized without any similar issues; it was rebuild recently on new hardware and switched to ext4 and now I really risked to lose the data, as the remote versioned backup is also (resized) ext4, and the remote-remote backup is also (resized) ext4. I'm not complaining, but reiserfs really has a much worse reputation than actually deserved).
If this change is accepted, I think the Anaconda partitioning dialog for the Btrfs scheme needs to be modified. Its behavior is difficult and not obvious when a user wants to change the automatically created partition scheme and resize Btrfs volumes. See this bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1851212
On Fri, 26 Jun 2020 at 17:46, Ben Cotton bcotton@redhat.com wrote:
https://fedoraproject.org/wiki/Changes/BtrfsByDefault
== Summary ==
For laptop and workstation installs of Fedora, we want to provide file system features to users in a transparent fashion. We want to add new features, while reducing the amount of expertise needed to deal with situations like [https://pagure.io/fedora-workstation/issue/152 running out of disk space.] Btrfs is well adapted to this role by design philosophy, let's make it the default.
== Owners ==
- Names: [[User:Chrismurphy|Chris Murphy]], [[User:Ngompa|Neal Gompa]], [[User:Josef|Josef Bacik]], [[User:Salimma|Michel Alexandre Salim]], [[User:Dcavalca|Davide Cavalca]], [[User:eeickmeyer|Erich Eickmeyer]], [[User:ignatenkobrain|Igor Raits]], [[User:Raveit65|Wolfgang Ulbrich]], [[User:Zsun|Zamir SUN]], [[User:rdieter|Rex Dieter]], [[User:grinnz|Dan Book]], [[User:nonamedotc|Mukundan Ragavan]]
- Emails: chrismurphy@fedoraproject.org, ngompa13@gmail.com, josef@toxicpanda.com, michel@michel-slm.name, dcavalca@fb.com, erich@ericheickmeyer.com, ignatenkobrain@fedoraproject.org, fedora@raveit.de, zsun@fedoraproject.org, rdieter@gmail.com, grinnz@gmail.com, nonamedotc@gmail.com
- Products: All desktop editions, spins, and labs
- Responsible WGs: Workstation Working Group, KDE Special Interest Group
== Detailed Description ==
Fedora desktop edition/spin variants will switch to using Btrfs as the filesystem by default for new installs. Labs derived from these variants inherit this change, and other editions may opt into this change.
The change is based on the installer's custom partitioning Btrfs preset. It's been well tested for 7 years.
'''''Current partitioning'''''<br /> <span style="color: tomato">vg/root</span> LV mounted at <span style="color: tomato">/</span> and a <span style="color: tomato">vg/home</span> LV mounted at <span style="color: tomato">/home</span>. These are separate file system volumes, with separate free/used space.
'''''Proposed partitioning'''''<br /> <span style="color: tomato">root</span> subvolume mounted at <span style="color: tomato">/</span> and <span style="color: tomato">home</span> subvolume mounted at <span style="color: tomato">/home</span>. Subvolumes don't have size, they act mostly like directories, space is shared.
'''''Unchanged'''''<br /> <span style="color: tomato">/boot</span> will be a small ext4 volume. A separate boot is needed to boot dm-crypt sysroot installations; it's less complicated to keep the layout the same, regardless of whether sysroot is encrypted. There will be no automatic snapshots/rollbacks.
If you choose to encrypt your data, LUKS (dm-crypt) will still be used as it is today (with the small difference that Btrfs is used instead of LVM+Ext4). There is upstream work on native encryption for Btrfs; it will be considered once ready and is the subject of a different change proposal in a future Fedora release.
=== Optimizations (Optional) ===
The detailed description above is the proposal. It's intended to be a minimalist and transparent switch. It's also the same as was [[Features/F16BtrfsDefaultFs|proposed]] (and [https://lwn.net/Articles/446925/ accepted]) for Fedora 16. The following optimizations improve on the proposal, but are not critical. They are also transparent to most users. The general idea is to agree to the base proposal first, and then consider these as enhancements.
==== Boot on Btrfs ====
- Instead of a 1G ext4 boot, create a 1G Btrfs boot.
- Advantage: Makes it possible to include in a snapshot and rollback regime. GRUB has had stable support for Btrfs for 10+ years.
- Scope: Contingent on bootloader and installer team review and approval. blivet should use <code>mkfs.btrfs --mixed</code>.
==== Compression ====
- Enable transparent compression using zstd on select directories: <span style="color: tomato">/usr</span> <span style="color: tomato">/var/lib/flatpak</span> <span style="color: tomato">~/.local/share/flatpak</span>
- Advantage: Saves space and significantly increases the lifespan of flash-based media by reducing write amplification. It may improve performance in some instances.
- Scope: Contingent on installer team review and approval to enhance anaconda to perform the installation using <code>mount -o compress=zstd</code>, then set the proper XATTR for each directory. The XATTR can't be set until after the directories are created via: rsync, rpm, or unsquashfs based installation.
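As a rough illustration of the post-install step described above, the following Python sketch only builds (and does not run) one command per directory. It assumes the per-directory setting would be applied with <code>btrfs property set ... compression zstd</code>, which, to my understanding, is stored as an extended attribute on the directory; the proposal itself does not say which tool the installer would use, and the names <code>PROPOSAL_DIRS</code>, <code>compression_commands</code>, and <code>render</code> are made up for this example.

```python
import os
import shlex

# Directories named in the proposal. "~" stands for a user's home directory
# and is expanded for whichever user runs this sketch.
PROPOSAL_DIRS = ("/usr", "/var/lib/flatpak", "~/.local/share/flatpak")


def compression_commands(directories=PROPOSAL_DIRS, algorithm="zstd"):
    """Return one argv list per directory; nothing is executed here."""
    return [
        ["btrfs", "property", "set", os.path.expanduser(d), "compression", algorithm]
        for d in directories
    ]


def render(commands):
    """Turn argv lists into shell-quoted strings for review or logging."""
    return [shlex.join(argv) for argv in commands]
```

Calling <code>render(compression_commands())</code> gives a dry-run listing that can be reviewed before anything is run as root on a real Btrfs volume.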
==== Additional subvolumes ====
- <span style="color: tomato">/var/log/</span> <span style="color: tomato">/var/lib/libvirt/images</span> and <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> will use separate subvolumes.
- Advantage: Makes it easier to exclude them from snapshots, rollbacks, and send/receive. (Btrfs snapshotting is not recursive, it stops at a nested subvolume.)
- Scope: Anaconda knows how to do this already, just change the kickstart to add additional subvolumes (minus the subvolume in <span style="color: tomato">~/</span>). GNOME Boxes will need enhancement to detect that the user home is on Btrfs and create <span style="color: tomato">~/.local/share/gnome-boxes/images/</span> as a subvolume.
== Feedback ==
==== Red Hat doesn't support Btrfs? Can Fedora do this? ====
Red Hat supports Fedora well, in many ways. But Fedora already works closely with, and depends on, upstreams. And this will be one of them. That's an important consideration for this proposal. The community has a stake in ensuring it is supported. Red Hat will never support Btrfs if Fedora rejects it. Fedora necessarily needs to be first, and make the persuasive case that it solves more problems than alternatives. Feature owners believe it does, hands down.
The Btrfs community has users that have been using it for most of the past decade at scale. It's been the default on openSUSE (and SUSE Linux Enterprise) since 2014, and Facebook has been using it for all their OS and data volumes, in their data centers, for almost as long. Btrfs is a mature, well-understood, and battle-tested file system, used on both desktop/container and server/cloud use-cases. We do have developers of the Btrfs filesystem maintaining and supporting the code in Fedora, one is a Change owner, so issues that are pinned to Btrfs can be addressed quickly.
==== What about device-mapper alternatives? ====
dm-thin (thin provisioning): [https://pagure.io/fedora-workstation/issue/152 Issue #152] still happens, because the installer won't over provision by default. It still requires manual intervention by the user to identify and resolve the problem. Upon growing a file system on dm-thin, the pool is over committed, and file system sizes become a fantasy: they don't add up to the total physical storage available. The truth of used and free space is only known by the thin pool, and CLI and GUI programs are unprepared for this. Integration points like rpm free space checks or GNOME disk-space warnings would have to be adapted as well.
dm-vdo: is not yet merged, and isn't as straightforward to selectively enable per directory and per file, as is the case on Btrfs using <code>chattr +c</code> on <span style="color: tomato">/var/lib/flatpak/</span>.
Btrfs solves the problems that need solving, with few side effects or pitfalls for users. It has more features we can take advantage of immediately and transparently: compression, integrity, and IO isolation. Many Btrfs features and optimizations can be opted into selectively per directory or file, such as compression and nodatacow, rather than as a layer that's either on or off.
==== What about UI/UX and integration in the desktop? ====
If Btrfs isn't the default file system, there's no commitment, nor reason to work on any UI/UX integration. There are ideas to make certain features discoverable: selective compression; systemd-homed may take advantage of either Btrfs online resize, or near-term planned native encryption, which could make it possible to live convert non-encrypted homes to encrypted; and system snapshot and rollbacks.
Anaconda already has sophisticated Btrfs integration.
==== What Btrfs features are recommended and supported? ====
The primary goal of this feature is to be largely transparent to the user. It does not require or expect users to learn new commands, or to engage in peculiar maintenance rituals.
The full set of Btrfs features that is considered stable and enabled by default upstream will be enabled in Fedora. Fedora is a community project. What is supported within Fedora depends on what the community decides to put forward in terms of resources.
The upstream [https://btrfs.wiki.kernel.org/index.php/Status Btrfs feature status page].
==== Are subvolumes really mostly like directories? ====
Subvolumes behave like directories in terms of navigation in both the GUI and CLI, e.g. <code>cp</code>, <code>mv</code>, <code>du</code>, owner/permissions, and SELinux labels. They also share space, just like a directory.
But it is an incomplete answer.
A subvolume is an independent file tree, with its own POSIX namespace, and has its own pool of inodes. This means inode numbers repeat themselves on a Btrfs volume. Inodes are only unique within a given subvolume. A subvolume has its own st_dev, so if you use <code>stat FILE</code> it reports a device value referring to the subvolume the file is in. And it also means hard links can't be created between subvolumes. From this perspective, subvolumes start looking more like a separate file system. But subvolumes share most of the other trees, so they're not truly independent file systems. They're also not block devices.
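A minimal Python sketch of how a program can observe these properties, using only standard <code>os</code> calls. The helper names <code>same_st_dev</code> and <code>try_hard_link</code> are invented for this illustration, and the sketch is not specific to Btrfs: a differing <code>st_dev</code> and the resulting <code>EXDEV</code> error look the same for separate file systems as for separate subvolumes.

```python
import errno
import os


def same_st_dev(path_a, path_b):
    """Return True when both paths report the same st_dev value."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def try_hard_link(src, dst):
    """Try to hard link src to dst.

    Returns True on success and False when the kernel refuses with EXDEV,
    which is the error expected for a link across subvolumes or file systems.
    Any other error is re-raised.
    """
    try:
        os.link(src, dst)
        return True
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            return False
        raise
```

On a default layout as proposed, one would expect <code>same_st_dev("/", "/home")</code> to be false even though both subvolumes share the same pool of free space.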
== Benefit to Fedora ==
Problems Btrfs helps solve:
- Users running out of free space on either <span style="color: tomato">/</span> or <span style="color: tomato">/home</span> [https://pagure.io/fedora-workstation/issue/152 Workstation issue #152]
** "one big file system": no hard barriers like partitions or logical volumes
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
** reflinks and snapshots are more efficient for use cases like containers (Podman supports both)
- Storage devices can be flaky, resulting in data corruption
** Everything is checksummed and verified on every read
** Corrupt data results in EIO (input/output error), instead of resulting in application confusion, and isn't replicated into backups and archives
- Poor desktop responsiveness when under pressure [https://pagure.io/fedora-workstation/issue/154 Workstation issue #154]
** Currently only Btrfs has proper IO isolation capability via cgroups2
** Completes the resource control picture: memory, cpu, IO isolation
- File system resize
** Online shrink and grow are fundamental to the design
- Complex storage setups are... complicated
** Simple and comprehensive command interface. One master command
** Simpler to boot, all code is in the kernel, no initramfs complexities
** Simple and efficient file system replication, including incremental backups, with <code>btrfs send</code> and <code>btrfs receive</code>
== Scope ==
- Proposal owners:
** Submit PRs for Anaconda to change <code>default_scheme = BTRFS</code> in the proper product files.
** Multiple test days: build community support network
** Aid with documentation
- Other developers:
** Anaconda, review PRs and merge
** Bootloader team, review PRs and merge
** Recommended optimization <code>chattr +C</code> set on the containing directory for virt-manager and GNOME Boxes.
Release engineering: [https://pagure.io/releng/issue/9545 #9545]
Policies and guidelines: N/A
Trademark approval: N/A
== Upgrade/compatibility impact ==
Change will not affect upgrades.
Documentation will be provided for existing Btrfs users to "retrofit" their setups to that of a default Btrfs installation (base plus any approved options).
== How To Test ==
'''''Today'''''<br /> Do a custom partitioning installation; change the scheme drop-down menu to Btrfs; click the blue "automatically create partitions"; and install.<br /> Fedora 31, 32, Rawhide, on x86_64 and ARM.
'''''Once change lands'''''<br /> It should be simple enough to test, just do a normal install.
== User Experience ==
==== Pros ====
- Mostly transparent
- Space savings from compression
- Longer lifespan of hardware, also from compression.
- Utilities for used and free space, CLI and GUI, are expected to behave the same. No special commands are required.
- More detailed information can be revealed by <code>btrfs</code> specific commands.
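For reference, this is roughly what a generic tool sees when it asks for used and free space. The sketch uses Python's <code>shutil.disk_usage</code>, which reads the same kind of file system statistics that utilities such as <code>df</code> rely on; the function name <code>usage_summary</code> is invented for this example, and on Btrfs the returned figures are the kernel's estimate rather than an exact allocation breakdown.

```python
import shutil


def usage_summary(path="/"):
    """Return total, used, and free bytes plus percent used for a mount point."""
    usage = shutil.disk_usage(path)
    # Guard against a zero-sized pseudo file system to avoid dividing by zero.
    percent_used = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "percent_used": percent_used,
    }


if __name__ == "__main__":
    print(usage_summary("/"))
```

With the proposed layout, asking about <code>/</code> and <code>/home</code> should report the same totals, because both subvolumes draw from one shared pool.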
==== Enhancement opportunities ====
[https://bugzilla.redhat.com/show_bug.cgi?id=906591 updatedb does not index /home when /home is a bind mount] Can also affect rpm-ostree installations, including Silverblue.
[https://gitlab.gnome.org/GNOME/gnome-usage/-/issues/49 GNOME Usage: Incorrect numbers when using multiple btrfs subvolumes] This isn't Btrfs specific, happens with "one big ext4" volume as well.
[https://gitlab.gnome.org/GNOME/gnome-boxes/-/issues/88 GNOME Boxes, RFE: create qcow2 with 'nocow' option when on btrfs /home] This is Btrfs specific, and is a recommended optimization for both GNOME Boxes and virt-manager.
[https://github.com/containers/libpod/issues/6563 containers/libpod: automatically use btrfs driver if on btrfs]
== Dependencies ==
None.
== Contingency Plan ==
Contingency mechanism: Owner will revert changes back to LVM+ext4
Contingency deadline: Beta freeze
Blocks release? Yes
Blocks product? Workstation and KDE
== Documentation ==
Strictly speaking no documentation is required reading for users. But there will be some Fedora documentation to help get the ball rolling.
For those who want to know more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page btrfs wiki main page and full feature list.]
<code>man 5 btrfs</code> contains: mount options, features, swapfile support, checksum algorithms, and more<br /> <code>man btrfs</code> contains an overview of the btrfs subcommands<br /> <code>man btrfs <nowiki><subcommand></nowiki></code> will show the man page for that subcommand
NOTE: The btrfs command will accept partial subcommands, as long as it's not ambiguous. These are equivalent commands:<br /> <code>btrfs subvolume snapshot</code><br /> <code>btrfs sub snap</code><br /> <code>btrfs su sn</code>
You'll discover your own convention. It might be preferable to write out the full command on forums and lists, but then maybe some folks don't learn about this useful shortcut?
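The idea of accepting any unambiguous abbreviation can be sketched in a few lines of Python. This is only an illustration of the matching rule, not the actual btrfs-progs implementation; the command lists below are a small hand-picked subset, and the names <code>resolve</code> and <code>expand</code> are invented for the example.

```python
# Illustrative subset of command groups and subvolume subcommands (assumption:
# the real tool knows more of both).
GROUPS = ("balance", "device", "filesystem", "scrub", "send", "subvolume")
SUBVOLUME_COMMANDS = ("create", "delete", "list", "show", "snapshot")


def resolve(word, candidates):
    """Return the single candidate that word abbreviates, or raise ValueError."""
    if word in candidates:
        return word  # an exact match always wins
    matches = [name for name in candidates if name.startswith(word)]
    if len(matches) == 1:
        return matches[0]
    reason = "ambiguous" if matches else "unknown"
    raise ValueError(f"{reason} command: {word!r}")


def expand(words):
    """Expand abbreviated words such as ["su", "sn"] to their full names."""
    group = resolve(words[0], GROUPS)
    if group != "subvolume" or len(words) < 2:
        return [group] + list(words[1:])
    return [group, resolve(words[1], SUBVOLUME_COMMANDS)] + list(words[2:])
```

Under these assumed lists, <code>expand(["su", "sn"])</code> and <code>expand(["sub", "snap"])</code> both yield <code>subvolume snapshot</code>, while a lone <code>s</code> is rejected as ambiguous.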
For those who want to know a lot more:
[https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation Btrfs developer documentation]<br /> [https://github.com/btrfs/btrfs-dev-docs/blob/master/trees.txt Btrfs trees]
== Release Notes ==
The default file system on the desktop is Btrfs.
-- Ben Cotton He / Him / His Senior Program Manager, Fedora & CentOS Stream Red Hat TZ=America/Indiana/Indianapolis
The master branch for cp now defaults to copy-on-write on filesystems that support reflinks, which should make copies more efficient if Fedora starts using btrfs: https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=25725f9d41735d176....
Dolphin and KIO also seem like they will start doing this: https://bugs.kde.org/show_bug.cgi?id=326880, https://invent.kde.org/frameworks/kio/commit/c2faaae697f11ee600989b67b440698....
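To make the "try a reflink, otherwise copy normally" behaviour concrete, here is a small Python sketch. It is not how cp or KIO are implemented; it assumes a Linux system where the <code>FICLONE</code> ioctl has the value <code>0x40049409</code> (true on common architectures such as x86 and ARM, but an assumption elsewhere), and the function name <code>copy_with_reflink</code> is invented for this example.

```python
import fcntl
import shutil

# Linux FICLONE ioctl request number. Assumption: this value is correct on
# common architectures (x86, ARM); it differs on a few others.
FICLONE = 0x40049409


def copy_with_reflink(src, dst):
    """Copy src to dst, sharing extents when the file system allows it.

    Returns True if the copy was made as a reflink (clone), and False if the
    clone was refused and a regular byte-for-byte copy was made instead.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to make dst share src's extents (copy-on-write).
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            pass  # not supported here (e.g. ext4, or files on different file systems)
    shutil.copyfile(src, dst)
    return False
```

On Btrfs or XFS with reflinks enabled the call should return True almost instantly regardless of file size; on ext4 it falls back to an ordinary copy and returns False.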
Beyond these recent changes, there are many other reasons to use btrfs, such as that Podman has a btrfs driver that might make containers more efficient, that ostree makes limited use of reflinks when they are available, that many filesystem options can be changed and new features and better defaults used even after the filesystem was initially created, that resize operations can be done online, and that there are uniform checksums on all metadata blocks, giving guarantees against corruption.
XFS also has reflinks, but lacks many features of btrfs, and switching from ext4 to XFS would mean losing cgroup writeback. XFS would mean no transparent compression too.
Switching from ext4 to OpenZFS, even putting aside license concerns from Red Hat, risks kernel releases being delayed or Fedora not being able to release with recent kernels. It makes kernel updates in Fedora dependent on the OpenZFS community releasing new versions compatible with recent kernels fast enough. And this is a concern, because many upstream kernel maintainers indicated they have little interest in avoiding breaking OpenZFS or doing any extra work to get it to work. (See, notably, https://lkml.org/lkml/2019/1/10/733 and https://www.realworldtech.com/forum/?threadid=189711&curpostid=189841.) I appreciate that Fedora’s kernel maintainers release new kernels quickly and think this is something that works well in Fedora. Supporting any out-of-tree modules in Fedora repos, including filesystems, would endanger this.
Also, in general, I think it is not a good idea to use things that your upstreams are not interested in, do not want to support, and do not recommend using.
Staying on ext4 means not having reflinks, transparent compression, online resize, deduplication, strong guarantees against corruption, and that improved filesystem defaults or new features can be used only by recreating the filesystem and reinstalling Fedora. In consideration of that, I am favorable to the change proposal targeting btrfs in Fedora.
On Sunday, June 28, 2020 11:31:15 PM MST Mark Otaris wrote:
The master branch for cp now defaults to copy-on-write on filesystems that support reflinks, which should make copies more efficient if Fedora starts using btrfs: https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=25725f9d41735d176d73a757430739fb71c7d043.
Dolphin and KIO also seem like they will start doing this: https://bugs.kde.org/show_bug.cgi?id=326880, https://invent.kde.org/frameworks/kio/commit/c2faaae697f11ee600989b67b4406981838ae628.
Beyond these recent changes, there are many other reasons to use btrfs, such as that Podman has a btrfs driver that might make containers more efficient, that ostree makes limited use of reflinks when they are available, that many filesystem options can be changed and new features and better defaults used even after the filesystem was initially created, that resize operations can be done online, and that there are uniform checksums on all metadata blocks, giving guarantees against corruption.
XFS also has reflinks, but lacks many features of btrfs, and switching from ext4 to XFS would mean losing cgroup writeback. XFS would mean no transparent compression too.
Switching from ext4 to OpenZFS, even putting aside license concerns from Red Hat, risks kernel releases being delayed or Fedora not being able to release with recent kernels. It makes kernel updates in Fedora dependent on the OpenZFS community releasing new versions compatible with recent kernels fast enough. And this is a concern, because many upstream kernel maintainers indicated they have little interest in avoiding breaking OpenZFS or doing any extra work to get it to work. (See, notably, https://lkml.org/lkml/2019/1/10/733 and https://www.realworldtech.com/forum/?threadid=189711&curpostid=189841.) I appreciate that Fedora’s kernel maintainers release new kernels quickly and think this is something that works well in Fedora. Supporting any out-of-tree modules in Fedora repos, including filesystems, would endanger this.
Also, in general, I think it is not a good idea to use things that your upstreams are not interested in, do not want to support, and do not recommend using.
Staying on ext4 means not having reflinks, transparent compression, online resize, deduplication, strong guarantees against corruption, and that improved filesystem defaults or new features can be used only by recreating the filesystem and reinstalling Fedora. In consideration of that, I am favorable to the change proposal targeting btrfs in Fedora.
For the best filesystem ever created, ZFS, I can't say that I agree with your assessment of that value. Having ZFS in Fedora would throw Fedora over the top as being the best Linux distro, hands down. I can count the number of times that having root on ZFS has led to me waiting on kernel updates over the past three years on one hand, and could still do so if I had half as many fingers!
On 6/28/20 11:35 PM, John M. Harris Jr wrote:
For the best filesystem ever created, ZFS, I can't say that I agree with your assessment of that value. Having ZFS in Fedora would throw Fedora over the top as being the best Linux distro, hands down. I can count the number of times that having root on ZFS has led to me waiting on kernel updates over the past three years on one hand, and could still do so if I had half as many fingers!
How many times are you going to keep mentioning ZFS? It's completely off the table, not allowed, never happening. (I consider the chance of Oracle doing something reasonable to be immeasurably small.)
On Monday, June 29, 2020 12:18:28 AM MST Samuel Sieb wrote:
How many times are you going to keep mentioning ZFS? It's completely off the table, not allowed, never happening. (I consider the chance of Oracle doing something reasonable to be immeasurably small.)
See the relevant section of Mark's email. I also don't see how it'd require Oracle to change anything in order to get OpenZFS into Fedora.
On 6/29/20 12:27 AM, John M. Harris Jr wrote:
See the relevant section of Mark's email. I also don't see how it'd require Oracle to change anything in order to get OpenZFS into Fedora.
You were mentioning ZFS, not OpenZFS. However, it's still the same problem. OpenZFS is CDDL which won't be accepted. The only way that can be changed is if Oracle does something. And as long as OpenZFS is an out-of-tree module, it won't be in Fedora.
On Monday, June 29, 2020 12:32:56 AM MST Samuel Sieb wrote:
You were mentioning ZFS, not OpenZFS. However, it's still the same problem. OpenZFS is CDDL which won't be accepted. The only way that can be changed is if Oracle does something. And as long as OpenZFS is an out-of-tree module, it won't be in Fedora.
ZFS, in terms of Linux support, is generally OpenZFS. You will note that Mark also simply said "ZFS". Yes, OpenZFS is under CDDL. That's not really a problem. See https://www.softwarefreedom.org/resources/2016/linux-kernel-cddl.html. Ubuntu's solution wouldn't work for us, and it is a GPL violation (https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/), but it's also not necessary. The package for OpenZFS could be provided as a kmod package instead, which would *not* be a GPL violation.
I don't understand the attitude against this particular out-of-tree module, as it's readily available for every kernel within days of release. The longest lulls have been around holidays, where it took up to 5 days to get support for the latest stable kernel.
On Mon, 2020-06-29 at 00:37 -0700, John M. Harris Jr wrote:
ZFS, in terms of Linux support, is generally OpenZFS. You will note that Mark also simply said "ZFS". Yes, OpenZFS is under CDDL. That's not really a problem. See https://www.softwarefreedom.org/resources/2016/linux-kernel-cddl.html. Ubuntu's solution wouldn't work for us, and it is a GPL violation (https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/), but it's also not necessary. The package for OpenZFS could be provided as a kmod package instead, which would *not* be a GPL violation.
I don't understand the attitude against this particular out-of-tree module, as it's readily available for every kernel within days of release. The longest lulls have been around holidays, where it took up to 5 days to get support for the latest stable kernel.
First of all, Fedora does not package only the latest stable kernel; it builds the kernel from git in Rawhide almost daily. Secondly, kmods are not allowed in Fedora.
-- John M. Harris, Jr.
devel mailing list -- devel@lists.fedoraproject.org To unsubscribe send an email to devel-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
-- Igor Raits ignatenkobrain@fedoraproject.org
On Monday, June 29, 2020 12:54:02 AM MST Igor Raits wrote:
First of all, Fedora does not package only the latest stable kernel; it builds the kernel from git in Rawhide almost daily. Secondly, kmods are not allowed in Fedora.
The Times They Are a-Changin'. It wouldn't be the first radical change in Fedora recently.
I don't see how building the kernel daily would be an issue here. Yes, it wouldn't work against some of them once every few months, and then it'd be fixed within a week. An exception could be made for this particular kmod, and it'd be well worth it for our users.
On Mon, Jun 29, 2020 at 4:16 AM John M. Harris Jr johnmh@splentity.com wrote:
The Times They Are a-Changin'. It wouldn't be the first radical change in Fedora recently.
I don't see how building the kernel daily would be an issue here. Yes, it wouldn't work against some of them once every few months, and then it'd be fixed within a week. An exception could be made for this particular kmod, and it'd be well worth it for our users.
It is not acceptable that there is a range of time that people would literally not be able to mount their file systems because the kernel module would not build.
Fedora does not allow out of tree kernel modules to be packaged for the distribution. This has been the case since Fedora 7. The only way OpenZFS will become available in Fedora itself is if it gets a FUSE backend[1]. If that happens, I will happily package OpenZFS myself and replace zfs-fuse with it. Contrary to what you might think, I am extremely familiar with OpenZFS and I work with them quite often. The members of the OpenZFS Project know me and I know them. I've worked with them on various things for *years*.
That does not change the fact that OpenZFS is a very *special* out of tree kernel module that would put a major crimp in doing a lot of things Fedora does now, like testing and validating snapshots of the Linux kernel as it is being developed. Fedora is a place where we actively work with our upstreams, and we stay close to those projects as part of maintaining software for them. Having kzfs in Fedora would strain that immensely.
Unless you're willing to put in the effort to make an OpenZFS FUSE backend, stop bringing it up. It's not going to happen.
[1]: https://github.com/openzfs/zfs/issues/8
It is not acceptable that there is a range of time that people would literally not be able to mount their file systems because the kernel module would not build.
I would say that is a rather unlikely scenario to happen given how engaged the OpenZFS developers are in maintaining Linux kernel support, and also considering how many kernel developers there are that run Fedora. The time delay is more with respect to OpenZFS releases rather than having patches available that make OpenZFS work with the Linux kernel.
Fedora does not allow out of tree kernel modules to be packaged for the distribution. This has been the case since Fedora 7.
That is a strong argument, but obviously more a political than a technical one.
That does not change the fact that OpenZFS is a very *special* out of tree kernel module that would put a major crimp in doing a lot of things Fedora does now, like testing and validating snapshots of the Linux kernel as it is being developed. Fedora is a place where we actively work with our upstreams, and we stay close to those projects as part of maintaining software for them. Having kzfs in Fedora would strain that immensely.
Well, Fedora could become the platform where OpenZFS developers work closely with kernel developers. :)
All that said, I very well understand the hesitations of Fedora, and upstream kernel, developers to accommodate ZFS. I actually agree that in the current situation with licenses being what they are, and thus ZFS being an out-of-tree filesystem, it would not be wise to have ZFS as the default root file system in Fedora.
I personally have my /home filesystem on ZFS, and keep the root filesystem on an ext4 partition, as I am confident that I can reinstall Fedora in a reasonable amount of time, but I care about the data in my home/working directories and value immensely ZFS features with respect to data integrity and backups.
Regarding the current proposal at hand, i.e. making btrfs the default filesystem, I am actually in favour of that change. The next-generation filesystems (i.e. btrfs and ZFS) have many desirable features ([1] lists a number of them, and that article is already quite old), and it's about time to switch the desktop system to these filesystems as well, IMHO.
Just my two cents.
-Armin
[1] https://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cow...
On 29 June 2020 17:36:15 CEST, Armin Wehrfritz dkxls23@gmail.com wrote:
All that said, I very well understand the hesitations of Fedora, and upstream kernel, developers to accommodate ZFS. I actually agree that in the current situation with licenses being what they are, and thus ZFS being an out-of-tree filesystem, it would not be wise to have ZFS as the default root file system in Fedora.
For me, licensing is the big issue with ZFS. Or rather, the licensing issue is so big for me that I haven't considered the technical merits of ZFS for many years. If a way could be found, ZFS could be an option, but I would be opposed to having it as the default because of the licensing issues.
I understand that not everyone will agree and that this discussion has gone off on a tangent. I just needed to write this for some reason.
M
Once upon a time, John M. Harris Jr johnmh@splentity.com said:
For the best filesystem ever created, ZFS
My experiences with ZFS are less than impressive, definitely not "the best ever". Too many fiddly things, and questions where the answer is "back up and restore".
On 6/29/20 1:31 AM, Mark Otaris wrote:
The master branch for cp now defaults to copy-on-write on filesystems that support reflinks, which should make copies more efficient if Fedora starts using btrfs: https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=25725f9d41735d176....
Dolphin and KIO also seem like they will start doing this: https://bugs.kde.org/show_bug.cgi?id=326880, https://invent.kde.org/frameworks/kio/commit/c2faaae697f11ee600989b67b440698....
Beyond these recent changes, there are many other reasons to use btrfs, such as that Podman has a btrfs driver that might make containers more efficient, that ostree makes limited use of reflinks when they are available, that many filesystem options can be changed and new features and better defaults used even after the filesystem was initially created, that resize operations can be done online, and that there are uniform checksums on all metadata blocks, giving guarantees against corruption.
XFS also has reflinks, but lacks many features of btrfs, and switching from ext4 to XFS would mean losing cgroup writeback.
That's incorrect:
commit adfb5fb46af059387eca0fce1d8cd8733f9ee3a0
Author: Christoph Hellwig hch@lst.de
Date: Fri Jun 28 19:30:22 2019 -0700
xfs: implement cgroup aware writeback
Link every newly allocated writeback bio to cgroup pointed to by the writeback control structure, and charge every byte written back to it.
Tested-by: Stefan Priebe - Profihost AG s.priebe@profihost.ag
Signed-off-by: Christoph Hellwig hch@lst.de
Reviewed-by: Darrick J. Wong darrick.wong@oracle.com
Signed-off-by: Darrick J. Wong darrick.wong@oracle.com
-Eric
That’s one fewer reason not to use XFS then. It seems Documentation/admin-guide/cgroup-v2.rst was not updated and still says only ext2, ext4, and btrfs have writeback implemented.
On 6/29/20 12:44 PM, Mark Otaris wrote:
That’s one fewer reason not to use XFS then. It seems Documentation/admin-guide/cgroup-v2.rst was not updated and still says only ext2, ext4, and btrfs have writeback implemented.
Interesting, thanks for the heads up - I'll get that fixed.
Looks like f2fs also supports it; the dangers of listing details like that in a doc file I guess :(
-Eric
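For readers unfamiliar with what a reflink copy involves, here is a minimal Python sketch of the idea behind the cp change mentioned above: ask the kernel to share extents through the FICLONE ioctl, and fall back to an ordinary copy when the filesystem refuses. The function name `copy_with_reflink` is made up for illustration, and the `FICLONE` constant is the value that linux/fs.h yields on common architectures such as x86-64 and aarch64. This is only a sketch of the technique, not how coreutils or KIO implement it.

```python
import fcntl
import shutil

# Assumption: FICLONE as encoded on x86-64/aarch64, i.e. _IOW(0x94, 9, int).
FICLONE = 0x40049409


def copy_with_reflink(src: str, dst: str) -> bool:
    """Copy src to dst, sharing extents when the filesystem allows it.

    Returns True if the clone succeeded, False if a regular copy was used.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to make dst reference the same extents as src.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # Cloning is unsupported here (for example ext4 or tmpfs, or the
            # two files live on different filesystems): copy the bytes instead.
            shutil.copyfileobj(fsrc, fdst)
            return False
```

On btrfs, or on XFS created with reflink support, the call should return True and the copy costs almost no extra space until one of the files is modified; elsewhere it quietly degrades to a normal copy.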
On Mon, Jun 29, 2020, at 6:43 PM, Markus S. wrote:
Why not Stratis?
Stratis cannot be used to build the root filesystem. (It's been answered elsewhere in the thread.)
V/r, James Cassell
On Mon, 2020-06-29 at 18:51 -0400, James Cassell wrote:
Stratis cannot be used to build the root filesystem. (It's been answered elsewhere in the thread.)
Are we sure? https://github.com/stratis-storage/stratisd/issues/635 While it might not be super there yet, it seems it is technically working (I may be wrong, I have done 0 tests). But given how new that is and that the tooling around it isn't there, it is pretty far from being a viable default.
Hi,
On 29/06/2020 23:57, Markus Larsson wrote:
Are we sure? https://github.com/stratis-storage/stratisd/issues/635 While it might not be super there yet, it seems it is technically working (I may be wrong, I have done 0 tests). But given how new that is and that the tooling around it isn't there, it is pretty far from being a viable default.
It is perhaps also worth mentioning, since I've not seen it elsewhere in this thread, that Stratis is part of the (larger) Project Springfield. This is aimed at improving the overall storage/fs management experience, and there are a number of parts of that landing in various places at the moment. There is more to come, of course, but the overall aim is improved user experience for whatever combination of fs/block devices are in use,
Steve.
On Tue, Jun 30, 2020 at 5:42 AM Steven Whitehouse swhiteho@redhat.com wrote:
Hi,
It is perhaps also worth mentioning, since I've not seen it elsewhere in this thread, that Stratis is part of the (larger) Project Springfield. This is aimed at improving the overall storage/fs management experience, and there are a number of parts of that landing in various places at the moment. There is more to come, of course, but the overall aim is improved user experience for whatever combination of fs/block devices are in use,
This is the first time I've ever heard that codename, and you should really change it, because that name is already used for cloud-based security fuzzing from Microsoft Research. It's a great idea, though!
Improving the UX of storage management is generally a good thing, in my view. Btrfs provides significant improvements in this regard, but there can be even more. Tools like SSM[1] were great attempts at making the LVM experience not suck. Cockpit does a good job of making handling storage management a lot more approachable, too.
I'd be curious if you are only thinking of server cases, or if desktop cases are also being considered. Historically, projects like these from Red Hat are largely only for the server...
[1]: https://github.com/system-storage-manager/ssm
-- 真実はいつも一つ!/ Always, there's only one truth!
Hi,
On 30/06/2020 13:58, Neal Gompa wrote:
This is the first time I've ever heard that codename, and you should really change it, because that name is already used for cloud-based security fuzzing from Microsoft Research. It's a great idea, though!
Improving the UX of storage management is generally a good thing, in my view. Btrfs provides significant improvements in this regard, but there can be even more. Tools like SSM[1] were great attempts at making the LVM experience not suck. Cockpit does a good job of making handling storage management a lot more approachable, too.
I'd be curious if you are only thinking of server cases, or if desktop cases are also being considered. Historically, projects like these from Red Hat are largely only for the server...
So yes, SSM has been subsumed into Springfield too. There was a long debate over the project name, but nobody came up with anything better, so it has stuck...
There are a lot of things going on, although few of them have actually been labelled with Springfield, so it is perhaps not too surprising that the name is not so well known. There has been a new mount API upstream, for example, which is part of that, as are the fs notifications (of which the notifications core was merged in the most recent merge window, but the mount notifications and fsinfo syscall are still forthcoming).
There has also been work on PCP, to ensure that we have good metrics for a wide variety of filesystems, and there is a dashboard for GFS2 in Cockpit as part of that work. Cockpit is one of the important consumers of the APIs that fall under the Springfield umbrella.
There is libmount (which will get an update to take advantage of the kernel changes mentioned above) as well as udisks2, libstoragemgmt and blivet. The overall aim here is not to focus on one specific tool, but instead to look at the overall stack and figure out how to make the components work better with each other to provide a better user experience.
I know it has been rather confined to Red Hat internally, however that was not the intention, and in fact I would like to strongly encourage community involvement. There is an upstream mailing list, which currently has almost no traffic: springfield@sourceware.org so please do join and ask questions, if anybody is interested in finding out more.
There is no Springfield codebase as such - it is an umbrella project that involves a number of subprojects. Also, the reason that it is interesting is that the intent is to look at both the kernel and userspace parts of managing storage and filesystems and to improve the whole stack, rather than looking at small pieces in isolation. Our aim is to encourage discussion and cooperation between the individual subprojects.
To answer the earlier question, yes, this is intended for both workstation and server use cases. That is perhaps getting a bit off topic here, but hopefully it will help to clear up any confusion about what Springfield is/does,
Steve.
On Tue, Jun 30, 2020, at 10:18 AM, Steven Whitehouse wrote:
I know it has been rather confined to Red Hat internally, however that was not the intention, and in fact I would like to strongly encourage community involvement. There is an upstream mailing list, which currently has almost no traffic: springfield@sourceware.org so please do join and ask questions, if anybody is interested in finding out more.
Indeed, this is the first I've heard of "Project Springfield" -- it looks like the list only had a couple of messages at its start in 2018 and nothing since.
https://springfield-project.github.io/
The "subscribe" link is broken... it should probably point to https://sourceware.org/mailman/listinfo/springfield
I'd send a pull request, but I couldn't find the github repo associated with the page.
V/r, James Cassell
----- Original Message -----
From: "James Cassell" fedoraproject@cyberpear.com
To: "Fedora Development List" devel@lists.fedoraproject.org
Sent: Tuesday, June 30, 2020 6:08:30 PM
Subject: Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants
Indeed, this is the first I've heard of "Project Springfield" -- it looks like the list only had a couple of messages at its start in 2018 and nothing since.
https://springfield-project.github.io/
The "subscribe" link is broken... it should probably point to https://sourceware.org/mailman/listinfo/springfield
I'd send a pull request, but I couldn't find the github repo associated with the page.
https://github.com/springfield-project/springfield-project.github.io/blob/ma...
https://<name>.github.io/ -- you'll almost always find it at https://github.com/<name>/<name>.github.io, unless it is a repo-specific GH Pages site.
Then it'd be:
https://<name>.github.io/<project>/ -- which you'll find somewhere in the https://github.com/<name>/<project> repo (either docs/ folder in the default branch or gh-pages branch).
https://help.github.com/en/github/working-with-github-pages/about-github-pag...
- Alex
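The rule of thumb Alex describes can be written down as a small helper. The following Python sketch takes a github.io URL and lists the repositories where the page source is most likely to live; the function name `pages_repo_candidates` is invented for this illustration, and the result is only a guess based on the naming convention, since the code never contacts GitHub.

```python
from urllib.parse import urlparse


def pages_repo_candidates(pages_url: str) -> list[str]:
    """Return GitHub repositories that may hold the source of a github.io page."""
    parsed = urlparse(pages_url)
    host = parsed.hostname or ""
    if not host.endswith(".github.io"):
        raise ValueError(f"not a github.io URL: {pages_url}")
    name = host[: -len(".github.io")]
    segments = [s for s in parsed.path.split("/") if s]

    candidates = []
    if segments:
        # Project site: https://<name>.github.io/<project>/ -> <name>/<project>
        candidates.append(f"https://github.com/{name}/{segments[0]}")
    # User or organization site: <name>/<name>.github.io
    candidates.append(f"https://github.com/{name}/{name}.github.io")
    return candidates
```

For https://springfield-project.github.io/ this yields a single candidate, the springfield-project/springfield-project.github.io repository. When the URL has a path, both candidates are returned because the first path segment may be either a project repository or just a directory inside the user site.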
On Mon, Jun 29, 2020, at 6:57 PM, Markus Larsson wrote:
Are we sure? https://github.com/stratis-storage/stratisd/issues/635 While it might not be super there yet, it seems it is technically working (I may be wrong, I have done 0 tests). But given how new that is and that the tooling around it isn't there, it is pretty far from being a viable default.
Indeed, a comment there says, "Yes this should work for root fs. No write up has been done other than what is mentioned in this issue already."
So the stratis folks say it's possible. I'd say it's safe to say there's no Anaconda installer support for it (yet?), though... Once upon a time, I opened a Red Hat case asking how to use stratis for boot and they provided nothing useful.
V/r, James Cassell
Stratis allows users to manage XFS filesystems using a pool of block devices. Features include thin provisioning, snapshots, and encryption. Stratis does come under the Springfield umbrella. We are an active team and happy to respond on our public mailing list, stratis-devel@lists.fedorahosted.org, if anyone has questions. You can also visit our webpage at https://stratis-storage.github.io/ or contribute at https://github.com/stratis-storage.
We have been carefully developing Stratis with many of the mentioned features in mind. We welcome positive and constructive input and feedback.
- Dennis
So, given this already has way too many answers I didn't want to reply, but after spending ~4 hours to get my laptop back to bootable state after a btrfs-convert I guess some people might be interested.
Overall thoughts for whoever doesn't want to read the rest is: I think btrfs the FS is probably good enough, but there is a lot of work on the tooling to do as has been said multiple times in this thread.
I realise most of the points I make won't impact a new system but it's just illustrating rough edges, and for a single day experience that is probably too many -- well I've converted now, hopefully will be happy ever after? ;)
Recap of the problems I ran into:
- bug in btrfs-convert where it just aborts in the middle with an unintelligible message if the ext4 fs had some fscrypt files; the conversion fails with ENOTSUP, but the error is not printed properly because it overwrites the progress line, and there is no message saying what failed. That was a good first impression... I've sent a patch to add the inode number that causes the problem as well as repeating the error code. It's a first step; another fscrypt-specific message might be worth doing later, and eventually I'll report that separately. I removed these files.
- second bug in btrfs-convert: running scrub immediately after converting reported checksum errors. I had copied the whole disk over to debug the previous problem, so I could reproduce this multiple times; it's not a transient hardware error, as I got this on different machines on the same files. This only impacted a single file that I can recover, but it's still annoying. Anyway, I'm in the process of getting a bit more detail before reporting this upstream as well. Bugs happen, I was told btrfs-convert isn't used all that much, and I didn't have more problems with the conversion itself, so it could have been worse.
- after doing that conversion from initrd and rebooting, all services failed to start. I intuited that to be selinux problems and triggered a relabel that fixed everything, but it would be confusing to most people; I couldn't get to a shell because of the next point so not sure if X would have worked but the boot scrollup was scary (another user removing bootsplash present! :P)
- I have (had) a working kexec-kdump setup and usually set the sysctl kernel.panic_on_warn to 1... That also wasn't great because since someone suggested using flushoncommit here I went for it (sounds better than chattr +C to me?), but machine crashing after 30s wasn't fun, especially since kexec hadn't been reconfigured for btrfs yet so I didn't have time to read the warning. I think that flushoncommit should not be suggested before that warning gets removed officially, even if it is harmless for most people, it is really quite verbose and will hide any real problem that could happen.
Quite a shame, though; that really looks like a good idea, but this warning has been around since 2018 at least so I'm dubious
- after regenerating initrd for some reason it generated an empty crypttab and stopped prompting to decrypt the fs on reboot; there is some autodetection logic in /usr/lib/dracut/modules.d/90crypt/module-setup.sh that apparently worked before and doesn't anymore? There are multiple workarounds like adding rd.luks.name parameter or adding ',force' to crypttab options but this wasn't pleasant either
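The flushoncommit problem in the list above comes from a combination of two settings: a btrfs mount using flushoncommit, and kernel.panic_on_warn set to 1, which turns the warning into a crash. As a rough illustration, the Python sketch below detects that combination from the text of /proc/mounts and /proc/sys/kernel/panic_on_warn. The function names are hypothetical, and the assumption that flushoncommit shows up verbatim in the mount options column should be checked against the running kernel.

```python
def btrfs_mounts_with_option(mounts_text: str, option: str) -> list[str]:
    """Return mount points of btrfs filesystems whose options include `option`."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "btrfs":
            if option in fields[3].split(","):
                hits.append(fields[1])
    return hits


def risky_combination(mounts_text: str, panic_on_warn: str) -> bool:
    """True when panic_on_warn is enabled and a btrfs mount uses flushoncommit."""
    flush_mounts = btrfs_mounts_with_option(mounts_text, "flushoncommit")
    return panic_on_warn.strip() == "1" and bool(flush_mounts)
```

Passing in the contents of the two /proc files gives a quick yes/no answer before rebooting into a configuration like the one described; the functions take plain strings so they can also be tried on saved copies of those files.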
raid1 automount / proper warning also seems important to me but also has been properly ack'd so little point in repeating.
raid5 would be nice eventually, but the recap from Zygo on btrfs list is rather clear, I'll stay away from it a while longer even if most problems are workaroundable.
I think that's about it. The points about VMs / DBs needing either chattr +C or removing all fsync and using flushoncommit have been taken properly, so I'm not too worried about that one, and the rest looks like it will work fine to me. Compression in particular is a very noticeable gain for my local storage (mostly sources and git trees), so I had been wanting to try it; this thread gave me the push...
(from a quick compsize on fs root:
 Type       Perc     Disk Usage   Uncompressed Referenced
 TOTAL       66%      121G         183G         192G
 none       100%       87G          87G          88G
 zlib        46%      1.8K         4.0K         4.0K
 zstd        35%       33G          95G         104G
I need to check what's uncompressible but probably git objects themselves, and a few VM images it looks like; still more than decent)
Good work all,
On Mon, Jun 29, 2020 at 5:17 PM Dominique Martinet asmadeus@codewreck.org wrote:
Recap of the problems I ran into:
- bug in btrfs-convert where it just aborts in the middle with an
...
- second bug in btrfs-convert, running scrub immediately after
...
My view is that btrfs-convert is something of a proof of concept. Yes it should fail gracefully if it's going to fail, and the rollback should work short of having made certain modifications (listed in the documentation) post-install. And maybe there'd be interest in the Fedora community at some point down the road doing a test day or test week, to gather a lot of good data on converting ext4 to btrfs. I don't know but my suspicion is that as any file system ages, it's becoming increasingly non-deterministic in its layout, and that might affect the conversion success rate. New file systems seem to convert without problems, and sometimes older ones don't (and by older I mean 1-2 years.)
- I have (had) a working kexec-kdump setup and usually set the sysctl
kernel.panic_on_warn to 1... That also wasn't great because since someone suggested using flushoncommit here I went for it (sounds better than chattr +C to me?), but machine crashing after 30s wasn't fun,
They're noisy WARN_ON messages, but not crashes. And it's my mistake to even mention it. Fedora folks won't be asked or recommended to use that.
I think that's about it, points about VMs / DB needing either chattr +C
That's what will happen, and it'll be set for the user. Users aren't expected to know these things.
Compression in particular is a very noticeable gain for my local storage (mostly sources and git trees) so I had been wanting to try, this thread gave me the push...
(from a quick compsize on fs root:
 Type       Perc     Disk Usage   Uncompressed Referenced
 TOTAL       66%      121G         183G         192G
 none       100%       87G          87G          88G
 zlib        46%      1.8K         4.0K         4.0K
 zstd        35%       33G          95G         104G
I need to check what's uncompressible but probably git objects themselves, and a few VM images it looks like; still more than decent)
chattr +c by default uses zlib. It's possible to specify zstd using 'btrfs property' - but again this too will be done for the user on clean installs. And it will be limited to select locations. Possibly down the road there can be desktop integration so folks can choose specific home directories.
I use the -o compress=zstd:1 mount option instead to compress everywhere, but some sporadic benchmarks on one older machine suggest a small write time performance decrease, with no change in read performance. man 5 btrfs has more info.
Thanks for the reply.
It feels a bit dismissive after the time I spent there, so I'll assume I wasn't clear and my point didn't get across (I did send that mail at 1AM and haven't slept much lately...): I'm not complaining about any particular bug here, just that the overall use & feel is way too rough.
I think what we (fedora) need right now is more people using it. The thread is full of "anecdotal evidence" one way or another, and mine isn't anything more or less than that; we should first encourage people to use it more and polish tools/documentation, then actually make it the default. (If this can happen in time for f33 that would be great, but I'm not optimistic at this point.)
To give one more example I've remembered now, `btrfs scrub` will only report the first file that is corrupted for a given extent. Btrfs being CoW, it is possible (and was my case yesterday) that some of the extents belong to multiple files, and there is no easy way to report all the files involved: the btrfs scrub status command should have an option to do that, really. Kernel messages can be throttled, so if a line is missing you'll miss a corrupted extent, and parsing dmesg to use `btrfs ins logical-resolve` is far from obvious to a new user.
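The manual step described here (pulling logical addresses out of the kernel log and passing them to `btrfs inspect-internal logical-resolve`) could be sketched roughly as follows. This is only an illustration: the shape of the log lines in `SAMPLE_LOG` and the regular expression are assumptions to be checked against real dmesg output, the helper names are hypothetical, and the script only prints the commands rather than running them.

```python
import re
import shlex

# Assumed shape of a scrub checksum-error line; verify against your own kernel log.
LOGICAL_RE = re.compile(r"BTRFS (?:warning|error) \(device [^)]+\): .*?\blogical (\d+)\b")

# Made-up sample data, not output from a real system.
SAMPLE_LOG = """\
[  100.0] BTRFS warning (device sda2): checksum error at logical 12345678 on dev /dev/sda2, physical 2345678, root 5, inode 257, offset 0, length 4096, links 1 (path: home/user/a)
[  100.1] BTRFS warning (device sda2): checksum error at logical 12345678 on dev /dev/sda2, physical 2345678, root 5, inode 257, offset 0, length 4096, links 1 (path: home/user/a)
[  100.2] usb 1-1: new high-speed USB device number 2
[  100.3] BTRFS warning (device sda2): checksum error at logical 99999744 on dev /dev/sda2, physical 8888832, root 5, inode 300, offset 0, length 4096, links 1 (path: home/user/b)
"""


def logical_addresses(log_text):
    """Return each distinct logical address found in the log, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = LOGICAL_RE.search(line)
        if match:
            address = int(match.group(1))
            if address not in seen:
                seen.append(address)
    return seen


def resolve_commands(log_text, mountpoint):
    """Build one logical-resolve command line per distinct address."""
    return [
        f"btrfs inspect-internal logical-resolve {address} {shlex.quote(mountpoint)}"
        for address in logical_addresses(log_text)
    ]


if __name__ == "__main__":
    for command in resolve_commands(SAMPLE_LOG, "/"):
        print(command)
```

Each printed command, run as root, lists every path that references the extent at that address. Lines dropped by kernel log throttling are still lost, so this does not remove the limitation.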
I also feel the mix of "this command runs in foreground" (e.g. defrag, some variants of balance? not clear to me) and "this command starts a background thread" (e.g. scrub unless -B given) is a bit messy and confusing.
Yes we don't want users to actually run these manually, so we need things that need to run to automagically start in background and some nice gnome popup or whatever to notify of any problem instead; but that isn't here yet either.
Chris Murphy wrote on Mon, Jun 29, 2020:
Recap of the problems I ran into:
- bug in btrfs-convert where it just aborts in the middle with an
...
- second bug in btrfs-convert, running scrub immediately after
...
My view is that btrfs-convert is something of a proof of concept. Yes it should fail gracefully if it's going to fail, and the rollback should work short of having made certain modifications (listed in the documentation) post-install. And maybe there'd be interest in the Fedora community at some point down the road doing a test day or test week, to gather a lot of good data on converting ext4 to btrfs. I don't know but my suspicion is that as any file system ages, it's becoming increasingly non-deterministic in its layout, and that might affect the conversion success rate. New file systems seem to convert without problems, and sometimes older ones don't (and by older I mean 1-2 years.)
Yes, btrfs-convert isn't a battle-hardened tool; I'm not judging btrfs based on just this. Honestly, out of the 4 hours I spent last night; btrfs-convert wasn't even included there. I had failed first then prepared on another copy and things worked rather well overall -- it could get better error messages, a big warning in man page perhaps, but overall I've saved more time with btrfs-convert than I would have spent trying to juggle resizing partition multiple times to copy data over.
Once again: not complaining about the result, my point isn't about the bugs I reported but about documentation.
- I have (had) a working kexec-kdump setup and usually set the sysctl
kernel.panic_on_warn to 1... That also wasn't great because since someone suggested using flushoncommit here I went for it (sounds better than chattr +C to me?), but machine crashing after 30s wasn't fun,
They're noisy WARN_ON messages, but not crashes.
bugs eat data. seriously. I have panic_on_warn on all my systems, so a warn IS a crash for me. Because of the switch, kexec wasn't reconfigured yet (it needs more than 30s to rebuild the initrd) and this was really annoying; I had to take a video of the screen to read it, which is just about as bad as things can get...
With my kernel developer hat on, I'll also say warnings should never be ignored: even if you're smart enough to decide this one is benign, you'll miss other bug messages if you let this one happen all the time.
And it's my mistake to even mention it. Fedora folks won't be asked or recommended to use that.
It's not just you. I've seen it recommended by Zygo on IRC as well just the other day. It's not broadly advertised, but it is a good feature that really makes sense for some workloads, so people will try to use it. What's frustrating is that it's been around since 4.15 and nobody seemed to care: either it's really harmless in the way btrfs uses it and it should be quietened down (Zygo said he patches his kernels to remove the message), or it's not and it should be fixed.
For reference I currently am using:
- autodefrag, because I read it helps with small dbs e.g. firefox sqlite databases and things like that and wanted to see the impact it has in the background
- compress=zstd (not sure about your :1, I don't think the cpu usage difference will be that big; it's mostly increasing write latency as you said so shouldn't change much)
- discard=async as recommended here and "will-soon-be-default"
- ssd (already default based on rotational sysfs)
- space_cache=v2 (soon-to-be-default)
Once again, these things are hard for users to decide themselves, so we should think about fedora defaults, but I think they're mostly well documented at least.
I think that's about it, points about VMs / DB needing either chattr +C
That's what will happen, and it'll be set for the user. Users aren't expected to know these things.
It still needs to be readily available information for "semi-advanced" users: these files won't get checksums at least.
Compression in particular is a very noticeable gain for my local storage (mostly sources and git trees) so I had been wanting to try, this thread gave me the push...
(from a quick compsize on fs root:
 Type       Perc     Disk Usage   Uncompressed Referenced
 TOTAL       66%      121G         183G         192G
 none       100%       87G          87G          88G
 zlib        46%      1.8K         4.0K         4.0K
 zstd        35%       33G          95G         104G
I need to check what's uncompressible but probably git objects themselves, and a few VM images it looks like; still more than decent)
chattr +c by default uses zlib. It's possible to specify zstd using 'btrfs property' - but again this too will be done for the user on clean installs. And it will be limited to select locations. Possibly down the road there can be desktop integration so folks can choose specific home directories.
I use the -o compress=zstd:1 mount option instead to compress everywhere, but some sporadic benchmarks on one older machine suggest a small write time performance decrease, with no change in read performance. man 5 btrfs has more info.
A mount option seems simpler to me for clean installs: btrfs is smart enough to not compress if it detects the first few blocks aren't compressible, and force compress definitely isn't something users should need to care about.
Thanks,
While this isn't a problem for Fedora per se, it's worth noting.
OpenSUSE uses btrfs by default and as a result we're unable to open SUSE guest images from distros that don't include btrfs in the kernel (ie. RHEL, and maybe CentOS unless you use an alternate kernel). So that would apply to Fedora guests too.
This affects all sorts of things like libguestfs, virt-* tools, or even just attaching or loop-mounting a Fedora disk on a RHEL machine.
Rich.
On 26.06.2020 16:42, Ben Cotton wrote:
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
What can you say about this? https://arxiv.org/pdf/1707.08514.pdf
On Friday, July 10, 2020, Vitaly Zaitsev via devel < devel@lists.fedoraproject.org> wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
What can you say about this? https://arxiv.org/pdf/1707.08514.pdf
It doesn't use compression so not relevant to the cited statement?
-- Sincerely, Vitaly Zaitsev (vitaly@easycoding.org)
It doesn't use compression so not relevant to the cited statement?
Well the paper compares ext2, ext4, xfs, f2fs, and btrfs in terms of IO amplification and states: "In fact, in all our experiments, btrfs was an outlier, producing the highest read, write, and space amplification."
The results listed in Tables 1 and 2 show that btrfs does incur higher amounts of IO, so even with compression it's not at all obvious that this would bring btrfs down to levels comparable to (or lower than) the other file systems. Hence I believe Vitaly is linking this paper to suggest that evidence is needed before we can confidently assert that btrfs + compression is better at preserving nand than using ext4 or xfs.
BTRFS WA is ~8 times higher than ext4. Average profit from compression about 50% max. Not that hard arithmetic.
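Taking the two figures in this message at face value, the arithmetic can be written out as below. This is a sketch under strong assumptions: write amplification is normalised so that the baseline file system is 1.0, compression is assumed to shrink every write by the same factor, and the function names are made up for illustration. Whether the ~8x figure applies to a typical workload at all is disputed in the replies.

```python
def net_write_factor(relative_wa: float, compression_ratio: float) -> float:
    """Bytes written relative to the baseline file system, after compression.

    relative_wa: write amplification relative to the baseline (8.0 means 8x).
    compression_ratio: compressed size / original size (0.5 means half).
    """
    if relative_wa <= 0:
        raise ValueError("relative_wa must be positive")
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    return relative_wa * compression_ratio


def break_even_ratio(relative_wa: float) -> float:
    """Compression ratio at which the net factor drops to the baseline's 1.0."""
    if relative_wa <= 0:
        raise ValueError("relative_wa must be positive")
    return 1.0 / relative_wa


if __name__ == "__main__":
    print(net_write_factor(8.0, 0.5))  # 4.0: still 4x the baseline under these assumptions
    print(break_even_ratio(8.0))       # 0.125: data would have to shrink to 1/8 to break even
```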
Artem Tim wrote on Sat, Jul 11, 2020:
BTRFS WA is ~8 times higher than ext4. Average profit from compression about 50% max. Not that hard arithmetic.
It's not that simple. The pattern used in that paper is far from a standard workload (random writes within a file with CoW is just about as bad as things can get wrt. write amplification), so yes, things like the sqlite db firefox uses in your home will certainly be worse in that respect with btrfs, even if compressed.
But if you're talking open w/ truncate (or new file), write in a single stride, close and never write again (like what happens when you upgrade packages, compile something, download something etc etc) then the difference won't be that big.
As Chris said multiple times, it's hard to find the right way to measure impacts, and I don't have good solutions either, but this definitely isn't the kind of usage I make of my filesystem. I'd be tempted to believe the feedback from facebook on that one, even if adding snapshots into the mix it's not 100% clear if compression has much impact by itself either...
BTW, given the size gains vs. time difference for compression I would advocate for default zstd compression instead of :1 -- I'd think another 12% compression improvement[1] for almost no time difference isn't to be sneezed at?
[1] https://www.spinics.net/lists/fedora-devel/msg274978.html
On 11.07.2020 14:20, Dominique Martinet wrote:
BTW, given the size gains vs. time difference for compression I would advocate for default zstd compression instead of :1 -- I'd think another 12% compression improvement[1] for almost no time difference isn't to be sneezed at?
Now please open this file again and check Data Operations table.
Vitaly Zaitsev via devel wrote on Sun, Jul 12, 2020:
On 11.07.2020 14:20, Dominique Martinet wrote:
BTW, given the size gains vs. time difference for compression I would advocate for default zstd compression instead of :1 -- I'd think another 12% compression improvement[1] for almost no time difference isn't to be sneezed at?
Now please open this file again and check Data Operations table.
The passage you quoted has nothing to do with the paper; it is about another part of the thread.
Note I never said this paper wasn't about data, nor that btrfs is good wrt write amplification in the access pattern they use in this paper; I said you might want to think about whether a pattern with small writes and fsyncs in a CoW filesystem is a good idea and is going to be representative of a typical user's workload.
That is precisely why previously in the thread things like databases have been discussed along with the nocow chattr flag.
(Yes, that means applications need to start being conscious of what fs they are being run on, or at least the fedora configuration needs to do that check for them)
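As a rough idea of what "being conscious of the fs" could look like from an application, here is a minimal sketch that looks up the file system type for a path from a mounts table in the format of /proc/self/mounts. The function name and the sample table are hypothetical, and the sketch ignores details such as symlinks and the octal escaping of spaces in mount points; it is not how any particular application does this.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the fs type of the deepest mount point containing path, or None."""
    path = os.path.abspath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one.
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fs_type
    return best_type


# Made-up table for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda3 / ext4 rw,relatime 0 0
/dev/sda4 /home btrfs rw,relatime,compress=zstd:1 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    print(fs_type_for("/home/user/vm.qcow2", SAMPLE_MOUNTS))
    if os.path.exists("/proc/self/mounts"):
        with open("/proc/self/mounts") as mounts:
            print(fs_type_for(os.getcwd(), mounts.read()))
```

An application or packaging script could use a check like this to decide whether a nocow hint is worth setting at all, which is the kind of decision discussed in the rest of the thread.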
On Sun, Jul 12, 2020 at 7:51 AM Dominique Martinet asmadeus@codewreck.org wrote:
(Yes, that means applications need to start being conscious of what fs they are being run on, or at least the fedora configuration needs to do that check for them)
Good luck with that. It's a direct violation of the "object oriented" approach to programming, and violates the "layers of abstraction" that many programmers are taught.
(Yes, that means applications need to start being conscious of what fs they are being run on, or at least the fedora configuration needs to do that check for them)
Right, and it's concerning to me that Fedora is committing to btrfs by default before important applications have become more enlightened about running on btrfs. If upstream changes don't land in time for Fedora 33, we will be implicitly expecting users to be aware of these pitfalls and leave them to implement manual workarounds. I'd imagine a good bit of thought and work will have to go into creating, testing, and upstreaming those patches, so I think it's very possible that an appreciable number of changes will not land in time for Fedora 33.
For example with virtualization I'd think that the changes would need to happen around the level of libvirt, and not in a specific front-end like GNOME boxes or virt-manager. It's also probably not sufficient to just set nodatacow on the default VM image directory as users may use a non-default directory for qcow2 images. Hence I don't think these issues will always have trivial solutions.
On Sun, Jul 12, 2020 at 1:31 PM Tom Seewald tseewald@gmail.com wrote:
(Yes, that means applications need to start being conscious of what fs they are being run on, or at least the fedora configuration needs to do that check for them)
Right, and it's concerning to me that Fedora is committing to btrfs by default before important applications have become more enlightened about running on btrfs. If upstream changes don't land in time for Fedora 33, we will be implicitly expecting users to be aware of these pitfalls and leave them to implement manual workarounds. I'd imagine a good bit of thought and work will have to go into creating, testing, and upstreaming those patches, so I think it's very possible that an appreciable number of changes will not land in time for Fedora 33.
What changes?
For example with virtualization I'd think that the changes would need to happen around the level of libvirt, and not in a specific front-end like GNOME boxes or virt-manager. It's also probably not sufficient to just set nodatacow on the default VM image directory as users may use a non-default directory for qcow2 images. Hence I don't think these issues will always have trivial solutions.
Discussion is happening upstream to determine the best location for such optimization to happen. And there are fallback positions that are well understood. qemu-img has been able to create nodatacow qcow2 files for a long time; that is one possible option. Another is setting 'chattr +C' in a post-install script in the installer at install time. For databases, whether there's a benefit is more variable, and it's not certain if it always needs to be set, and if so where; but there is a fallback here too, which is having the RPM set it at install time. https://www.redhat.com/archives/libvir-list/2020-July/msg00450.html
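For concreteness, the two fallbacks mentioned here map onto command lines roughly like the ones this sketch builds. The helper names and the example paths are hypothetical, and the script only prints the commands; it makes no claim about what the installer, libvirt, or any RPM scriptlet actually runs.

```python
import shlex


def nocow_directory_cmd(directory):
    """argv that marks a directory No_COW; files created in it afterwards inherit the flag."""
    return ["chattr", "+C", directory]


def nocow_qcow2_cmd(image_path, size):
    """argv that creates a new qcow2 image with copy-on-write turned off (btrfs only)."""
    return ["qemu-img", "create", "-f", "qcow2", "-o", "nocow=on", image_path, size]


def as_shell(argv):
    """Render an argv list as a shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Example paths only; adjust to wherever images are actually stored.
    print(as_shell(nocow_directory_cmd("/var/lib/libvirt/images")))
    print(as_shell(nocow_qcow2_cmd("/var/lib/libvirt/images/test.qcow2", "20G")))
```

Note that the flag is only reliable for new or empty files, which is why both variants act before any data is written.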
(open)SUSE has been doing this for six years, I don't think it's correct to suggest these complexities aren't well understood, or that it's possible they won't happen for Fedora 33.
What changes?
I don't see a reason for this level of snark, in your next paragraph you described the changes I'm talking about.
Discussion is happening upstream to determine the best location for such optimization to happen.
I'm glad work is happening upstream and I hope it goes smoothly, but I don't see how there is a guarantee that everything will be done in time for Fedora 33. I'm not implying people aren't working on it, I'm suggesting that reality often doesn't go exactly as envisioned.
(open)SUSE has been doing this for six years, I don't think it's correct to suggest these complexities aren't well understood, or that it's possible they won't happen for Fedora 33.
On one hand you're saying that all complexities are well understood and that it is incorrect of me to suggest that any of those changes could miss F33's release date. Yet on the other hand you mention that openSUSE has used btrfs by default for 6 years, so then why haven't these changes landed upstream years ago? Further, you stated that for databases it's not yet clear when/if nodatacow is a performance win. This suggests that it is not outlandish to say there are remaining complexities.
On Sun, Jul 12, 2020 at 5:33 PM Tom Seewald tseewald@gmail.com wrote:
What changes?
I don't see a reason for this level of snark, in your next paragraph you described the changes I'm talking about.
Discussion is happening upstream to determine the best location for such optimization to happen.
I'm glad work is happening upstream and I hope it goes smoothly, but I don't see how there is a guarantee that everything will be done in time for Fedora 33. I'm not implying people aren't working on it, I'm suggesting that reality often doesn't go exactly as envisioned.
(open)SUSE has been doing this for six years, I don't think it's correct to suggest these complexities aren't well understood, or that it's possible they won't happen for Fedora 33.
On one hand you're saying that all complexities are well understood and that it is incorrect of me to suggest that any of those changes could miss F33's release date. Yet on the other hand you mention that openSUSE has used btrfs by default for 6 years, so then why haven't these changes landed upstream years ago? Further, you stated that for databases it's not yet clear when/if nodatacow is a performance win. This suggests that it is not outlandish to say there are remaining complexities.
They did arrive in upstream YaST, and they can go in upstream Anaconda. (open)SUSE do this in their installer. We can do the same. The question is whether it's better to do these things elsewhere.
On Sun, Jul 12, 2020 at 07:31:30PM -0000, Tom Seewald wrote:
For example with virtualization I'd think that the changes would need to happen around the level of libvirt, and not in a specific front-end like GNOME boxes or virt-manager. It's also probably not sufficient to just set nodatacow on the default VM image directory as users may use a non-default directory for qcow2 images. Hence I don't think these issues will always have trivial solutions.
The next release of libvirt (due in rawhide in ~1 week time) has improved btrfs support. It will default to always setting "nocow" for any libvirt storage pool that is created on a btrfs filesystem, with option to override if desired. This should make GNOME Boxes, virt-manager, virt-install and cockpit all "do the right thing" out of the box in most circumstances.
Regards, Daniel
On Sat, Jul 11, 2020 at 6:11 AM Artem Tim ego.cordatus@gmail.com wrote:
BTRFS WA is ~8 times higher than ext4. Average profit from compression about 50% max. Not that hard arithmetic.
The paper is with respect to metadata write amplification. This has no effect on data writes. Compression applies to data writes, not metadata. As the data amount is significantly larger than metadata (the file system itself), any reduction in data writes overwhelms the metadata writes.
On 11.07.2020 17:28, Chris Murphy wrote:
The paper is with respect to metadata write amplification. This has no effect on data writes. Compression applies to data writes, not metadata. As the data amount is significantly larger than metadata (the file system itself), any reduction in data writes overwhelms the metadata writes.
Please open this file again and check Data Operations table.
On 7/10/20 10:14 AM, Vitaly Zaitsev via devel wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
What can you say about this? https://arxiv.org/pdf/1707.08514.pdf
I would say that it illustrates the reason that compression is being proposed. What did you take away from it?
On Fri, Jul 10, 2020 at 07:14:09PM +0200, Vitaly Zaitsev via devel wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
What can you say about this? https://arxiv.org/pdf/1707.08514.pdf
Also a funny note: when compression was introduced in ZFS, circa 2007, it was mainly promoted as a _performance_ win, not a space saving measure. This was still 5 years before NVMe, so all we had was SATA, SAS and FC drives, yet the CPUs were already multi-core and multi-gigahertz. Transferring uncompressed data was _slower_ than compressing/decompressing and having to transfer less data. For a bit higher CPU usage we got noticeable bandwidth wins. The tradeoff is no longer there, as single drives reach 7GiB/s transfer speed.
On Fri, Jul 10, 2020 at 1:45 PM Tomasz Torcz tomek@pipebreaker.pl wrote:
On Fri, Jul 10, 2020 at 07:14:09PM +0200, Vitaly Zaitsev via devel wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
What can you say about this? https://arxiv.org/pdf/1707.08514.pdf
Also a funny note: when compression was introduced in ZFS, circa 2007, it was mainly promoted as a _performance_ win, not a space saving measure. This was still 5 years before NVMe, so all we had was SATA, SAS and FC drives, yet the CPUs were already multi-core and multi-gigahertz. Transferring uncompressed data was _slower_ than compressing/decompressing and having to transfer less data. For a bit higher CPU usage we got noticeable bandwidth wins. The tradeoff is no longer there, as single drives reach 7GiB/s transfer speed.
It would need to be benchmarked. The CPU in these cases has also improved dramatically, perhaps more significantly than storage performance. In which case, the compression may still not be a limiting factor. lzbench is useful for this. Compiling it on Fedora is straightforward but needs this hint or some improved understanding of the problem:
https://github.com/inikep/lzbench/issues/69
Note, you should use -b 128K since the Btrfs compress block size is 128KiB. There are a variety of corpuses available, I use silesia.tar
http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
But you can also just tar /usr or /home.
This benchmark introduces several sources of error. Btrfs compression is per file. Any files smaller than 128K tend to have a lower compression ratio, so lzbench overestimates compression in this regard; on the other hand, btrfs inline extents are possible, and in that regard the compression (or more correctly the actual cost of the write) is underestimated. Another error is single thread vs multiple thread compression, and single queue vs multi queue block device. Another error is that lzbench has essentially no latency; it's just one file being tested. In real world usage there are many files being read and written, each with latency, during which time compression can happen for essentially no additional latency cost. But not always for no cost. So it's actually really complicated, which is probably why no one really wants to do this kind of detailed benchmarking analysis. We're probably better off making a new benchmark based on ordinary things: compiling the kernel, launching applications, doing updates, git updating and git log searching, etc. But even that is just a guess.
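To get a feel for how much the 128KiB block size matters for a given set of data, a small script along these lines can compare per-chunk compression against compressing the whole input at once. It is a sketch, not a substitute for lzbench: it uses zlib from the Python standard library instead of zstd, the rule that an incompressible chunk is stored as-is is an assumption made for the illustration, and it measures ratio only, not throughput or latency. The function names and sample inputs are made up.

```python
import random
import zlib

CHUNK_SIZE = 128 * 1024  # 128 KiB, the Btrfs compression block size mentioned above


def chunked_ratio(data: bytes, chunk_size: int = CHUNK_SIZE, level: int = 3) -> float:
    """Compressed size / original size when every chunk is compressed on its own."""
    if not data:
        return 1.0
    stored = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        packed = len(zlib.compress(chunk, level))
        # Assumption for this sketch: a chunk that does not shrink is kept uncompressed.
        stored += min(len(chunk), packed)
    return stored / len(data)


def whole_ratio(data: bytes, level: int = 3) -> float:
    """Compressed size / original size when the input is compressed in one go."""
    if not data:
        return 1.0
    return min(len(data), len(zlib.compress(data, level))) / len(data)


def sample_inputs():
    """Two synthetic extremes; real runs should use a corpus such as a tar of /usr."""
    rng = random.Random(0)
    size = 4 * CHUNK_SIZE
    return {
        "repetitive text": b"fedora btrfs zstd " * 40000,
        "pseudo-random bytes": rng.getrandbits(8 * size).to_bytes(size, "little"),
    }


if __name__ == "__main__":
    for name, data in sample_inputs().items():
        print(f"{name}: per-chunk {chunked_ratio(data):.3f}, whole {whole_ratio(data):.3f}")
```

Replacing `sample_inputs()` with the bytes of a real tarball gives a rough per-block ratio for that data, subject to all the caveats listed above.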
That reminds me: a git based approach for aging a file system. https://www.usenix.org/system/files/hotstorage19-paper-conway.pdf https://github.com/saurabhkadekodi/geriatrix
I haven't messed around with that, but maybe someone wants to turn that into a how to. I'll do the testing if no one wants to burn their SSD with writes. I've got a Samsung 840 EVO on an old laptop that I'm actively trying to kill off.
Something that can't be accounted for without blind studies involving users is that users are hypersensitive to some latencies and not at all sensitive to others. I haven't dug up any research on this, but I imagine it has been studied. Apple did a bunch of UI changes early in the Mac OS X development cycle, and while overall latencies were lower as a result of having an (almost) preemptive multitasking OS instead of the former cooperative multitasking OS, the GUI had so many "eye candy" special effects that users got pissed at how slow the OS seemed.
On Fri, Jul 10, 2020 at 11:14 AM Vitaly Zaitsev via devel devel@lists.fedoraproject.org wrote:
On 26.06.2020 16:42, Ben Cotton wrote:
** transparent compression: significantly reduces write amplification, improves lifespan of storage hardware
What can you say about this? https://arxiv.org/pdf/1707.08514.pdf
The paper states its bias in the conclusion. It is a conjecture. They're trying to demonstrate using the worst case possible scenario testing of file systems in use (they do in fact behave this way) that a new file system needs to be developed, and for the use case they have in mind all of the evaluated general purpose file systems are disqualified. If you aren't looking to disqualify all general purpose file systems for your use case, this is not the paper for you.
Intentionally not explored, are various file system optimizations to mitigate this problem and real world general purpose workloads. In the case of Btrfs, those include delayed allocation, treelog, inline extents, and the default 16KiB leaf size.
The paper discounts entirely the workloads where fsync() isn't used. The paper admits this. "We should note that write amplification is high in our workloads because we do small writes followed by a fsync()." Many small file writes on a general purpose file system are quite a lot less than this, and on Btrfs many of those writes will be inline extents. i.e. they are stored inside the 16KiB leaf along with their inode entry. In the case of many recurring writes, the actual write pattern coalesces many file changes into the same leaf that's going to be written anyway. Yes, there is a big hit for that first write, but all the other writes are cheaper, maybe even free, if they happen inside the commit window. It's also a good reason to not fsync() the heck out of everything needlessly.
Finally, they are only looking at metadata writes. This is a tiny amount of writes compared to the data payload. Any compression of data will produce overwhelming reduction on net write amplification.
If we look at another paper with a different bias that's already been cited in devel@ discussions, "Evaluating File System Reliability on Solid State Drives" by Jaffer, et al - they say "Most notably Btrfs [46], a copy-on-write file system which is more suitable for SSDs with no in-place writes, has garnered wide adoption. The design of Btrfs is particularly interesting as it has fewer total writes than ext4’s journaling mechanism." How do we square this statement with the previous paper? They are looking at different workloads.