On Sun, Jun 28, 2020 at 4:54 PM Alexandre de Farias
On Sun, 2020-06-28 at 15:40 -0600, Chris Murphy wrote:
> On Sun, Jun 28, 2020 at 9:04 AM <alexandrebfarias(a)gmail.com> wrote:
> > I'm willing to perform further testing. There shouldn't be anything
> > very special about my workload. I was working mostly with NodeJS 12
> > and React Native. VS Code (I should mention I make use of TabNine,
> > which can be a huge drain on system resources). So, in a typical
> > work session I'd have android emulator open, PostgreSQL, some
> > chrome tabs, VS Code, probably Emacs, plus the React Native metro
> > server and an Express.js backend.
> Databases and VM images are things btrfs is bad at out of the box.
> Most of this has to do with fsync dependency of other file systems.
> Btrfs is equipped to deal with an fsync heavy world out of the box,
> using treelog enabled by default. But can still be slow for some
Can we do enough to make for a pleasant user experience? Are the btrfs
mitigations sufficient? Do we have good enough userspace tools to
actually take advantage of BTRFS's new features?
Yes, yes, and yes. I think your case was mostly perturbed by the SSD
running out of eb's (erase blocks), and possibly also some slowdown
from the current (original) free space cache. The discard=async option
is not the default, and it's also new as of kernel 5.6. Right now Fedora is using
fstrim.timer once per week, but for a small SSD partition with a heavy
workload it may not be frequent enough.
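
A rough way to check whether a Btrfs file system is already mounted with
discard=async is to look at its options in /proc/mounts. The Python sketch
below does that on text in the /proc/mounts format. The function name
btrfs_mounts_missing_option is made up for illustration, and it assumes
the option shows up literally as "discard=async" in the options field.

```python
def btrfs_mounts_missing_option(mounts_text, option="discard=async"):
    """Return mount points of btrfs file systems whose options lack `option`.

    `mounts_text` is expected to look like /proc/mounts:
    device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not btrfs.
        if len(fields) < 4 or fields[2] != "btrfs":
            continue
        options = fields[3].split(",")
        if option not in options:
            missing.append(fields[1])
    return missing
```

Passing it the contents of /proc/mounts should list the Btrfs mount points
that rely only on the periodic fstrim.timer.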
Also, is there any reason as to why RHEL went
with XFS as a default and Fedora stayed with ext4?
Fedora wanted shrink support.
> (a) small for the workload and (b) not getting any hints about
> freed up for it to prepare for future writes. The SSD is trying to
> erase blocks right at the moment of allocation - super slow for any
> SSD to do that.
That's a strong possibility. I did increase the partition size and
things were better for a while, after defragmenting, etc. The small
initial size for the partition was what I believe many users will do
when trying a new operating system. I wasn't even sure I wanted to jump
ship and that was the space I could initially spare. Isn't that the
case for many users?
Yep. And in no way am I suggesting you could or should have known such
an esoteric thing. I'm only stating what I think was going on and how
to mitigate it. Should SSDs be better provisioned? I can't answer
that because if I say yes I'm also saying they should cost more,
because it does in fact cost more to provision them for heavier workloads.
And your workload isn't unusual per se, but it's a fairly heavy
metadata centric workload that still beats up on the SSD in part due
to metadatacow and datacow.
And hence discard=async I think would have solved it. But it's
speculation until tested. It does fit the pattern exactly, though.
Should discard=async be the default for SSDs? Perhaps.
I mean, my Thinkpad X220 still has BTRFS on it at much worse
speeds (WD Green) and I never got to experience performance issues. But
then, I just went ahead and added nodatacow to prevent it from going
south like the other notebook.
It may have been enough in this case.
Will the average user really benefit from BTRFS? I really like the
stuff, but with so many rough edges, I find it hard to put it forward
for general use yet. I mean, of course Fedora is kind of bleeding edge,
but as of now, I believe it's hard to explain all the hoops you have to
go for a default option.
The feature owners are confident the average user will benefit now, and
in the future there will be more to come.
> 2. Mount option flushoncommit (you'll get benign, but
> WARN_ONs in dmesg)
> And use fsync = off in postgresql.conf (really everywhere you can)
> Note: if you get a crash you'll lose the last ~30s of commits, but
> database and the file system are expected to be consistent. The
> interval is configurable, defaults to 30s. I suggest leaving it there
> for testing. It is mainly a risk vs performance assessment, as to why
> you'd change it.
I tried going with a 120s commit interval. Never used flushoncommit
flushoncommit is a substitute for dropping all the application fsyncs
that you can. For most workloads, the treelog is adequate for handling
fsyncs decently. Hence it's the default.
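
For the PostgreSQL side of that test, it helps to confirm what fsync is
actually set to before and after the change. The sketch below reads
postgresql.conf-style text and returns the last uncommented fsync value.
The helper name postgres_fsync_setting is hypothetical, and it assumes the
"name = value" form of a setting; it does not follow include directives.

```python
def postgres_fsync_setting(conf_text):
    """Return the last uncommented `fsync` value found in the text,
    lowercased, or None if fsync is not set explicitly."""
    value = None
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, rest = line.partition("=")
        # Later settings override earlier ones, so keep overwriting.
        if sep and name.strip().lower() == "fsync":
            value = rest.strip().strip("'\"").lower()
    return value
```

A return value of None means the file does not set fsync explicitly, in
which case PostgreSQL uses its built-in default (on).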
Do we know all of the workloads that could be disrupted by BTRFS?
chattr +C or nodatacow is BTRFS any better than LVM or other existing
Yes and yes. Btrfs offers integrity checking, compression, and a
complete IO isolation story. And more.
It might seem weird to optimize Btrfs to turn off a feature, but I see
the granular ability to compress some things and not others, to
usually do datacow everywhere except VM images, as a feature itself.
Have all options been considered? Is BTRFS really the FS of the
Yes. And it's the fs of the present and the future. But you're asking
these questions to one of the feature owners so I might be biased.
Has it been compared with other less-known but production proven
filesystems like NILFS? F2FS?
They have fewer features we want to use overall, and also I think the
resources to support them are fewer than in the Btrfs case. Further, no
installer support whereas Anaconda has substantial Btrfs support for a
BTRFS has enough caveats that its
popularity doesn't seem a compelling enough argument to just narrow
down the field to two contenders yet.
It's not a popularity contest at all. It's about serving use cases and
solving problems. Btrfs does this now.
Consider also there is a long history between Fedora and Btrfs. It was
approved as the default file system by FESCo in 2011 (but postponed
pending work on fsck). And it was also at least an aspirational goal
to be the default file system during the significant Fedora.next
effort, by the Workstation working group, and also agreed to by FESCo.
There is a reasonable argument that these decisions have been properly
delayed until this very moment.
Also, I wouldn't find it very
interesting to tell the end-users they can't use all of their HD's
space or else their PC might be slowed down to a crawl.
That isn't what I said. I said I think it was a contributing factor to
the problem, there's more than one thing going on, and I think it
would be mitigated by using discard=async.
>  From your email, the kickstart shows
> > part btrfs.475 --fstype="btrfs" --ondisk=sda --size=93368
> 93G is likely making things worse for your SSD. It's small for this
> workload. Chances are if it were bigger, it'd cope better by
> effectively being over provisioned, and it'd more easily get erase
> blocks ready. But discard=async will mitigate this.
Are there any tests out there which could realistically mimic those
scenarios? Not just the fresh out of the box scenario, but after a few
weeks usage. I get the feeling that if you don't have enough space,
even defragmenting ends up being ineffective.
I think it's likely that the balancing and defragmenting made your
problem worse. And that's because those tasks relocate extents and
block groups thereby eating up even more of the erased blocks on the
SSD, without also giving the SSD the hint about what blocks are now
free for erasure. And this puts your SSD in the difficult position of
not knowing what erase blocks it can erase. An SSD that has to do
on-demand block erasure will be very slow.
I do not think the problem is any one single thing but a combination
of them just unluckily betrayed you: workload, SSD, and Btrfs. I think
the main optimization that would have helped is discard=async but like
I mentioned there are others that may play a role too.
Does the BTRFS defragment
command produce the same results on a full disk as it would on an
overprovisioned system? I understand it'll obviously take longer with
less space, but there's no obvious reason to me as to why the end
results should be qualitatively too disparate. (not sure if this is
actually the case, just hypothesizing based on my anecdotal evidence).
I don't think the problem you were having has anything at all to do
with fragmentation. But I also don't know what files you were
defragmenting or how many extents (fragments) they had before being
defragmented. Do you remember?
Anyway, if a proper test flow can be established for those awkward,
for one might be able to find the time to attempt provisioning a BTRFS
partition spanning the whole drive and putting those hypothetical
questions to test. At this point, with so many questions and so few
answers, I really think there needs to be some objective data to
justify this decision.
Honestly I think it's better if you set up the exact same scenario as
before. And only add discard=async and space_cache=v2 for now and see
what that does first. Then see if the workload itself needs optimizing
(don't use fsyncs for the database, use flushoncommit instead).
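
To keep the retest honest, the only thing that should change in the fstab
entry is the pair of added options. As a small sketch of that edit, the
Python below merges new options into an existing comma-separated option
string, replacing any older option with the same key. The function name
merge_mount_options and the example option string are assumptions for
illustration, not taken from your system.

```python
def merge_mount_options(existing, extra):
    """Return `existing` (comma-separated mount options) with every option
    in `extra` appended, dropping older options that share the same key
    (the text before '=')."""
    def key(opt):
        return opt.split("=", 1)[0]

    extra_keys = {key(o) for o in extra}
    # Keep existing options unless an added option overrides them.
    kept = [o for o in existing.split(",") if o and key(o) not in extra_keys]
    return ",".join(kept + list(extra))


# Hypothetical starting options for the root subvolume.
print(merge_mount_options("defaults,subvol=root",
                          ["discard=async", "space_cache=v2"]))
```

The resulting string would go in the fourth field of the file system's
line in /etc/fstab.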
Sticking with ext4 together forever just because it works would be
foolish. But that doesn't mean it's wise to just desperately commit to
There is no desperation. It's been a consideration for ~9 years in Fedora.
Wouldn't it be better to do a much smaller change like going
with XFS as a default just like RHEL and try to encourage people to
test BTRFS in the meanwhile until there are better answers?
I think it would be a valid proposal. I think XFS is a fine file
system. As an owner of the Btrfs by default proposal, I think it's the
better option. And the reasoning for that is in the proposal.
answers be provided without real-world testing? I really don't know,
but all those questions come to my mind. I've tinkered with Linux since
I was 10 years old and surely broke my system in almost every way
possible, but this experience with BTRFS really stands out.
Correlation is not causation. There's more going on in your use case
than just Btrfs, so while it's a contributing factor, there are also
other contributing factors. They all need to be taken into account to
solve an edge case like this.