On Mon, 6 Jul 2020 at 01:19, Chris Murphy <lists(a)colorremedies.com> wrote:
> On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen <sandeen(a)redhat.com> wrote:
> >
> > On 7/3/20 1:41 PM, Chris Murphy wrote:
> > > SSDs can fail in weird ways. Some spew garbage as they're failing,
> > > some go read-only. I've seen both. I don't have stats on how common it
> > > is for an SSD to go read-only as it fails, but once it happens you
> > > cannot fsck it. It won't accept writes. If it won't mount, your only
> > > chance to recover data is some kind of offline scrape tool. And Btrfs
> > > does have a very very good scrape tool, in terms of its success rate -
> > > UX is scary. But that can and will improve.
> >
> > Ok, you and Josef have both recommended the btrfs restore ("scrape")
> > tool as a next recovery step after fsck fails, and I figured we should
> > check that out, to see if that alleviates the concerns about
> > recoverability of user data in the face of corruption.
> >
> > I also realized that mkfs of an image isn't representative of an SSD
> > system typical of Fedora laptops, so I added "-m single" to mkfs,
> > because this will be the mkfs.btrfs default on SSDs (right?). Based
> > on Josef's description of fsck's algorithm of throwing away any
> > block with a bad CRC this seemed worth testing.
> >
> > I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G
> > image, or a bit less than 1% of the filesystem blocks, at random.
> > This is 1/4 the fuzzing rate from the original test.
> >
> > So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair,
> > mount, mount w/ recovery, and then restore ("scrape") if all that
> > fails, see what we get.
>
> What's the probability of this kind of corruption occurring in the
> real world? If the probability is so low it can't practically be
> computed, how do we assess the risk? And if we can't assess risk,
> what's the basis of concern?

Aren't most disk failure tests 'huh, it somehow happened at least once,
and I think this explains all these other failures too'? I know that
with giant clusters you can do more testing, but you also have a lot of
things like:
What is the chance that a disk will die over time? 100%
What is the chance that a disk died from this particular scenario?
0.00000<maybe put a digit here>%
Reword the question slightly: what is the chance that this
disk died from that scenario? 100%.
For the HPC computers we had a score of PhD statisticians coming up with
all kinds of papers on disk failure modes which, if asked one way,
would give practically 0% odds of a given failure happening. However, all
of the disk failures had happened at least once over some time frame:
sometimes a short one, sometimes a long one, sometimes so often that
someone had to retract a paper because it was clear that while the
maths said it shouldn't happen, it did in real life. <welcome to HPC
at high altitudes: cosmic rays, low air pressure, and dry air need to
be factored in>
--
Stephen J Smoogen.