On Tue, Jan 12, 2021 at 11:05 PM Chris Murphy <lists(a)colorremedies.com> wrote:
Short version: Josef (Btrfs dev) and I agree there's probably
something wrong with the drive. The advice is to replace it, just
because trying to investigate further isn't worth your time, and also
necessarily puts your data at risk.
Longer version:
LVM thinp uses device-mapper as its backend to do all the work, and we
see checksum errors in the months old report. LVM thick, by contrast, has a
simpler on-disk format, so it's not as likely to discover such a
problem. LUKS/dm-crypt also uses device-mapper as its backend
to do all the work. So the two problems have two things in common: the
drive, and device-mapper. It's more probable there's an issue with the
drive, because if there were a problem with device-mapper, we'd
have dozens of reports of it, given that you're seeing this problem
once every couple of months (if that trend holds).
Is it possible LVM+ext4 on this same drive is more reliable? I think
that's specious. The problem can be masked by much less stringent
integrity guarantees, i.e. there are no data checksums. Data is
overwhelmingly the largest target for corruption, simply because it
occupies far more of the volume than file system metadata, so all things
being equal, there's a greater chance the problem affects data. On the other
hand, if it adversely impacts metadata, it could be true that e2fsck
has a better chance of fixing the damage than btrfsck right now. Of
course no fsck fixes your data.
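A minimal sketch of why data checksums surface a problem that an unchecksummed read would hide. This is only an illustration, not how Btrfs is implemented: the helper names (write_block, read_block, flip_bit) and the 4096-byte block size are assumptions, and zlib's plain CRC-32 stands in for whatever checksum the file system actually uses.

```python
import zlib
from typing import Tuple

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def write_block(data: bytes) -> Tuple[bytes, int]:
    """Return the block together with a checksum computed at write time."""
    return data, zlib.crc32(data)


def read_block(stored: bytes, checksum: int) -> bytes:
    """Verify the checksum on read; refuse to return corrupt data."""
    if zlib.crc32(stored) != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return stored


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate the drive silently flipping one bit (hypothetical fault)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


original = b"A" * BLOCK_SIZE
stored, checksum = write_block(original)
damaged = flip_bit(stored, 12345)

# Without a data checksum, the damaged block is simply handed back as-is.
silently_wrong = damaged != original

# With a data checksum, the same damage is reported on read.
try:
    read_block(damaged, checksum)
    detected = False
except IOError:
    detected = True
```

In both cases the data is damaged; the only difference is whether the reader is told about it.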
So if you keep using the drive, you're in a catch-22. Btrfs is more
sensitive because everything is checksummed, so the good news is
you'll be informed about it fairly quickly; the bad news is that it's
harder to repair in this case. If you revert to LVM+ext4, the automatic
fsck at startup might fix up these problems as they occur, but it's
possible undetected data corruption shows up later or replicates into
your backups.
Regardless of what you decide to do, definitely keep frequent backups
of anything important.
OK, first, I don't mean to imply that I don't believe you or Josef when
you say there is something wrong with my HDD, or that you are wrong.
But I have a lot of questions that I want to discuss:
1) Is it possible there is nothing wrong with my drive, but there is
something wrong with my BIOS/HDD firmware? Maybe my firmware cannot
meet Btrfs's stringent write requirements?
I say this because I have used Windows with NTFS on this machine,
Ubuntu with ext4, and Fedora with thick LVM and ext4. None
of these configurations gave me any such problems.
2) Since there is a high likelihood that my filesystem is not
completely fixed, when I take a backup using partclone, dd or
Clonezilla, won't those errors be carried over?
Even if I buy a new drive and restore the backup, I still might get crashes.
3) This is a weird question, but can you recommend an HDD that I can
buy which can handle Btrfs? Or even which features I should look for
when buying (an HDD, not an SSD)?
4) My manufacturer, HP, does not provide firmware updates for Linux, only
for Windows. So is there a way to update the firmware (if available)
without being on Windows? Any ideas? Would a Windows PXE help?
5) When you say "checksum errors in the months old report", which
report are you referring to? The thin-LVM crash or the smartctl crash?
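To make the concern in question 2 concrete, here is a rough sketch under stated assumptions: clone_image is a hypothetical stand-in for a block-level copy (it is not partclone, dd or Clonezilla themselves), and the byte strings are made-up sample data. It shows that a byte-for-byte copy can be perfectly faithful to the source and still not match the data as it was originally written.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare two images."""
    return hashlib.sha256(data).hexdigest()


def clone_image(source: bytes) -> bytes:
    """Hypothetical stand-in for a block-level clone: copies bytes verbatim."""
    return bytes(source)


# Data as it was originally written (made-up sample).
good = b"important data" * 100

# Data as it now sits on the suspect drive, with one byte damaged.
on_disk = bytearray(good)
on_disk[10] ^= 0xFF
source_on_disk = bytes(on_disk)

backup = clone_image(source_on_disk)

# The clone is a faithful copy of what is on the disk...
copy_is_faithful = sha256_hex(backup) == sha256_hex(source_on_disk)

# ...but it does not match what was originally written.
backup_matches_original = sha256_hex(backup) == sha256_hex(good)
```

Comparing the image against the source only proves the copy was faithful; it says nothing about whether the source itself was already damaged.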
--
Regards,
Sreyan Chakravarty