I currently have a pair of external drives configured as ext4 with RAID1 using mdadm, and mainly used for backup. My / and /home filesystems are now BTRFS so I'm looking at converting the external drives to BTRFS with RAID1. My main reason is to take advantage of BTRFS checksumming as a guard against bitrot, but I'd also like the flexibility of setting up subvolumes with different properties (the disks are currently 90% empty).
Any thoughts on this? What would be the simplest conversion strategy if I go ahead?
poc
On 31/03/2021 19:49, Patrick O'Callaghan wrote:
I currently have a pair of external drives configured as ext4 with RAID1 using mdadm, and mainly used for backup. My / and /home filesystems are now BTRFS so I'm looking at converting the external drives to BTRFS with RAID1. My main reason is to take advantage of BTRFS checksumming as a guard against bitrot, but I'd also like the flexibility of setting up subvolumes with different properties (the disks are currently 90% empty).
Any thoughts on this? What would be the simplest conversion strategy if I go ahead?
Just a thought.... If the drives are 90% empty and used mainly for backup and there is sufficient space on the local storage for the remainder I'd take the path of least resistance.
I'd copy the "non-backedup" stuff to local storage and then blow the rest away and create the raid/btrfs volumes from scratch. After that is done, move the stuff back and do a backup. I'm lazy. And if possible, I'd rather not do something I'm unlikely to have to do again in the future. :-) :-)
On Wed, 2021-03-31 at 21:17 +0800, Ed Greshko wrote:
On 31/03/2021 19:49, Patrick O'Callaghan wrote:
I currently have a pair of external drives configured as ext4 with RAID1 using mdadm, and mainly used for backup. My / and /home filesystems are now BTRFS so I'm looking at converting the external drives to BTRFS with RAID1. My main reason is to take advantage of BTRFS checksumming as a guard against bitrot, but I'd also like the flexibility of setting up subvolumes with different properties (the disks are currently 90% empty).
Any thoughts on this? What would be the simplest conversion strategy if I go ahead?
Just a thought.... If the drives are 90% empty and used mainly for backup and there is sufficient space on the local storage for the remainder I'd take the path of least resistance.
I'd copy the "non-backedup" stuff to local storage and then blow the rest away and create the raid/btrfs volumes from scratch. After that is done, move the stuff back and do a backup. I'm lazy. And if possible, I'd rather not do something I'm unlikely to have to do again in the future. :-) :-)
Yes, pretty much what I had in mind.
poc
Nothing to add but the usual caveats: https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
I use udev for that instead of init scripts. Concept is the same though, you want SCT ERC time to be shorter than kernel's command timer.
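For reference, the rule I use is along these lines (a sketch from memory, untested as written; the serial number is a placeholder):

ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ENV{ID_SERIAL_SHORT}=="WD-XXXXXXXXXXXX", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"

Dropped into something like /etc/udev/rules.d/60-scterc.rules, it re-applies the 7 second limit every time that particular drive shows up.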
The btrfsmaintenance package has a scrub.timer that you can configure and enable. It's sufficient to scrub once a month or two, but doesn't hurt to do it more often.
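For a one-off run you can also just do it by hand (mount point is an example):

btrfs scrub start /mnt/backup
btrfs scrub status /mnt/backup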
There is a nagios btrfs plugin somewhere for monitoring. Otherwise something that parses 'btrfs device stats' for value changes would also work.
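A minimal sketch of that idea (mount point is an example):

btrfs device stats /mnt/backup | awk '$2 != 0 { print; bad=1 } END { exit bad }'

which prints any non-zero counter and exits non-zero, so it's easy to hang a systemd timer or cron job off it. Newer btrfs-progs also have a 'btrfs device stats --check' flag that does the exit-status part for you.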
A rather different thing about Btrfs compared to mdadm RAID is that there's no concept of faulty drives, i.e. Btrfs doesn't "kick" drives out. Since it can unambiguously know whether any block is corrupt, it keeps a pesky failing drive in, and just complains (a lot) about the bad blocks while using the good blocks.
Device replacement should use 'btrfs replace' rather than 'btrfs device add/remove'.
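i.e. something like this (device names are examples; add -r if the outgoing drive is in bad enough shape that you'd rather it not be read from):

btrfs replace start /dev/sdd /dev/sdf /mnt/backup
btrfs replace status /mnt/backup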
More info in 'man mkfs.btrfs' and 'man 5 btrfs'.
It's possible to simulate various problems with loop devices or in a VM. A fun one is setting up dm-flakey for one of the mirrors. A bit easier to chew on is compiling btrfs-corrupt-block.c from the btrfs-progs source (it is not included in the btrfs-progs package in Fedora), which has various options for corrupting data and metadata, the ability to choose which copy gets the damage, etc. It can be surprising how verbose btrfs is when just one data block is corrupt and shared among multiple snapshots: it'll tell you about every instance of the files sharing that one bad block.
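A minimal loop-device sandbox for that kind of experiment, if anyone wants to play along (paths and sizes are arbitrary):

truncate -s 10G /var/tmp/a.img /var/tmp/b.img
losetup -f --show /var/tmp/a.img    # prints e.g. /dev/loop0
losetup -f --show /var/tmp/b.img    # prints e.g. /dev/loop1
mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1
mount /dev/loop0 /mnt/test

Damage one of the backing files while the filesystem is unmounted, remount, scrub, and watch Btrfs repair from the good copy.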
-- Chris Murphy
On Wed, 2021-03-31 at 18:00 -0600, Chris Murphy wrote:
Nothing to add but the usual caveats: https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
That's pretty scary, though the drives I'm using are 1TB units scavenged from my extinct NAS so are unlikely to be SMR. They're both WD model WD10EZEX drives.
I use udev for that instead of init scripts. Concept is the same though, you want SCT ERC time to be shorter than kernel's command timer.
I've been using MD for a while and haven't seen any errors so far.
The btrfsmaintenance package has a scrub.timer that you can configure and enable. It's sufficient to scrub once a month or two, but doesn't hurt to do it more often.
Yes, I have that enabled already, running once a month.
Thanks.
poc
On Thu, Apr 1, 2021 at 5:53 AM Patrick O'Callaghan pocallaghan@gmail.com wrote:
On Wed, 2021-03-31 at 18:00 -0600, Chris Murphy wrote:
Nothing to add but the usual caveats: https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
That's pretty scary, though the drives I'm using are 1TB units scavenged from my extinct NAS so are unlikely to be SMR. They're both WD model WD10EZEX drives.
It's not an SMR concern, it's making sure the drive gives up on errors faster than the kernel tries to reset due to what it thinks is a hanging drive.
smartctl -l scterc /dev/sdX
That'll tell you the default setting. I'm pretty sure Blues come with SCT ERC disabled. Some support it. Some don't. If it's supported you'll want to set it for something like 70-100 deciseconds (the units SATA drives use for this feature).
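Setting it, where supported, looks like this (read limit then write limit, in deciseconds):

smartctl -l scterc,70,70 /dev/sdX

On most drives the setting doesn't survive a power cycle, hence doing it from udev or a boot script.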
And yeah, linux-raid@ list is chock full of such misconfigurations. It filters out all the lucky people, and the unlucky people end up on the list with a big problem which generally looks like this: one dead drive, and one of the surviving drives with one bad sector that was never fixed up through normal raid bad sector recovery mechanism, because the kernel's default is to be impatient and do a link reset on consumer drives that overthink a simple problem. Upon link reset, the entire command queue in the drive is lost, and now there's no way to know what sector it was hanging on, and no way for raid to do a fixup. The fixup mechanism is, the drive reports an uncorrectable read error with a sector address *only once it gives up*. And then the md raid (and btrfs and zfs) can go lookup that sector, find out what data is on it, go find its mirror, read the good data, and overwrite the bad sector with good data. The overwrite is what fixes the problem.
If the drive doesn't support SCT ERC, we have to get the kernel to be more patient. That's done via sysfs.
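e.g., where sdX is whatever node the drive has at the moment:

echo 180 > /sys/block/sdX/device/timeout
cat /sys/block/sdX/device/timeout

The value is in seconds, and it doesn't persist across reboots either, so it needs to be reapplied at boot somehow.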
I use udev for that instead of init scripts. Concept is the same though, you want SCT ERC time to be shorter than kernel's command timer.
I've been using MD for a while and haven't seen any errors so far.
And you may never see it. Or you may end up being an unlucky person with a raid who experiences complete loss of the array. When I say it comes up all the time on the linux-raid@ list, it's about once every couple of weeks. It's seen most often with raid5 because it has more drives, and thus more failures, than raid1 setups, and it tolerates only one failure *in a stripe*. Most everyone thinks of a failure as a complete drive failure, but drives also partially fail. The odds of two drives partially failing in the same stripe are astronomically small. But if one drive dies, and *any* of the remaining drives has a bad sector that can't be read, the entire stripe is lost. And depending on what's in that stripe, it can bring down the array.
So, what you want is for the drives to report their errors, rather than the kernel doing link resets.
On Thu, 2021-04-01 at 23:52 -0600, Chris Murphy wrote:
It's not an SMR concern, it's making sure the drive gives up on errors faster than the kernel tries to reset due to what it thinks is a hanging drive.
smartctl -l scterc /dev/sdX
That'll tell you the default setting. I'm pretty sure Blues come with SCT ERC disabled. Some support it. Some don't. If it's supported you'll want to set it for something like 70-100 deciseconds (the units SATA drives use for this feature).
One doesn't and one does:
# smartctl -l scterc /dev/sdd
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
# smartctl -l scterc /dev/sde
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read:     85 (8.5 seconds)
          Write:     85 (8.5 seconds)
So I guess the /dev/sde drive is set correctly, right? Or would you recommend disabling SCT ERC for this drive?
poc
On Fri, Apr 2, 2021 at 4:23 AM Patrick O'Callaghan pocallaghan@gmail.com wrote:
On Thu, 2021-04-01 at 23:52 -0600, Chris Murphy wrote:
It's not an SMR concern, it's making sure the drive gives up on errors faster than the kernel tries to reset due to what it thinks is a hanging drive.
smartctl -l scterc /dev/sdX
That'll tell you the default setting. I'm pretty sure Blues come with SCT ERC disabled. Some support it. Some don't. If it's supported you'll want to set it for something like 70-100 deciseconds (the units SATA drives use for this feature).
One doesn't and one does:
# smartctl -l scterc /dev/sdd
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
# smartctl -l scterc /dev/sde
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read:     85 (8.5 seconds)
          Write:     85 (8.5 seconds)
So I guess the /dev/sde drive is set correctly, right? Or would you recommend disabling SCT ERC for this drive?
Leave /dev/sde alone, 85 deciseconds is fine.
Not much can be done with /dev/sdd itself directly. But it is possible to increase the kernel's command timer for this drive. The usual way of doing this is via sysfs. I think it can be done with a udev rule as well, but I'm having a bit of a lapse on how to do it. Udev needs to identify the device by serial number or wwn, but changing the timeout via sysfs requires knowing what the /dev node is - which of course can change each time you boot or plug the device in. I don't know enough about udev. But there should be examples on the internet, or you can just fudge it with the linux-raid wiki guide.
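For what it's worth, a rough sketch of such a rule (untested; the serial number is a placeholder):

ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ENV{ID_SERIAL_SHORT}=="WD-XXXXXXXXXXXX", RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"

udev substitutes %k with the kernel name the drive happens to have on that boot, which sidesteps the problem of the /dev node moving around.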
The alternatives? Change the timeout for all /dev/ nodes. That's how things are by default on Windows and macOS: they just wait a long time before resetting a drive, giving it enough time to give up on its own. The negative side effect is you might get a long delay without errors, should the device develop marginally bad sectors.
Another alternative is to just leave it alone, and periodically check (manually or automate it somehow) for the telltale signs of bad sectors masked by SATA link resets.
Looks like this:
kernel: ata7.00: status: { DRDY }
kernel: ata7.00: failed command: READ FPDMA QUEUED
kernel: ata7.00: cmd 60/40:f0:98:d2:2b/05:00:45:00:00/40 tag 30 ncq dma 688128 in
                 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
With this interspersed occasionally:
kernel: ata7: hard resetting link
If it happens *then* you can increase the timeout manually and initiate a scrub. As long as the timeout is set high enough (most sources suggest 180 seconds, which, yes, is incredibly long), eventually the drive will give up, spit out an error, and Btrfs will fix up that sector by overwriting it with good data. It could be months, years, or never, before it happens.
I turn my scterc down as low as the drive will allow. How low I can go varies by model. I have a loop that starts at 70 and keeps going down, so each disk ends up set as low as it allows, down to a floor of 10. My WD Reds allow a minimum of 20, and I have a Seagate that allows 10.
But even set to 10 (1.0 sec), at a slow 15 ms per retry that is 66 retries reading the block; if it hasn't got it in that many tries, the drive might as well give up and let mdadm or something rewrite good data to the disk.
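Roughly along these lines (a loose sketch of the idea, not my actual script; /dev/sdX is a placeholder, and it assumes a drive that rejects a value simply keeps the previous one):

for t in 70 60 50 40 30 20 10; do
    smartctl -q errorsonly -l scterc,$t,$t /dev/sdX
done
smartctl -l scterc /dev/sdX    # read back whatever the drive actually kept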
On Fri, 2021-04-02 at 13:12 -0600, Chris Murphy wrote:
# smartctl -l scterc /dev/sdd
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
# smartctl -l scterc /dev/sde
smartctl 7.2 2021-01-17 r5171 [x86_64-linux-5.11.10-200.fc33.x86_64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read:     85 (8.5 seconds)
          Write:     85 (8.5 seconds)
So I guess the /dev/sde drive is set correctly, right? Or would you recommend disabling SCT ERC for this drive?
Leave /dev/sde alone, 85 deciseconds is fine.
OK.
Not much can be done with /dev/sdd itself directly. But it is possible to increase the kernel's command timer for this drive. The usual way of doing this is via sysfs.
Tried that and it seems to work.
I think it can be done with a udev rule as well, but I'm having a bit of a lapse on how to do it. Udev needs to identify the device by serial number or wwn, but changing the timeout via sysfs requires knowing what the /dev node is - which of course can change each time you boot or plug the device in. I don't know enough about udev. But there should be examples on the internet, or you can just fudge it with the linux-raid wiki guide.
This came up in a Google search:
https://github.com/jonathanunderwood/mdraid-safe-timeouts
It's for MD arrays, but I'm guessing it could be adapted for BTRFS as well.
poc
PS: One thing that slightly confused me in setting up the RAID array is that, unlike with MD, there's no new pseudo-device, and you mount the array by just mounting one of its components, right? (I've already done it and it works, just checking if I've missed something).
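What I do see is that both members report the same filesystem UUID, so mounting by UUID rather than by a component node looks like the tidier way to do it (UUID and mount point below are placeholders):

btrfs filesystem show /mnt/raid
blkid /dev/sdd /dev/sde
# fstab entry
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/raid  btrfs  defaults  0  0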
On Fri, 2021-04-02 at 13:12 -0600, Chris Murphy wrote:
[...] changing the timeout via sysfs requires knowing what the /dev node is - which of course can change each time you boot or plug the device in.
In fact I've already labelled the devices so identifying them can be done by filtering the output of 'lsblk'. Certainly simpler than the udev alternative.
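i.e. something like this untested sketch ("Backup" is a placeholder label, and it assumes the label shows up on the whole-disk row; the SERIAL or MODEL columns would work the same way if that's easier to match on):

for dev in $(lsblk -rno NAME,LABEL | awk '$2 == "Backup" { print $1 }'); do
    echo 180 > /sys/block/"$dev"/device/timeout
done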
poc