I've been running MythTV for about 10 years now and I've finally outgrown my media storage, currently a single 4TB disk drive. I have purchased 3 additional drives of the same model and plan to put them into a BTRFS RAID1 array.
Setting nodatacow on the media directories is a no-brainer, but what other optimizations can I do?
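(The usual way to do that per-directory is chattr +C on an empty directory so new files inherit the No_COW flag; roughly, with a made-up path:

  chattr +C /srv/mythtv/recordings      # hypothetical path; directory must be empty, flag applies to new files
  lsattr -d /srv/mythtv/recordings      # verify: the 'C' attribute should be listed
)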
Thanks, Richard
On Wed, Mar 24, 2021 at 3:47 PM Richard Shaw hobbes1069@gmail.com wrote:
I've been running MythTV for about 10 years now and I've finally outgrown my media storage, currently a single 4TB disk drive. I have purchased 3 additional drives of the same model and plan to put them into a BTRFS RAID1 array.
Setting nodatacow on the media directories is a no-brainer, but what other optimizations can I do?
nodatacow means nodatasum and no compression. If the file becomes corrupt, it can't be detected or corrected.
Due to all the firmware bugs, I tend to mix drive makes/models rather than get identical ones that are all going to hit the same bug at the same time if something goes wrong, like a power fail or crash right after doing a bunch of writes. Whereas with separate bugs, btrfs can always do fixups from the good drive whenever the bad one misbehaves.
As for proving their reliability in the case of a crash or power fail: start doing a big file copy, and yank power on all the drives at once. Reboot. Reattach the drives. Remount. Any errors? Does it mount? If you can do this 10x without errors, it might be OK.
Still, I'd probably opt to set a udev rule for all the drives and disable the write cache. The write cache isn't worth the trouble.
-- Chris Murphy
On Wed, Mar 24, 2021 at 11:26 PM Chris Murphy lists@colorremedies.com wrote:
On Wed, Mar 24, 2021 at 3:47 PM Richard Shaw hobbes1069@gmail.com wrote:
I've been running MythTV for about 10 years now and I've finally
outgrown my media storage, currently a single 4TB disk drive. I have purchased 3 additional drives of the same model and plan to put them into a BTRFS RAID1 array.
Setting nodatacow on the media directories is a no-brainer, but what
other optimizations can I do?
nodatacow means nodatasum and no compression. If the file becomes corrupt, it can't be detected or corrected.
Ok, so that may be a problem. I'm not worried about compression since it's all MPEG2/H264 files anyway but the nodatasum would defeat the purpose...
Due to all the firmware bugs, I tend to mix drive makes/models rather than get identical ones that are all going to hit the same bug at the same time if something goes wrong, like a power fail or crash right after doing a bunch of writes. Whereas with separate bugs, btrfs can always do fixups from the good drive whenever the bad one misbehaves.
So how long do you wait until you consider the drive "good"? :)
I'm not in a hurry so I could set up two of the drives in a RAID1 mirror and copy my media over and just let it run for a while before I add disks 3 & 4.
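Roughly, with hypothetical device names and mount point, that staged approach would be:

  mkfs.btrfs -L media -d raid1 -m raid1 /dev/sdc /dev/sdd
  mount /dev/sdc /mnt/media
  # ...later, once the first two have proven themselves:
  btrfs device add /dev/sde /dev/sdf /mnt/media
  btrfs balance start /mnt/media    # full balance spreads existing chunks across all four drives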
As for proving their reliability in the case of a crash or power fail: start doing a big file copy, and yank power on all the drives at once. Reboot. Reattach the drives. Remount. Any errors? Does it mount? If you can do this 10x without errors, it might be OK.
Regardless of which computer in the house it's on, critical stuff is backed up on a BackupPC server, and the super critical stuff is also backed up on multiple cloud services (Google Drive, SpiderOak, Dropbox, etc), although it has been a while since I burned a disc and put it in the fire safe :)
I would hate to lose all the media but it wouldn't be the end of the world for me so I think 10x is good enough :)
Still, I'd probably opt to set a udev rule for all the drives and disable the write cache. The write cache isn't worth the trouble.
I know you mentioned this on another thread but google is failing me as to how to do it. Do you have a useful link?
Thanks, Richard
Make sure to get NAS-type drives. The non-enterprise, non-NAS drives usually won't time out for 2-3 minutes; the NAS drives typically can be set to 7 seconds or less. You also want to evaluate setting the timeout lower. And watch out for the SMR disks; get CMR ones. The SMRs are said to suck with both btrfs and raid. I have been using 3TB WD Reds, but got burned with the 1.5TB and 3TB Seagates: 2 of my 3 Seagate 3TB drives did not make 3 years. The newest of my six 3TB WDs is >3 years old and the oldest is >7, and they only have a few bad blocks. Maybe the newer Seagates are decent, but I got burned on 2 separate generations where a significant number had to be replaced and/or died under 4 years, and most of the replacements did not last more than 3 years.
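(The 7-second figure is the drive's SCT Error Recovery Control setting; on drives that support it you can check and set it with smartctl, roughly like this, where /dev/sdb is just an example and the units are tenths of a second:

  smartctl -l scterc /dev/sdb            # show current read/write recovery limits
  smartctl -l scterc,70,70 /dev/sdb      # set both limits to 7.0 seconds
  cat /sys/block/sdb/device/timeout      # kernel-side command timeout to compare against, 30s by default
)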
Given that you will lose data anyway when the machine reboots and aborts whatever is running, disabling the write cache on the drive is often not worth it unless you have critical data. Losing a few seconds of a recording does not affect watching the show for the most part.
You might also want to get a 500GB SSD, record directly to that, and move recordings to the long-term storage nightly. MythTV syncs often and starts aborting recordings when it gets a few seconds behind, basically any time one of the drives blips, and once the drives get older that will happen from time to time, so it is just easier to have recordings land on the non-mirrored SSD. If you were really worried about leaving them on the SSD for 24 hours, you could set up a simple find task to move any recording file that has not been modified in a few minutes.
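A sketch of that find job, suitable for a cron entry, with made-up paths and a 10-minute threshold:

  # skip files modified in the last 10 minutes (still being recorded)
  find /mnt/ssd-recordings -type f -mmin +10 \
      -exec mv -n -t /mnt/media/recordings {} +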
On Thu, Mar 25, 2021 at 7:00 AM Richard Shaw hobbes1069@gmail.com wrote:
On Wed, Mar 24, 2021 at 11:26 PM Chris Murphy lists@colorremedies.com wrote:
On Wed, Mar 24, 2021 at 3:47 PM Richard Shaw hobbes1069@gmail.com wrote:
I've been running MythTV for about 10 years now and I've finally outgrown my media storage, currently a single 4TB disk drive. I have purchased 3 additional drives of the same model and plan to put them into a BTRFS RAID1 array.
Setting nodatacow on the media directories is a no-brainer, but what other optimizations can I do?
nodatacow means nodatasum and no compression. If the file becomes corrupt, it can't be detected or corrected.
Ok, so that may be a problem. I'm not worried about compression since it's all MPEG2/H264 files anyway but the nodatasum would defeat the purpose...
Due to all the firmware bugs, I tend to mix drive makes/models rather than get identical ones that are all going to hit the same bug at the same time if something goes wrong, like a power fail or crash right after doing a bunch of writes. Whereas with separate bugs, btrfs can always do fixups from the good drive whenever the bad one misbehaves.
So how long do you wait until you consider the drive "good"? :)
I'm not in a hurry so I could set up two of the drives in a RAID1 mirror and copy my media over and just let it run for a while before I add disks 3 & 4.
As for proving their reliability in the case of a crash or power fail: start doing a big file copy, and yank power on all the drives at once. Reboot. Reattach the drives. Remount. Any errors? Does it mount? If you can do this 10x without errors, it might be OK.
Regardless of which computer in the house it's on, critical stuff is backed up on a BackupPC server, and the super critical stuff is also backed up on multiple cloud services (Google Drive, SpiderOak, Dropbox, etc), although it has been a while since I burned a disc and put it in the fire safe :)
I would hate to lose all the media but it wouldn't be the end of the world for me so I think 10x is good enough :)
Still, I'd probably opt to set a udev rule for all the drives and disable the write cache. The write cache isn't worth the trouble.
I know you mentioned this on another thread but google is failing me as to how to do it. Do you have a useful link?
Thanks, Richard
On Thu, Mar 25, 2021 at 6:00 AM Richard Shaw hobbes1069@gmail.com wrote:
So how long do you wait until you consider the drive "good"? :)
I'm not in a hurry so I could set up two of the drives in a RAID1 mirror and copy my media over and just let it run for a while before I add disks 3 & 4.
We don't have much control over when and how bugs manifest. It could be that drive firmware is improperly reordering 1 in 100 commits, or only with a particular sequence of flushes, or always. Such behavior is not a problem by itself; it takes a power fail or a crash at just the right time to expose the problem. The whole point of write ordering is to make sure the file system is consistent.
In the case of btrfs, the write order is simplistically: data->metadata->flush/fua->superblock->flush/fua
What makes this safe with copy on write is no data or metadata is overwritten, so there's no in between state. All the data writes in a commit are represented by metadata (the btrees) in that commit, and those trees aren't pointed to as current and valid until they are on stable media. That's the point of the first flush. Then comes the superblock which is what points to the new trees.
The problem happens if the super is written pointing to new trees before all the metadata (or data) has arrived on stable media. And then you get a power fail or crash. Now the super block is pointing to locations that don't have consistent tree state and you get some variation on mount failure.
Still, I'd probably opt to set a udev rule for all the drives and disable the write cache. The write cache isn't worth the trouble.
I know you mentioned this on another thread but google is failing me as to how to do it. Do you have a useful link?
hdparm -W 0, but double check that: there's a lowercase -w and an uppercase -W. The uppercase one controls the write cache; the lowercase one is dangerous.
$ cat /etc/udev/rules.d/69-hdparm.rules
ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="sd*[!0-9]", \
  ENV{ID_SERIAL_SHORT}=="WDZ47F0A", \
  RUN+="/usr/sbin/hdparm -B 100 -S 252 /dev/disk/by-id/wwn-0x5000c500a93cae8a"
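A variant of that rule aimed at the write cache instead would be along these lines (the serial number is a placeholder; note it's the uppercase -W option):

ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="sd*[!0-9]", \
  ENV{ID_SERIAL_SHORT}=="XXXXXXXX", \
  RUN+="/usr/sbin/hdparm -W 0 /dev/%k"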
On Fri, Mar 26, 2021 at 9:41 AM Chris Murphy lists@colorremedies.com wrote:
On Thu, Mar 25, 2021 at 6:00 AM Richard Shaw hobbes1069@gmail.com wrote:
So how long do you wait until you consider the drive "good"? :)
I'm not in a hurry so I could set up two of the drives in a RAID1 mirror
and copy my media over and just let it run for a while before I add disks 3 & 4.
We don't have much control over when and how bugs manifest. It could be that drive firmware is improperly reordering 1 in 100 commits, or only with a particular sequence of flushes, or always. Such behavior is not a problem by itself; it takes a power fail or a crash at just the right time to expose the problem. The whole point of write ordering is to make sure the file system is consistent.
In the case of btrfs, the write order is simplistically: data->metadata->flush/fua->superblock->flush/fua
What makes this safe with copy on write is no data or metadata is overwritten, so there's no in between state. All the data writes in a commit are represented by metadata (the btrees) in that commit, and those trees aren't pointed to as current and valid until they are on stable media. That's the point of the first flush. Then comes the superblock which is what points to the new trees.
The problem happens if the super is written pointing to new trees before all the metadata (or data) has arrived on stable media. And then you get a power fail or crash. Now the super block is pointing to locations that don't have consistent tree state and you get some variation on mount failure.
Ok, so I'm struggling a bit here :)
I appreciate all the detailed responses, but at the same time the answers often seem bipolar... You can do all these great things with BTRFS! But even if you test your raid array multiple times, a bad firmware may still eat all your data :)
I know you can't give absolute answers sometimes, but it feels like you're often a btrfs cheerleader and critic at the same time :)
It sounds like there needs to be an easy to find list of known good and known bad drives to use (be it btrfs or other similar filesystem).
My plan is to use Seagate Terascale drives which as far as I can tell are CMR and not SMR at least.
Thanks, Richard
On Sun, 28 Mar 2021 at 14:47, Richard Shaw hobbes1069@gmail.com wrote:
Ok, so I'm struggling a bit here :)
I appreciate all the detailed responses, but at the same time the answers often seem bipolar... You can do all these great things with BTRFS! But even if you test your raid array multiple times, a bad firmware may still eat all your data :)
That is true for any filesystem. Having worked with spinning disks for several decades, I have seen all sorts of failures.
I know you can't give absolute answers sometimes, but it feels like you're often a btrfs cheerleader and critic at the same time :)
It sounds like there needs to be an easy to find list of known good and
known bad drives to use (be it btrfs or other similar filesystem).
At various times there have been reports with statistics on failure rates from places like Google that use huge numbers of disks. The problem is that modern drives are generally quite reliable when used properly, and new models are introduced so frequently that by the time you have statistics, the current offerings are different.
There have also been efforts to predict eminent drive failure (e.g., using S.M.A.R.T) but without much success.
There are things that will kill drives, like a runaway batch workflow that fills the disk over a long weekend and continuously restarts each time a write fails, or erratic power supplies.
My plan is to use Seagate Terascale drives which as far as I can tell are CMR and not SMR at least.
Your plan needs to consider backups and/or replication (to cloud or another site). It is easy and cheap to lose data. Not losing data is not easy and not cheap.
On Sun, 2021-03-28 at 19:30 -0300, George N. White III wrote:
There have also been efforts to predict eminent drive failure (e.g., using S.M.A.R.T) but without much success.
It took me a moment to wonder what would be famous/respected about drive failures. ;-) But I've often wondered if SMART does anything useful. If it detects an imminent problem it needs to notify you about it, and with a warning that's understandable.
I used to see system emails like this:
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb, 4 Offline uncorrectable sectors
For details see host's SYSLOG (default: /var/log/messages).
Which were useful to me, but probably obscure to a lot of people. That was on a system with two drives, one in use and one bodgy one for testing, and the errors never increased over several years. It was always consistently telling me that.
I'm recently seeing info like this in logwatch emails:
**Unmatched Entries**
Device: /dev/sda [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Which makes little sense to me. The system is a 24/7 server, not often rebooted. It's a solid state drive, and I don't know what the hex that means (pun intended). I've no idea if that's an error, or if it's just telling me that drive has changed modes (idle/active).
And I don't know what kind of warnings people get who don't have system emails anymore.
Logically I'd expect that if SMART thought the drive might need checking or chucking, it'd start to give me useful warnings ahead of time, and I might be lucky enough to backup my files before disaster struck. But the warnings ain't that useful. And, of course, it's entirely possible for a drive to spontaneously fail before any scheduled SMART test took place.
On Sun, Mar 28, 2021 at 11:47 AM Richard Shaw hobbes1069@gmail.com wrote:
I appreciate all the detailed responses, but at the same time the answers often seem bipolar... You can do all these great things with BTRFS! But even if you test your raid array multiple times, a bad firmware may still eat all your data :)
It's uncommon even in the single device case. It takes a combination: a crash or power failure happening at the same time the firmware does something wrong. If you add raid1 to the mix, it's even less common all problems happen at the same time to two drives, in particular if the two drives don't share firmware bugs.
But if there is a problem, there's a decent (not certain) chance of recovery by doing:
mount -o rescue=usebackuproot
There are three backup roots. If one is good, the mount succeeds and nothing more needs to be done. But you might lose up to a couple of minutes of writes. Basically Btrfs does a rollback to a known good root and drops all the bad ones. Is this eating your data? Yes, as much as 2 minutes' worth.
If that doesn't work, what comes next depends on the kernel messages shown when the mount fails. And that unfortunately takes familiarity with Btrfs mount failure messages. My advice if anyone runs into a problem is to report a complete dmesg, and btrfs check --readonly.
And in the meantime they can try: mount -o rescue=all
That is a read-only mount that can skip over many kinds of significant root tree damage, and you can still get your data out using normal tools like rsync, cp, or any other backup program. And freshen up the backups just in case the problem can't be fixed.
If that doesn't work there's 'btrfs restore' which is an offline scrape utility. Ugly, but again data not eaten.
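To make that concrete, the copy-out and the last-resort scrape look roughly like this (device and destination paths are made up; rescue=all requires a read-only mount):

  mount -o ro,rescue=all /dev/sdc /mnt/rescue
  rsync -aHAX /mnt/rescue/ /path/to/backup/

  # if it won't mount at all, scrape offline:
  btrfs restore /dev/sdc /path/to/recovered/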
I know you can't give absolute answers sometimes, but it feels like you're often a btrfs cheerleader and critic at the same time :)
Yes.
It sounds like there needs to be an easy to find list of known good and known bad drives to use (be it btrfs or other similar filesystem).
Yes but that takes a lot of data and work to find the trends and then publish it and maintain it.
On Sun, Mar 28, 2021 at 4:31 PM George N. White III gnwiii@gmail.com wrote:
Your plan needs to consider backups and/or replication (to cloud or another site). It is easy and cheap to lose data. Not losing data is not easy and not cheap.
Exactly this. If the data is important, it's backed up. If it's not backed up, then in some sense, both fair and unfair, the data isn't important to you.
And I mean, it's an unfair penalty to find out the importance of backups via data loss. It's a disproportionate penalty.
Raid1 is not a backup. If you accidentally delete a file, for sure raid1 will ensure both mirrors record the accident.
Snapshots aren't a backup, because they aren't fully independent from either the file system or hardware. (They can make it easier to do a 'btrfs restore' in some cases - but I don't want folks thinking they can depend on restore, it's seriously a PITA. Very capable, but tedious and requires patience.)
In aggregate, with large volume users with consumer hardware, we aren't seeing Btrfs fall over more often than other file systems. It's true that it's harder to fix if it gets broken. It's also true you have multiple chances of recovery if it gets broken but all of them are less convenient than having a backup.
Nevertheless, if it breaks, we still want bug reports because, to the degree that it is possible, we don't want it breaking.
Btrfs is quite noisy (in dmesg) when there are problems. It's not subtle about it, by design. Corruption of any kind is not normal, the source should be found. Persistent stats are also retained in the file system metadata, retrievable by:
btrfs device stats <mountpoint> (or the /dev node(s))
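And a periodic scrub is what actually exercises those checksums and lets raid1 repair from the good copy; a minimal sketch with a hypothetical mount point:

  btrfs scrub start /mnt/media      # runs in the background; add -B to wait for it
  btrfs scrub status /mnt/media     # shows progress and any csum/read errors found
  btrfs device stats /mnt/media     # the persistent per-device counters mentioned above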
On Sun, Mar 28, 2021 at 6:51 PM Tim via users users@lists.fedoraproject.org wrote:
I'm recently seeing info like this in logwatch emails:
**Unmatched Entries**
Device: /dev/sda [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Which makes little sense to me. The system is a 24/7 server, not often rebooted. It's a solid state drive, and I don't know what the hex that means (pun intended). I've no idea if that's an error, or if it's just telling me that drive has changed modes (idle/active).
I'm not sure. The smartmontools folks probably know. They have a mailing list. Ask them and let us know?
Logically I'd expect that if SMART thought the drive might need checking or chucking, it'd start to give me useful warnings ahead of time, and I might be lucky enough to backup my files before disaster struck. But the warnings ain't that useful. And, of course, it's entirely possible for a drive to spontaneously fail before any scheduled SMART test took place.
The trend seems to be it can be semi-useful for HDD. The pending sectors and seek errors going up quickly is a pretty good warning something is going wrong.
For SSDs, your first indication of prefail is often checksum errors, at least in Btrfs land. Either it returns garbage or zeros. It seems the drive itself is not likely to report an uncorrectable read or write error like you'd get on an HDD. And the next level of failure for an SSD is it goes read-only. Not the file system, the drive; and this too is totally silent in my handful of experiences with this failure mode: it happily accepts all write commands without error, but none of them are persistent. Another common mode of SSD failure is that it just vanishes off the bus. Does not say bye! It just goes bye!
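For the HDD case, a quick way to eyeball those attributes by hand (device name is just an example):

  smartctl -H /dev/sda                                    # overall health verdict
  smartctl -A /dev/sda | grep -Ei 'pending|realloc|seek'  # the attributes worth watching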
So after about 12 forced power offs while copying data (via rsync) this is the only output from dmesg:
# dmesg | grep -i btrfs
[ 0.776375] Btrfs loaded, crc32c=crc32c-generic, zoned=yes
[ 5.497241] BTRFS: device fsid d9a2a011-77a2-43be-acd1-c9093d32125b devid 1 transid 389 /dev/sdc scanned by systemd-udevd (732)
[ 5.521210] BTRFS: device fsid d9a2a011-77a2-43be-acd1-c9093d32125b devid 2 transid 389 /dev/sdd scanned by systemd-udevd (743)
[ 6.743097] BTRFS info (device sdc): disk space caching is enabled
[ 6.743100] BTRFS info (device sdc): has skinny extents
[ 61.730008] BTRFS info (device sdc): the free space cache file (1833981444096) is invalid, skip it
Should I worry about the last line?
Thanks, Richard
On Mon, Mar 29, 2021 at 7:17 AM Richard Shaw hobbes1069@gmail.com wrote:
So after about 12 forced power offs while copying data (via rsync) this is the only output from dmesg:
# dmesg | grep -i btrfs
[ 0.776375] Btrfs loaded, crc32c=crc32c-generic, zoned=yes
[ 5.497241] BTRFS: device fsid d9a2a011-77a2-43be-acd1-c9093d32125b devid 1 transid 389 /dev/sdc scanned by systemd-udevd (732)
[ 5.521210] BTRFS: device fsid d9a2a011-77a2-43be-acd1-c9093d32125b devid 2 transid 389 /dev/sdd scanned by systemd-udevd (743)
[ 6.743097] BTRFS info (device sdc): disk space caching is enabled
[ 6.743100] BTRFS info (device sdc): has skinny extents
This is the usual case. Btrfs kernel code reads the super which is only pointing to a completely consistent set of btrees. A variation on this is if a process making use of fsync was interrupted by the crash, you might also see this line:
BTRFS info (device vda3): start tree-log replay
That also is normal and fine. No fsck required, nor scrub required. You don't need to do anything. If the hardware is doing the correct thing, you can hit it with power failures while writing all day long and the file system will never care about it. This is by design. But also Btrfs developers wrote dm-log-writes expressly for power failure testing, and is now used in xfstests for regression testing all the common file systems in the kernel. So there's quite a lot of certainty that write ordering is what it should be.
[ 61.730008] BTRFS info (device sdc): the free space cache file (1833981444096) is invalid, skip it
Should I worry about the last line?
Nope. It's just gone stale, because it didn't get updated prior to the power fail. They rebuild quickly on their own. And it's not critical metadata, just an optimization.
(This is a reference to the v1 space cache, which exists as hidden files in data block groups. The stable and soon-to-be-default v2 space cache tree moves this to metadata block groups, and it's quite a lot more resilient and more performant for very busy and larger file systems. Anyone can switch to v2 by doing a one-time mount with the mount option space_cache=v2. Currently it needs to be a new mount; there's a bug that causes a remount to claim it's using v2 without actually setting up the v2 tree. Once this mount option is used, a feature flag is set and it'll always be used from that point on. It's not something that goes in fstab; use it one time and forget about it. Sorta like a file system upgrade, if you will. Note that large file systems might see a long first mount with this option being initially set. I've anecdotally heard of it taking hours for a 40T file system, because the whole tree must be created and written before the mount can complete. For me on a full 1T file system it took *maybe* 1 minute.)
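In practice the one-time switch is just (device and mount point hypothetical):

  umount /mnt/media
  mount -o space_cache=v2 /dev/sdc /mnt/media    # builds the free space tree during this mount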
On Sun, 28 Mar 2021 at 21:51, Tim via users users@lists.fedoraproject.org wrote:
On Sun, 2021-03-28 at 19:30 -0300, George N. White III wrote:
There have also been efforts to predict eminent drive failure (e.g., using S.M.A.R.T) but without much success.
It took me a moment to wonder what would be famous/respected about drive failures. ;-) But I've often wondered if SMART does anything useful. If it detects an imminent problem it needs to notify you about it, and with a warning that's understandable.
I have obtained a warranty replacement on the basis of the S.M.A.R.T. report. For disk-intensive processing I recommend replacing drives before the warranty expires, because the rate of failures increases shortly after end-of-warranty. The price of new drives is cheap compared to the value of lost time dealing with a drive that fails in service, and I was usually able to double the capacity of the original drive.
I used to see system emails like this:
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb, 4 Offline uncorrectable sectors
For details see host's SYSLOG (default: /var/log/messages).
Which were useful to me, but probably obscure to a lot of people. That was on a system with two drives, one in use and one bodgy one for testing, and the errors never increased over several years. It was always consistently telling me that.
I'm recently seeing info like this in logwatch emails:
**Unmatched Entries**
Device: /dev/sda [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Which makes little sense to me. The system is a 24/7 server, not often rebooted. It's a solid state drive, and I don't know what the hex that means (pun intended). I've no idea if that's an error, or if it's just telling me that drive has changed modes (idle/active).
And I don't know what kind of warnings people get who don't have system emails anymore.
Gnome uses dbus notifications: https://developer.gnome.org/notification-spec/ There is also gsmartcontrol, a GUI front-end for smartctl: https://sourceforge.net/projects/gsmartcontrol/
As usual, Arch has excellent documentation: https://wiki.archlinux.org/index.php/S.M.A.R.T. discusses notification strategies, including email and desktop.
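For the email route, smartd's own config covers it; a minimal /etc/smartd.conf sketch (the schedule and recipient are just examples):

  # monitor everything, enable offline data collection and attribute autosave,
  # run a short self-test daily at 02:00 and a long one Saturdays at 03:00,
  # and mail root when something looks wrong
  DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root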
Temperature and flooding are the most urgent out-of-bounds conditions. There are many systems for reporting these conditions using cell-phone technology and there are USB controlled switches/relays that could be used to trigger one of these systems.
Logically I'd expect that if SMART thought the drive might need checking or chucking, it'd start to give me useful warnings ahead of time, and I might be lucky enough to backup my files before disaster struck. But the warnings ain't that useful. And, of course, it's entirely possible for a drive to spontaneously fail before any scheduled SMART test took place.
For me, the most common advance warning of a drive about to fail has been users complaining that their system is too slow. This is usually accompanied by some S.M.A.R.T. evidence despite a "healthy" status report. I have also seen widespread problems with older drives after a winter power outage that left the building much colder than normal.