On Sun, 28 Mar 2021 at 21:51, Tim via users users@lists.fedoraproject.org wrote:
On Sun, 2021-03-28 at 19:30 -0300, George N. White III wrote:
There have also been efforts to predict eminent drive failure (e.g., using S.M.A.R.T) but without much success.
It took me a moment to wonder what would be famous/respected about drive failures. ;-) But I've often wondered if SMART does anything useful. If it detects an imminent problem it needs to notify you about it, and with a warning that's understandable.
I have obtained a warranty replacement on the basis of the S.M.A.R.T. report. For disk-intensive processing I recommend replacing drives before the warranty expires because the rate of failures increases shortly after end-of-warranty. The price of new drives is cheap compared to the value of lost time dealing with a drive that fails in service, and I was usually able to double the capacity of the original drive.
I used to see system emails like this:
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb, 4 Offline uncorrectable sectors
For details see host's SYSLOG (default: /var/log/messages).

Which were useful to me, but probably obscure to a lot of people. That was on a system with two drives, one in use and one bodgy one for testing, and the errors never increased over several years. It was always consistently telling me that.
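For anyone who wants to track whether a count like that is actually growing, here is a minimal Python sketch that pulls the device, count and kind out of a warning line shaped like the one quoted above. The regular expression is only inferred from that single example (plus the optional "[SAT]" tag seen in other smartd lines), and the names parse_smartd_warning and count_increased are made up for illustration, so treat it as a starting point rather than a complete parser for everything smartd can log.

```python
import re

# Assumption: the pattern is inferred from one example line,
# "Device: /dev/sdb, 4 Offline uncorrectable sectors".
# smartd may log other message shapes that this does not match.
WARNING_RE = re.compile(
    r"Device:\s+(?P<device>/dev/\S+?)(?:\s+\[[^\]]*\])?,\s+"
    r"(?P<count>\d+)\s+(?P<kind>.+?)\s+sectors"
)


def parse_smartd_warning(line):
    """Return (device, count, kind) for a sector-count warning, else None."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("count")), match.group("kind")


def count_increased(previous, current):
    """True when the newly reported count is higher than the last one seen."""
    return current > previous


if __name__ == "__main__":
    line = "Device: /dev/sdb, 4 Offline uncorrectable sectors"
    print(parse_smartd_warning(line))
```

Keeping the last count per device and comparing it with count_increased would separate a drive that keeps reporting the same 4 sectors from one whose count is climbing.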
I'm recently seeing info like this in logwatch emails:
**Unmatched Entries** Device: /dev/sda [SAT], CHECK POWER STATUS spins up disk (0x81 ->0xff)
Which makes little sense to me. The system is a 24/7 server, not often rebooted. It's a solid state drive, and I don't know what the hex that means (pun intended). I've no idea if that's an error, or if it's just telling me that drive has changed modes (idle/active).
And I don't know what kind of warnings people get who don't have system emails anymore.
GNOME: desktop notifications (https://developer.gnome.org/notification-spec/) use D-Bus. See also GSmartControl: https://sourceforge.net/projects/gsmartcontrol/
As usual, Arch has excellent documentation: https://wiki.archlinux.org/index.php/S.M.A.R.T. discusses notification strategies, including email and desktop.
Temperature and flooding are the most urgent out-of-bounds conditions. There are many systems for reporting these conditions using cell-phone technology and there are USB controlled switches/relays that could be used to trigger one of these systems.
Logically I'd expect that if SMART thought the drive might need checking or chucking, it'd start to give me useful warnings ahead of time, and I might be lucky enough to back up my files before disaster struck. But the warnings ain't that useful. And, of course, it's entirely possible for a drive to spontaneously fail before any scheduled SMART test takes place.
For me, the most common advance warning of a drive about to fail has been users complaining that their system is too slow. This is usually accompanied by some S.M.A.R.T. evidence despite a "healthy" status report. I have also seen widespread problems with older drives after a winter power outage that left the building much colder than normal.
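As a rough illustration of looking for evidence behind a "healthy" status, the Python sketch below scans the ATA attribute table printed by `smartctl -A` and reports non-zero raw values for three sector-related attributes. The sample table is invented for the example, the choice of attributes to watch is my assumption rather than an official list, and the function name is hypothetical. NVMe drives print a different layout that this does not handle.

```python
# Assumption: these three attributes are the ones worth watching; other
# attributes (and vendor-specific ones) can matter as well.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def nonzero_watched_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns;
        # the header row and any other text are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found


# Invented sample in the layout `smartctl -A` uses for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       4
"""

if __name__ == "__main__":
    print(nonzero_watched_attributes(SAMPLE))
```

In practice the text would come from running `smartctl -A` against the device (root is usually required) instead of the embedded sample.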