Is SMART really that dumb?

Sat Mar 14 23:13:15 UTC 2015

On Sat, Mar 14, 2015 at 4:09 PM, Tom Horsley <horsley1953 at gmail.com> wrote:
> On Sat, 14 Mar 2015 16:53:15 -0500
> Roger Heflin wrote:
>
>> Also usually the errors are found by linux doing a read against it, so
>> there should be error messages on the reads in the messages file when
>> it happened, that is usually what I use to determine what sectors are
>> getting the error.
>
> Yea, I poked around in the logs and the very first thing
> that looks like any kind of error is the smart message
> showing up for the first time (and repeating every
> 30 minutes since then in an attempt to fill up the logs :-).

I'd say the first step is to confirm this is due to a media error
rather than something else, otherwise you end up down a rat hole.

The top post here is a good example of a URE due to media error.
http://ubuntuforums.org/archive/index.php/t-1034762.html

If the drive spends longer on a recovery than the kernel's command
timer allows (30 seconds by default), you'll get errors along these
lines (this is a write example, which is really bad; the read version
is more common).

[ 2161.457698] ata8.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x6 frozen
[ 2161.457709] ata8.00: failed command: WRITE FPDMA QUEUED
[ 2161.457718] ata8.00: cmd 61/00:00:80:c4:2c/02:00:1e:00:00/40 tag 0 ncq 262144 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 2161.457723] ata8.00: status: { DRDY }
...
[ 5628.308982] ata8.00: failed command: WRITE FPDMA QUEUED
[ 5628.308990] ata8.00: cmd 61/80:50:80:34:44/01:00:50:00:00/40 tag 10 ncq 196608 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 5628.308993] ata8.00: status: { DRDY }
[ 5628.309000] ata8: hard resetting link
[ 5638.311674] ata8: softreset failed (1st FIS failed)
[ 5638.311686] ata8: hard resetting link
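
If you want to pull just this kind of line out of a noisy log, something
like the following might do. It's only a sketch: ata_errors is a name I
made up, and the pattern is based on the excerpt above, so it will miss
messages that are worded differently.

```shell
# ata_errors: read kernel log text on stdin and print only the libata
# lines that look like the excerpt above (exceptions, failed commands,
# timeouts, link resets). Hypothetical helper, not an existing tool.
ata_errors() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|hard resetting link|softreset failed)|Emask 0x[0-9a-f]+ \(timeout\)'
}

# Typical use (reading the kernel log may need root):
#   dmesg | ata_errors
```

An empty result only means nothing matched this particular pattern, not
that the drive is healthy.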

This is a how-to on dealing with bad sectors, including partial recovery.
http://www.smartmontools.org/browser/trunk/www/badblockhowto.xml

But the tl;dr for all of that, in my opinion, is to update your
backups, and then obliterate the drive with writes. Only on a write
does the firmware determine if sector problems are transient or
persistent. If it's a persistent problem, then the LBA is reassigned
to a reserve sector. Once this is all done, then you can restore from
backups.

To do the write correctly, first you have to know if you have a 512n
or 512e drive. Most drives these days are 512e: 512 byte logical,
4096 byte physical. The LBA in the error is the first logical sector
in the bad physical sector. So writing over just that 512 byte sector
will not work (it'll fail as a read error even though you're writing,
due to a read-modify-write attempt by the drive firmware). 'parted -l'
will tell you which type of drive you have.
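
The kernel also exposes both sector sizes in sysfs, which is easier to
script than parted output. A minimal sketch, assuming the usual Linux
files under /sys/block/sdX/queue; classify_drive is a hypothetical
helper name and sdX is a placeholder for the real device:

```shell
# classify_drive LOGICAL PHYSICAL
# Print 512n, 512e, or 4Kn given the logical and physical sector sizes
# in bytes. Hypothetical helper for illustration only.
classify_drive() {
    case "$1/$2" in
        512/512)   echo 512n ;;
        512/4096)  echo 512e ;;
        4096/4096) echo 4Kn ;;
        *)         echo "unknown ($1/$2)" ;;
    esac
}

# Example for a placeholder disk sdX (substitute the real device name):
#   q=/sys/block/sdX/queue
#   classify_drive "$(cat "$q/logical_block_size")" "$(cat "$q/physical_block_size")"
```

A result of 512e is the case where a 512 byte write to the bad LBA
triggers the read-modify-write described above.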

What I suggest is this:

# badblocks -b 4096 -svw /dev/sdX

This is destructive! Note that any block numbers reported by
badblocks are predicated on the -b value, so a reported value isn't a
sector LBA. With -b 4096 on a drive with 512 byte logical sectors, you
have to multiply by 8 to get the LBA. But after this cycles through
even once, the problem should be resolved. You could let it run
through all four of its test patterns. What ought to be true is that
you either get no errors (meaning the read errors weren't media
errors, just bad data, like from torn writes or something) or you get
some write errors with reallocations on the first pass, and no errors
on subsequent passes. If any subsequent passes have errors, especially
corruption errors, then get rid of the drive or turn it into a
plaything or send it to me :-D
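
For the block-to-LBA arithmetic, here's a small sketch. The function
names are made up for illustration, and the defaults assume -b 4096 and
512 byte logical sectors as in the command above:

```shell
# block_to_lba BLOCK [BLOCK_SIZE] [SECTOR_SIZE]
# Convert a badblocks block number to the first logical sector (LBA) it
# covers. Defaults assume badblocks -b 4096 and 512 byte logical sectors.
block_to_lba() {
    echo $(( $1 * ${2:-4096} / ${3:-512} ))
}

# lba_to_block LBA [BLOCK_SIZE] [SECTOR_SIZE]
# Convert an LBA (e.g. from a kernel or SMART message) back to the
# badblocks block number that contains it.
lba_to_block() {
    echo $(( $1 / ( ${2:-4096} / ${3:-512} ) ))
}

block_to_lba 1000    # prints 8000
lba_to_block 8003    # prints 1000
```

Each 4096 byte block spans 8 logical sectors, so block 1000 covers LBAs
8000 through 8007.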

-- 
Chris Murphy