sector hard errors

Chris Murphy lists at colorremedies.com
Fri Mar 13 03:07:48 UTC 2015


On Thu, Mar 12, 2015 at 7:45 PM, Roger Heflin <rogerheflin at gmail.com> wrote:
> Unless you have the drive under raid, that means 15 sectors cannot
> be read and you have lost at least some data.
>
> Drives normally will not move sectors that they cannot successfully read.
>
> You may be able to copy the data off the disk, but when trying this
> you may find a lot more bad sectors than the 15 that are currently
> pending, so you may find you have lost more data than 15 bad sectors
> would indicate.

Yes, could be true.
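
If you do try to copy the data off first, a read-error-tolerant imager
is the usual approach. A rough sketch with GNU ddrescue (the device
name and output paths here are placeholders, adjust for your setup):

# First pass: image what reads cleanly, skip bad areas, keep a mapfile
ddrescue -d -n /dev/sdX /mnt/backup/sdX.img /mnt/backup/sdX.map
# Second pass: go back and retry the bad areas a few times
ddrescue -d -r3 /dev/sdX /mnt/backup/sdX.img /mnt/backup/sdX.map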

For a 3-month-old drive with 15 sectors pending reallocation so far:
back it up and get it replaced under warranty. This fits the profile
of a drive in early failure. Most drives don't do this, but the drives
in a batch that will exhibit early failure tend to do so right around
the 3-month mark.
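
The relevant SMART attributes are easy to check with smartctl (device
name below is a placeholder):

# Attribute 197 = Current_Pending_Sector, 5 = Reallocated_Sector_Ct
smartctl -A /dev/sdX | grep -iE 'pending|realloc'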

Depending on which parts of the fs metadata are affected, 15 bad
sectors could make a good portion of the volume unrecoverable.



> For the future my only suggestion is to use raid and/or force a full
> read of the disk at least weekly, so that the disk will detect
> "weak" sectors and either rewrite them or move them as needed.   On
> all of my critical disks I run a full read and/or smartctl -t long
> at least weekly.   From playing with a disk that is going bad, it
> appears that running -t long daily might keep ahead of sectors going
> bad, but that means the test runs the drive for several hours each
> day (though it can still be accessed for normal use during the test).

I'd say weekly is aggressive, but reasonable for an important array.
The scrubs are probably more valuable, because any explicit read errors
get fixed up, whereas that's not necessarily the case for smartctl -t
long. I have several drives in enclosures with crap bridge chipsets,
so smartctl doesn't work and they've never had SMART testing; none
have had read or write errors. I'd say if it takes even weekly
extended offline SMART tests to avoid problems, the drive is bad.
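
For reference, a weekly scrub plus a long self-test can be kicked off
along these lines (the md array name, mount point, and device are
placeholders; you'd normally put this in a cron job or systemd timer):

# md raid: start a scrub; mismatches are repaired from redundancy
echo check > /sys/block/md0/md/sync_action
# Btrfs: scrub verifies checksums and fixes what it can from mirrors
btrfs scrub start /mnt/data
# SMART extended offline self-test (read-only surface scan)
smartctl -t long /dev/sdX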

Understood this thread's use case isn't raid, but it bears repeating:
by default, consumer drives like this one will often attempt much
longer recoveries than the SCSI command timer value of 30 seconds;
such recoveries get thwarted by the ensuing link reset, so the problem
sector(s) aren't revealed and don't get fixed. This is a problem for
md, LVM, ZFS and Btrfs raid, so the configuration has to be correct.
This comes up monthly on linux-raid@ (or more often, sometimes several
times per week), and a high percentage of the time all data on the
raid is lost in the ensuing recovery attempt.
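
Concretely, the usual fix discussed on linux-raid is to make sure the
drive gives up before the kernel does, roughly along these lines
(device names are placeholders):

# Check whether the drive supports SCT Error Recovery Control
smartctl -l scterc /dev/sdX
# If it does, cap recovery at 7 seconds (values are in tenths of a second)
smartctl -l scterc,70,70 /dev/sdX
# If it doesn't, raise the kernel's SCSI command timer instead (seconds)
echo 180 > /sys/block/sdX/device/timeout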

Backups!


-- 
Chris Murphy

