[Fwd: F7: Howto monitoring a Hardware sata raid controller]

Tim ignored_mailbox at yahoo.com.au
Fri Jul 6 11:34:56 UTC 2007


Tim:
>> Well, the manual off-line test didn't do anything helpful about
>> correcting errors when I tried it.  It discovered some, but that was
>> all.

Tony Nelson:
> If the error is correctable (the data could be read) then the disk will
> rewrite the data and remap the sector if it is bad.  If the data could not
> be read, then the sector is put in a list of offline-uncorrectable errors
> and the sector will be reallocated the next time it is read.

And if it can never read that sector, you see that error each and every
time.  Of course, I have no idea which file sits on the dodgy spot; it's
not as if you see "error trying to read named.conf" *then* the hda error
message, so it's not obviously correlatable to anything.
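
For what it's worth, the LBA of the failing sector does show up in the
SMART error log, and on ext2/3 you can, in principle, map it back to a
file with debugfs (this is the approach from the smartmontools bad-block
HOWTO; the device and numbers below are made-up examples):

  smartctl -l error /dev/hda            # note the LBA of the failure
  # fs block = (LBA - partition's start sector) * 512 / fs block size
  debugfs -R "icheck 123456" /dev/hda2  # block number -> inode
  debugfs -R "ncheck 654321" /dev/hda2  # inode -> file name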

> When the off-line long test (not the short one) is run, it has been more
> than 4 hours since the last test.

Yes, it was the long test, and I ran it several times over a few days
while experimenting with things.
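
(For anyone following along, the usual smartmontools commands for that
are below; the drive name is just an example.)

  smartctl -t long /dev/hda      # start the long self-test
  smartctl -l selftest /dev/hda  # view the results once it finishes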

> The advantage of the automatic offline test is that it sees each
> sector frequently, and may be able to rescue the data before it is
> lost.  What you did is not comparable.

There's no indication in the manual, or in any other documentation, that
the auto test does anything other than test.  I didn't see anything,
anywhere, indicating that things would get fixed.
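
As far as I can tell, the automatic offline collection Tony describes
can be switched on and checked with smartctl (drive name, again, is just
an example):

  smartctl -o on /dev/hda  # enable automatic offline data collection
  smartctl -c /dev/hda     # capabilities, shows whether it's enabled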

Anyway, with this error (being unable to read something), the only fix
would have been to replace the (unknown) data with the correct
information from an external source.
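
(That said, if you've written the data off entirely, I gather you can
force the drive to reallocate just that one sector by overwriting it.
The LBA here is hypothetical, and you'd want to be very sure of it
before running this, since it destroys whatever was there:)

  dd if=/dev/zero of=/dev/hda bs=512 count=1 seek=123456789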

>> I'd been given a few duff drives a while back.  Things like that are
>> always useful to experiment with.  I did find that using dd if=/dev/zero
>> of=/dev/hdb to write to all sectors on the drive caused the drive to
>> sort itself out.  Afterwards it passed the SMART check with flying
>> colours.

> This is not the same as a low-level format.  What you did used some of the
> spare sectors, leaving fewer spare sectors for remapping in normal
> operation, and requiring a long seek for each spared sector.

Oh, I know that.  It was just a way of exercising the drive with
unimportant data so it could sort itself out.  The drive found an
error, and it moved that sector out of further use.
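
You can watch that happen in the SMART attributes, if you're curious;
the reallocated and pending counts are the interesting ones:

  smartctl -A /dev/hdb | grep -i -e reallocated -e pending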

Even an ordinary high-level format should have done the same trick,
though I'd imagine you'd have to do it twice (I'd foresee it finding an
error and complaining the first time).  But Linux doesn't really format
the drive during the install routine.

After doing that zeroing trick with the last drive, I formatted the
drives myself, using the checking option, before the install routine,
to see if any problems were discovered.  Nothing came up.

e.g. mkfs.ext3 -c -v -L lazarus/home /dev/sdb3
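
(If memory serves, giving -c twice makes mke2fs do a slower read-write
test instead of the read-only one, and you can also run the check by
hand with badblocks:)

  mkfs.ext3 -c -c -v -L lazarus/home /dev/sdb3
  badblocks -sv /dev/sdb3  # read-only scan, with progress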

A drive check used to be an install option with older Red Hat releases.

>> The drives have, since, also been reliable in two test boxes with FC4, 6
>> & 7 over several months. ...

> Having sectors go bad is normal for modern hard drives, both because of
> their high density and their large number of sectors.  If sectors go bad at
> a low rate, it is simply a waste to replace the drive with one that will
> hopefully perform the same way and not just fail early.

That's my way of thinking, too.  They have some in-built level of error
handling for that reason.  So long as the faults are insignificant,
you're okay.  I'm certainly not keen on coughing up $150 for a new
drive *just* because two sectors stopped working, for instance.
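
That's where smartd earns its keep, too: it can watch those counters
and mail you when they change, instead of you replacing drives on spec.
Something like this in /etc/smartd.conf (the address is an example):

  /dev/hda -a -o on -S on -m root@localhost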

-- 
(This box runs FC5, my others run FC4 & FC6, in case that's
 important to the thread.)

Don't send private replies to my address, the mailbox is ignored.
I read messages from the public lists.



