Questionable Status

Thu Oct 1 17:54:32 UTC 2009

On 09-10-01 09:09:40, Robin Laing wrote:
> Tony Nelson wrote:
> > On 09-09-23 09:29:56, Gene Poole wrote:
> >> I've very recently upgraded 2 of my machines.  One machine was
> >> upgraded from Fedora 9 to Fedora 11, and the other machine was 
> >> upgraded from Fedora 10 to Fedora 11.  On machine 1 I have 2 hard 
> >> disks (both Seagates - 500 GB and 1000 GB), on machine 2 I have 
> >> 1 hard disk (Western Digital 320 GB).  All of the interfaces are
> >> SATA.  The questionable status is that on machine 1 the 500 GB 
> >> drive is showing as failing and on machine 2 the 320 GB drive is 
> >> showing as failing. Neither drive, under the old releases, showed 
> >> up as failing.  How do I know that these drives are truly failing?
> > 
> > 1) Wait.  If the disk is going bad, it will fail.
> > 
> > 2) Run as root `smartctl -A /dev/sdx` (for each sdx) and look at 
> > the "WHEN_FAILED" column; it will be "-" if not failed.
> > 
> > 3) Run as root `smartctl -a /dev/sdx` (for each sdx) and look at 
> > the whole output.
> > 
> > 4) Run as root `smartctl -t long /dev/sdx` (for each sdx) and wait 
> > until the time the test should finish, then view the results with 
> > `smartctl -l selftest /dev/sdx` (for each sdx) or `smartctl -a
> > /dev/sdx` (for each sdx).
> > 
> > See `man smartctl`.
> > 
> > Note that the new disk health monitoring tool "palimpsest" in 
> > package gnome-disk-utility is panicky and not to be trusted, unless 
> > you like buying lots of hard drives.  It doesn't just look at 
> > "WHEN_FAILED", but has its own criteria such as nonzero 
> > Reallocated_Event_Count, which is fairly normal for a modern drive 
> > that has been in use for a while.  A nonzero Current_Pending_Sector 
> > or Offline_Uncorrectable is bad, as it means data loss, though 
> > not general drive failure.  I recommend enabling Automatic Offline 
> > Testing with `smartctl -o on /dev/sdx` (for each sdx), which will 
> > do a surface scan every few hours, giving the best chance to repair 
> > or recover any sectors that are going bad.
> > 
> 
> Will `smartctl -o on /dev/sdx` (for each sdx) fix the nonzero 
> Reallocated_Event_Count issue on RAID arrays in a non-destructive
> way? 

No.  Nor for non-RAID either.  It doesn't "fix" Reallocated_Event_Count
-- rather, its purpose is to make Reallocated_Event_Count go up faster: 
as soon as a sector starts to go bad it will be reallocated if it is 
still readable, and the sooner it is caught, the more likely it can 
still be read.  A non-zero 
Reallocated_Event_Count is not a problem.  Whatever says it is a 
problem is the real problem.  Fix that instead.

Non-zero Current_Pending_Sector is a problem, but RAID should be fixing 
that already.  I don't know, but I think that enabling Automatic 
Offline Testing should cause any uncorrectable sectors to be noticed 
and fixed sooner by RAID.
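
As a rough sketch of the distinction above, the following filter reads 
`smartctl -A` output and reports only attributes with a non-"-" 
WHEN_FAILED column, or a nonzero raw Current_Pending_Sector or 
Offline_Uncorrectable.  It deliberately ignores Reallocated_Event_Count.  
The function name `check_attrs` is made up for illustration, and the 
sketch assumes the usual ten-column ATA attribute table with the raw 
value in column 10; check it against your own output before relying 
on it.

```shell
#!/bin/sh
# check_attrs (hypothetical helper): read `smartctl -A` output on stdin
# and print only the attributes worth worrying about.
check_attrs() {
    awk '
        # Attribute rows start with a numeric ID; the header row does not.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            if ($9 != "-")
                # WHEN_FAILED is set: the drive itself reports a failure.
                print "FAILED: " $2 " (" $9 ")"
            else if (($2 == "Current_Pending_Sector" ||
                      $2 == "Offline_Uncorrectable") && $10 + 0 > 0)
                # Unreadable sectors: data loss, not general drive failure.
                print "DATA AT RISK: " $2 " = " $10
        }
    '
}

# Usage, as root (device names are an assumption):
#   for d in /dev/sd?; do echo "== $d"; smartctl -A "$d" | check_attrs; done
```

No output for a disk means nothing in that table matched either 
condition.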

>   Do you have to use the /dev/sdx devices or the /dev/md devices?
 ...

Automatic Offline Testing must be enabled on an actual ATA hard disk, 
not on a virtual device such as dm or md.  See `man smartctl`.
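
A minimal sketch of doing that for several disks at once, skipping md 
and dm devices and partitions.  The helper names and the `SMARTCTL` 
override variable are invented here for illustration, and the 
`/dev/sdX` name pattern is an assumption about how your disks are 
named.

```shell
#!/bin/sh
# is_real_disk (hypothetical): succeed only for whole-disk names such as
# /dev/sda or /dev/sdaa; partitions, md and dm devices are rejected.
is_real_disk() {
    case "$1" in
        /dev/sd[a-z]|/dev/sd[a-z][a-z]) return 0 ;;
        *) return 1 ;;
    esac
}

# enable_offline_testing (hypothetical): run `smartctl -o on` on each
# argument that looks like a whole disk.  Set SMARTCTL="echo smartctl"
# to print the commands instead of running them.
enable_offline_testing() {
    for dev in "$@"; do
        if is_real_disk "$dev"; then
            ${SMARTCTL:-smartctl} -o on "$dev"
        else
            echo "skipping $dev: not a whole physical disk" >&2
        fi
    done
}

# Usage, as root:
#   enable_offline_testing /dev/sda /dev/sdb /dev/md0
```

With the dry-run setting, only the lines for the real disks are 
printed, which is an easy way to check the list before touching 
anything.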

-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>