Understanding SMART logs

Suvayu Ali fatkasuvayu+linux at gmail.com
Sun Aug 15 17:05:05 UTC 2010


Hi everyone,

Some background:
Recently my RAM went bad, and I realised it too late. In the last few
days before I noticed, my desktop had crashed more than once. Yesterday
I received the replacement RAM modules from the RMA. After installing
them and turning on my machine, I noticed errors like this:

Device: /dev/sdb [SAT], 172 Currently unreadable (pending) sectors

I see that these errors started around the time my desktop started
crashing, before I found the faulty RAM.

The problem:
On subsequent boots the system failed to come up, with fsck complaining
about disk read errors during a forced disk check. Every time, I was
dropped to a read-only shell to troubleshoot, so I ran fsck on all my
partitions and found errors on /home. The error messages said "inode has
deleted or empty entries clear", "unlinked inode entries" and so on.
Since I was on a read-only partition I couldn't save them to a file (I
guess paper would have worked :-p). When fsck prompted me to fix the
errors, I answered yes.

On the next reboot my system booted properly, but I had lost some very
important data. All the missing directories were ones that fsck had
complained about. I restored whatever I could from some backups.

To confirm that this was a one-off incident and that my disk hasn't gone
bad, I ran a SMART extended self-test (the drive is only a few months
old):
# smartctl -t long /dev/sdb
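
Once the test finishes, its result ends up in the drive's self-test log,
which `smartctl -l selftest /dev/sdb` prints on its own. As a rough aid
for reading that log, here is a minimal Python sketch that splits one
log row into fields. The function name `parse_selftest_row` and the
512-byte sector size are my own assumptions for illustration; the sample
row is the one from the output further down.

```python
import re

# Layout of one data row in the "SMART Self-test log" table printed by smartctl.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)

SECTOR_BYTES = 512  # assumption: the drive uses 512-byte logical sectors


def parse_selftest_row(line):
    """Return the fields of one self-test log row as a dict, or None."""
    # Drop any mail quoting ("> ") in front of the row before matching.
    match = ROW.match(line.lstrip("> "))
    if match is None:
        return None
    row = match.groupdict()
    row["remaining"] = int(row["remaining"])
    row["hours"] = int(row["hours"])
    # "-" in the last column means no failing LBA was recorded.
    row["lba"] = None if row["lba"] == "-" else int(row["lba"])
    return row


sample = "# 1  Extended offline    Completed: read failure       90%      1393         1106820646"
row = parse_selftest_row(sample)
if row and row["lba"] is not None:
    print("test:", row["test"], "| status:", row["status"])
    print("remaining:", row["remaining"], "% | power-on hours:", row["hours"])
    print("first bad LBA:", row["lba"], "| byte offset:", row["lba"] * SECTOR_BYTES)
```

If I read the columns correctly, "Remaining" is the share of the test
that was still left when it stopped, so a read failure at 90% would mean
the test aborted early rather than scanning the whole disk.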

But now that the test has finished, I can't make sense of the output:

> # smartctl -a /dev/sdb
> smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Caviar Black family
> Device Model:     WDC WD1001FALS-00E8B0
> Serial Number:    WD-WMATV5966482
> Firmware Version: 05.00K05
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Sat Aug 14 19:37:26 2010 PDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x84)	Offline data collection activity
> 					was suspended by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.
> Self-test execution status:      ( 121)	The previous self-test completed having
> 					the read element of the test failed.
> Total time to complete Offline
> data collection: 		 (18000) seconds.
> Offline data collection
> capabilities: 			 (0x7b) SMART execute Offline immediate.
> 					Auto Offline data collection on/off support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					General Purpose Logging supported.
> Short self-test routine
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 ( 208) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (   5) minutes.
> SCT capabilities: 	       (0x3037)	SCT Status supported.
> 					SCT Feature Control supported.
> 					SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>   3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1158
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1403
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       18
> 194 Temperature_Celsius     0x0022   112   107   000    Old_age   Always       -       38
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure       90%      1393         1106820646
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.

All the values in the table above seem larger than their thresholds, yet
the report says PASSED. I'm not clear how to interpret this. Could
someone help? Thanks a lot in advance.
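
To make the question concrete, here is the comparison I am trying to
reason about, as a minimal Python sketch. It assumes, based on my
reading of the smartctl man page, that an attribute counts as failed
only when its normalized VALUE is at or below THRESH (so larger is
better). The helper names `parse_attributes` and `failing` are mine, and
the sample rows are copied from the table above.

```python
SAMPLE = """\
>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
"""


def parse_attributes(text):
    """Collect the attribute rows of `smartctl -a` output as dicts."""
    rows = []
    for line in text.splitlines():
        # Drop mail quoting ("> ") and split on whitespace.
        fields = line.lstrip("> ").split()
        # Attribute rows have 10 columns: a numeric ID, a name, a hex flag, ...
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        })
    return rows


def failing(rows):
    """Rows whose normalized value is at or below the threshold (assumed rule)."""
    return [r for r in rows if r["value"] <= r["thresh"]]


rows = parse_attributes(SAMPLE)
bad_ids = {r["id"] for r in failing(rows)}
for r in rows:
    state = "FAILING" if r["id"] in bad_ids else "ok"
    print(r["id"], r["name"], "value", r["value"], "thresh", r["thresh"],
          "raw", r["raw"], state)
```

If that assumption holds, none of the attributes has failed, which would
match PASSED. It still leaves me unsure what to make of the raw value of
172 for Current_Pending_Sector and the read failure in the self-test
log.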

-- 
Suvayu

Open source is the future. It sets us free.

