understanding smart logs
fatkasuvayu+linux at gmail.com
Sun Aug 15 17:05:05 UTC 2010
Recently my RAM went bad, and I realised it too late. Over the last
few days my desktop had crashed more than once. Yesterday I received
the replacement RAM from RMA. After installing it and turning on my
machine, I noticed errors like this:
Device: /dev/sdb [SAT], 172 Currently unreadable (pending) sectors
I also see that the errors started around the time my desktop
started crashing, before I found the faulty RAM.
On subsequent boots the system failed to come up, with fsck complaining
about disk read errors during a forced disk check. I was dropped to a
read-only shell to troubleshoot every time, so I ran fsck on all my
partitions and found errors on my /home. The error messages said "inode
has deleted or empty entries clear", "unlinked inode entries" and so on.
Since I was on a read-only partition I couldn't save them to a file (I
guess paper would have worked :-p). When prompted by fsck to fix the
errors, I answered yes.
After a reboot, my system booted properly, but I had lost some very
important data. All the missing directories were the ones fsck had
complained about. I restored whatever I could from some backups.
To confirm that this was a one-off incident and that my disk hasn't gone
bad, I ran a SMART long self-test (the drive is only a few months old):
# smartctl -t long /dev/sdb
But after the test, I can't make sense of the output:
> # smartctl -a /dev/sdb
> smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
> === START OF INFORMATION SECTION ===
> Model Family: Western Digital Caviar Black family
> Device Model: WDC WD1001FALS-00E8B0
> Serial Number: WD-WMATV5966482
> Firmware Version: 05.00K05
> User Capacity: 1,000,204,886,016 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 8
> ATA Standard is: Exact ATA specification draft version not indicated
> Local Time is: Sat Aug 14 19:37:26 2010 PDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 121) The previous self-test completed having
> the read element of the test failed.
> Total time to complete Offline
> data collection: (18000) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 208) minutes.
> Conveyance self-test routine
> recommended polling time: ( 5) minutes.
> SCT capabilities: (0x3037) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>   3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1158
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1403
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       18
> 194 Temperature_Celsius     0x0022   112   107   000    Old_age   Always       -       38
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
> SMART Error Log Version: 1
> No Errors Logged
> SMART Self-test log structure revision number 1
> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
> # 1 Extended offline Completed: read failure 90% 1393 1106820646
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
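The self-test log entry above can be pulled apart mechanically. This is a rough Python sketch, not anything smartctl provides: it reads the last three fields of the line (percentage remaining, lifetime hours, LBA of first error) and turns the LBA into a byte offset. The function name parse_selftest_entry and the 512-byte sector size are my own assumptions; the output above does not state the drive's sector size.

```python
import re

# The last three fields of a self-test log line:
# remaining percentage, lifetime hours, LBA of first error ("-" if none).
TAIL = re.compile(r"(\d+)%\s+(\d+)\s+(\d+|-)\s*$")


def parse_selftest_entry(line, sector_bytes=512):
    """Parse one self-test log line (hypothetical helper).

    sector_bytes=512 is an assumption; pass the real logical sector
    size if the drive reports a different one.
    """
    m = TAIL.search(line)
    if m is None:
        return None
    lba = None if m.group(3) == "-" else int(m.group(3))
    return {
        "remaining_percent": int(m.group(1)),
        "lifetime_hours": int(m.group(2)),
        "first_error_lba": lba,
        "byte_offset": None if lba is None else lba * sector_bytes,
    }


entry = parse_selftest_entry(
    "> # 1 Extended offline Completed: read failure 90% 1393 1106820646"
)
print(entry)
```

Read this way, the entry says the extended test stopped with 90% of the scan still remaining, at power-on hour 1393, on the first LBA it could not read.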
All the values in the table above seem larger than their thresholds, but
the report says PASSED. I'm not clear how to interpret this. Could
someone help? Thanks a lot in advance.
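
For anyone who wants to check the VALUE/THRESH comparison mechanically, here is a minimal Python sketch. As far as the smartmontools documentation describes it, an attribute counts as failed only when its normalized VALUE is less than or equal to THRESH, so a VALUE above the threshold is the healthy case. The names parse_attributes and failing are my own, and the regular expression assumes the column layout shown in the output above.

```python
import re

# One row of the attribute table, with an optional "> " quote prefix:
# ID, name, flag, VALUE, WORST, THRESH, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(?:>\s*)?(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return one dict per attribute row found in smartctl -a output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "type": m.group(6),
                "raw": int(m.group(9)),
            })
    return rows


def failing(rows):
    """Attributes whose normalized VALUE is at or below THRESH."""
    return [r for r in rows if r["value"] <= r["thresh"]]


# Three rows copied from the output above.
SAMPLE = """\
> 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
> 197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 172
> 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

rows = parse_attributes(SAMPLE)
for r in rows:
    print(r["name"], r["value"], r["thresh"], r["raw"])
print("failing:", [r["name"] for r in failing(rows)])
```

The comparison only looks at the normalized columns. The RAW_VALUE column (for example the 172 pending sectors) is carried along for reference and does not take part in the threshold check.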
Open source is the future. It sets us free.