understanding smart logs
jjmckenzie51 at earthlink.net
Sun Aug 15 17:17:12 UTC 2010
Suvayu Ali wrote:
> Hi everyone,
> Some background:
> Recently my RAM went bad, and I realised it too late. Over the last
> few days my desktop had crashed more than once. Yesterday I received
> the replacement RAM from RMA. After installing it and turning on my
> machine I noticed errors like these,
> Device: /dev/sdb [SAT], 172 Currently unreadable (pending) sectors
> And I see that the errors started at about the time my desktop
> began crashing, before I found the faulty RAM.
> The problem:
> On subsequent attempts the system failed to boot, with fsck complaining
> about disk read errors during a forced disk check. I was dropped to a
> read-only shell to troubleshoot every time, so I ran fsck on all my
> partitions and found
> errors on my /home. The error messages said "inode has deleted or empty
> entries clear", "unlinked inode entries" and so on. Since I was on a
> read-only partition I couldn't save them to a file (I guess paper would
> have worked :-p). When prompted by fsck to fix the errors, I answered yes.
> On a reboot, my system booted properly but I had lost some very
> important data. All the missing directories were the ones which fsck had
> complained about. I restored whatever I could from some backups.
> To confirm that this was a one-off incident and that my disk hasn't gone
> bad, I ran a SMART test (this drive is only a few months old):
> # smartctl -t long /dev/sdb
> But after the test I can't understand the output of the logs,
>> # smartctl -a /dev/sdb
>> smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
>> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>> === START OF INFORMATION SECTION ===
>> Model Family: Western Digital Caviar Black family
>> Device Model: WDC WD1001FALS-00E8B0
>> Serial Number: WD-WMATV5966482
>> Firmware Version: 05.00K05
>> User Capacity: 1,000,204,886,016 bytes
>> Device is: In smartctl database [for details use: -P show]
>> ATA Version is: 8
>> ATA Standard is: Exact ATA specification draft version not indicated
>> Local Time is: Sat Aug 14 19:37:26 2010 PDT
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>> General SMART Values:
>> Offline data collection status: (0x84) Offline data collection activity
>> was suspended by an interrupting command from host.
>> Auto Offline Data Collection: Enabled.
>> Self-test execution status: ( 121) The previous self-test completed having
>> the read element of the test failed.
>> Total time to complete Offline
>> data collection: (18000) seconds.
>> Offline data collection
>> capabilities: (0x7b) SMART execute Offline immediate.
>> Auto Offline data collection on/off support.
>> Suspend Offline collection upon new
>> Offline surface scan supported.
>> Self-test supported.
>> Conveyance Self-test supported.
>> Selective Self-test supported.
>> SMART capabilities: (0x0003) Saves SMART data before entering
>> power-saving mode.
>> Supports SMART auto save timer.
>> Error logging capability: (0x01) Error logging supported.
>> General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: ( 2) minutes.
>> Extended self-test routine
>> recommended polling time: ( 208) minutes.
>> Conveyance self-test routine
>> recommended polling time: ( 5) minutes.
>> SCT capabilities: (0x3037) SCT Status supported.
>> SCT Feature Control supported.
>> SCT Data Table supported.
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>>   3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1158
>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1403
>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
>> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       18
>> 194 Temperature_Celsius     0x0022   112   107   000    Old_age   Always       -       38
>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>> SMART Error Log Version: 1
>> No Errors Logged
>> SMART Self-test log structure revision number 1
>> Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
>> # 1 Extended offline Completed: read failure 90% 1393 1106820646
>> SMART Selective self-test log data structure revision number 1
>> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
>> 1 0 0 Not_testing
>> 2 0 0 Not_testing
>> 3 0 0 Not_testing
>> 4 0 0 Not_testing
>> 5 0 0 Not_testing
>> Selective self-test flags (0x0):
>> After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
> All the values in the table above seem larger than the threshold. But
> the report says PASSED. I'm not clear how to interpret this. Could
> someone help? Thanks a lot in advance.
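On why the report says PASSED: as far as I know, the normalized VALUE
column counts down toward THRESH, so a value above the threshold is the
healthy case, and an attribute is only flagged once VALUE drops to THRESH
or below. The number to worry about here is the raw value of
Current_Pending_Sector (172). A minimal Python sketch of that reading,
assuming the column layout of the table quoted above. The names
parse_attributes, failing and pending_sectors are hypothetical helpers,
not part of smartmontools, and skipping attributes whose threshold is
zero is my assumption:

```python
# Sample rows copied from the smartctl -a output quoted above.
SAMPLE = """\
>>   1 Raw_Read_Error_Rate     0x002f 199 199 051 Pre-fail Always  - 1354
>>   5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
>> 197 Current_Pending_Sector  0x0032 199 199 000 Old_age  Always  - 172
>> 198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 0
"""


def parse_attributes(text):
    """Parse attribute rows; lines that are not attribute rows are skipped."""
    rows = []
    for line in text.splitlines():
        # Drop e-mail quote markers and leading whitespace before splitting.
        fields = line.lstrip("> ").split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized value, higher is better
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],          # vendor-specific raw counter
        })
    return rows


def failing(rows):
    """Names of attributes whose normalized value is at or below threshold.

    Assumption: a threshold of 0 means the attribute has no failure limit.
    """
    return [r["name"] for r in rows
            if r["thresh"] > 0 and r["value"] <= r["thresh"]]


def pending_sectors(rows):
    """Raw count of attribute 197 (Current_Pending_Sector), or None."""
    for r in rows:
        if r["id"] == 197:
            return int(r["raw"])
    return None


rows = parse_attributes(SAMPLE)
print("failing attributes:", failing(rows))
print("pending sectors:", pending_sectors(rows))
```

With the rows above, no attribute is at or below its threshold, which is
consistent with the PASSED line, while the pending-sector count is still
172. So PASSED here does not mean the disk is free of unreadable sectors.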
Do you have a good backup of this drive? It looks like it needs to be
retested in a different machine and, if it fails again, replaced.
I had a drive that exhibited the same behavior, and it eventually failed.
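The self-test result quoted above already points the same way. As I
understand the ATA self-test execution status byte, the upper four bits
carry the result code and the lower four bits carry the remaining part of
the test in tenths, so the reported value 121 (0x79) would decode to code
7 with 90% remaining. That matches both the "read element of the test
failed" message and the 90% shown in the self-test log. A small Python
sketch of that decoding. The function name is hypothetical and the code
table is deliberately partial, covering only the codes I am sure of:

```python
# Partial table of result codes (upper four bits of the status byte).
RESULT_CODES = {
    0: "completed without error, or no test has been run",
    7: "completed, read element of the test failed",
    15: "self-test in progress",
}


def decode_self_test_status(status_byte):
    """Split the status byte into (result code, percent remaining)."""
    code = (status_byte >> 4) & 0x0F          # upper nibble: result code
    remaining = (status_byte & 0x0F) * 10     # lower nibble: tenths remaining
    return code, remaining


code, remaining = decode_self_test_status(121)
print(code, RESULT_CODES.get(code, "code not in this partial table"))
print(f"{remaining}% of the test remaining")
```

A test that stops on a read failure with most of the surface still
unscanned is another reason to rerun it on a known-good machine before
trusting the drive again.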