Hi,
I'm trying to rescue some data from a box where the hard disc *may* have failed. I had a few drive not ready notices, and needed to do fsck -y to fix all the numerous errors it complained about when trying to just do fsck, but then the problem originated from a wonky 4 pin power plug on the drive. So I don't know if the drive really failed, or it just got contents scrambled. If there really isn't anything wrong with the drive, I might as well keep using it. Are those "drive not ready" messages always about real drive hardware faults?
Now, if I try to boot up normally, there are a lot of protests about not being able to find the users needed by some services (root, named, and so on...), or any user at all. So it looks like some important files got borked, but I'm not sure what files are used for authentication these days. The obvious password and group files look okay. Any clues on this front? I've got other very similarly set up boxes I could copy files from.
This PC, unfortunately, I let set itself up with volume groups, but my other PCs I set up with ordinary ext3 partitions. I could probably stick the drive into another PC without too many problems.
If I boot up with init=/bin/sh on the end of the kernel line, the computer does boot, and I seem to be able to read the data that I want to save (mostly local webpages, and a few server configuration files).
Tim wrote:
I'm trying to rescue some data from a box where the hard disc *may* have failed.
You know about the smartctl -l error /dev/hda command, don't you? That will show you if the *disk* thinks it needs replacing.
I had a few drive not ready notices, and needed to do fsck -y to fix all the numerous errors it complained about when trying to just do fsck, but then the problem originated from a wonky 4 pin power plug on the drive. So I don't know if the drive really failed, or it just got contents scrambled. If there really isn't anything wrong with the drive, I might as well keep using it. Are those "drive not ready" messages always about real drive hardware faults?
You've got good backups anyway, right? (And you say you can get at the data you need). It sounds like you really need to reinstall this box.
Tim:
I'm trying to rescue some data from a box where the hard disc *may* have failed.
James Wilkinson:
You know about the smartctl -l error /dev/hda command, don't you? That will show you if the *disk* thinks it needs replacing.
Not that particular command, though I am a bit familiar playing with smart data and hard drives. I had the smart daemon running, and configured for that drive, but it hadn't produced any warnings. That's one thing that made me suspicious more of a filing system error than hardware error. Likewise, the BIOS is supposed to check on smart details.
You've got good backups anyway, right?
Not of the couple of bits that I want. Isn't that always the way? All the valuable stuff is backed up in several places.
It sounds like you really need to reinstall this box.
Yes, that's the intention. It's really only a test box, hence it doesn't have much on it. Unfortunately, it's the last lot of tests that I want to resume working on, and they're on this box (webserving experiments). If I can't, it just means more typing than I want to, to resume, and remembering how far I got. The sort of backups I've done for this, are along the lines of httpd.conf.backup duplicates in a few places, just not a recent enough one on another box, unfortunately. Nothing really more valuable than that. Though this latest minor disaster is turning into a useful thing in itself - learning to recover a box, but not when it's vital. Probably the best time to learn how to do that.
Current status: I had a bit of advice from a friend to do a "mount -o remount,rw /" on it (since / was read-only despite the mount list showing it as read & write). That allowed me to log in and swap a password file in /etc/ so it'd boot *almost* normally. Now I'll NFS off a few files I'd like to keep (working nicely), see what else I can break or fix (not done this yet), then start afresh so it's reliable (tomorrow, maybe).
So far I've found: Something screwy with the passwords, so that not all users' details are as they should be. The file looks fine, so there might be a non-printing character somewhere that I can't see. GDM won't let me log in, but that's no drama.
The hardest thing is going to be a re-install, simply because it has no floppy or CD-ROM, I'll have to pull the box out of the shelf and open it up. That's about as difficult as it gets. Easy peasy... I'm so glad this isn't Windows, and that I'm not faced with registry repairs!
This isn't the worst I've had to fix. I had a friend get so pissed off with his Windows box that he threw it about on the concrete floor. After I resoldered a chip that flew off his graphics card, resoldered another on his motherboard that was close to falling off, and reseated a few cards (which, either this, or the chips that came off having dry joints, would have been the probable cause of the failure), the machine worked fine, surprisingly (despite all the physical abuse; though he did, at least, have the sense to remove the hard drive before the torture). I was quite amused.
On Wed, 2006-09-20 at 17:57 +0100, James Wilkinson wrote:
You know about the smartctl -l error /dev/hda command, don't you? That will show you if the *disk* thinks it needs replacing.
Output below. Do you think this is an avoidable, or unavoidable error?
This is a test box, made from spare parts at zero cost. Yep, the whole thing was scrounged. Beyond the annoyance of interrupting some experiments, 100% reliability isn't the absolute top priority. It's there to try things out. My other PCs are a different matter, I do spend money on them. I'm not averse to partitioning around a duff part of the drive to plod on and see how it goes.
I am amused by the statement that the device was active or idle when the error occurred. What else could it be? ;-) It sounds a lot like, "he was breathing, just before he died..."
------------------------------
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 541 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 541 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 bd 34 9b e1  Error: UNC 8 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 bd 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 08 b5 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 08 ad 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 08 a5 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 40 05 36 9b e1 00      00:20:09.950  READ DMA

Error 540 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 33 9b e1  Error: UNC 8 sectors at LBA = 0x019b3367 = 26948455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 65 33 9b e1 00      00:20:01.500  READ DMA
  ca 00 30 1d 33 9b e1 00      00:20:01.500  WRITE DMA
  c8 00 08 5d 33 9b e1 00      00:20:01.500  READ DMA
  c8 00 40 c5 34 9b e1 00      00:20:01.500  READ DMA
  c8 00 48 bd 34 9b e1 00      00:19:55.650  READ DMA

Error 539 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 48 bd 34 9b e1  Error: UNC 72 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 48 bd 34 9b e1 00      00:19:55.650  READ DMA
  c8 00 50 b5 34 9b e1 00      00:19:50.100  READ DMA
  c8 00 58 ad 34 9b e1 00      00:19:44.000  READ DMA
  c8 00 60 a5 34 9b e1 00      00:19:38.300  READ DMA
  c8 00 68 9d 34 9b e1 00      00:19:32.700  READ DMA

Error 538 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 50 bd 34 9b e1  Error: UNC 80 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 50 b5 34 9b e1 00      00:19:50.100  READ DMA
  c8 00 58 ad 34 9b e1 00      00:19:44.000  READ DMA
  c8 00 60 a5 34 9b e1 00      00:19:38.300  READ DMA
  c8 00 68 9d 34 9b e1 00      00:19:32.700  READ DMA
  c8 00 70 95 34 9b e1 00      00:19:26.900  READ DMA

Error 537 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 bd 34 9b e1  Error: UNC 88 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 ad 34 9b e1 00      00:19:44.000  READ DMA
  c8 00 60 a5 34 9b e1 00      00:19:38.300  READ DMA
  c8 00 68 9d 34 9b e1 00      00:19:32.700  READ DMA
  c8 00 70 95 34 9b e1 00      00:19:26.900  READ DMA
  c8 00 78 8d 34 9b e1 00      00:19:22.100  READ DMA
I wrote:
You know about the smartctl -l error /dev/hda command, don't you? That will show you if the *disk* thinks it needs replacing.
Tim wrote:
Output below. Do you think this is an avoidable, or unavoidable error?
<snip, and the next bit is liberally snipped, too>
Error 541 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  40 51 08 bd 34 9b e1  Error: UNC 8 sectors at LBA = 0x019b34bd = 26948797
Error 539 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  40 51 48 bd 34 9b e1  Error: UNC 72 sectors at LBA = 0x019b34bd = 26948797
Error 538 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  40 51 50 bd 34 9b e1  Error: UNC 80 sectors at LBA = 0x019b34bd = 26948797
Error 537 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  40 51 58 bd 34 9b e1  Error: UNC 88 sectors at LBA = 0x019b34bd = 26948797
And from the smartctl man page:
  UNC: UNCorrectable Error in Data
In other words, it's been trying to read data from that logical block address, and failing.
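For anyone wondering how the register dump relates to the LBA that smartctl prints, here is a minimal Python sketch of the 28-bit LBA decoding. The function name is made up for illustration; the register values are copied from Errors 541 and 540 in Tim's log. It assumes plain LBA28 addressing, which is what the READ DMA (c8) commands in the log use.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit logical block address.

    SN carries LBA bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH bits 24-27. The high nibble of DH holds mode and
    drive-select flags, so it is masked off.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers after Error 541: SN=bd CL=34 CH=9b DH=e1
lba_541 = lba28_from_registers(0xBD, 0x34, 0x9B, 0xE1)
# Registers after Error 540: SN=67 CL=33 CH=9b DH=e1
lba_540 = lba28_from_registers(0x67, 0x33, 0x9B, 0xE1)

print(f"{lba_541:#010x} = {lba_541}")  # 0x019b34bd = 26948797
print(f"{lba_540:#010x} = {lba_540}")  # 0x019b3367 = 26948455
```

Both results match the "LBA =" figures smartctl printed next to the register rows.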
Since you say that this is a "scratch" test PC, I'd do a smartctl -H /dev/hda (which is probably what I should have told you in the first place). If that says "PASSED", I'd do a combination of dd if=/dev/zero of=/dev/hda to blank the drive (that should remap all the bad sectors), and dd if=/dev/hda of=/dev/null to read them all back. Then check for any more errors. If you don't get any, I'd trust the drive for testing purposes.
Obviously, get any data you care about off the drive, first!
Those dd commands will probably take several hours.
Hope this helps,
James.
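As an illustration of what the read-back pass does (dd if=/dev/hda of=/dev/null), here is a rough Python sketch that reads a file or device front to back and reports where the first read error occurs. It is not a replacement for dd; the function name and the temporary-file demo are assumptions made so the example runs without touching a real drive. On a real block device the kernel reads in larger chunks, so the reported offset is only approximate.

```python
import os
import tempfile


def scan_readable(path, block_size=512):
    """Read `path` front to back, like `dd if=path of=/dev/null`.

    Returns (bytes_read, error_offset). error_offset is None when the end
    was reached cleanly, otherwise the byte offset at which a read failed.
    """
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                chunk = os.read(fd, block_size)
            except OSError:
                # An I/O error here is what dd reports as "Input/output error".
                return offset, offset
            if not chunk:
                return offset, None
            offset += len(chunk)
    finally:
        os.close(fd)


# Demo on an ordinary temporary file standing in for the drive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 4096)
demo_result = scan_readable(tmp.name)
os.unlink(tmp.name)
print(demo_result)  # (4096, None)
```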
On Thu, 2006-09-21 at 13:24 +0100, James Wilkinson wrote:
Since you say that this is a "scratch" test PC, I'd do a smartctl -H /dev/hda (which is probably what I should have told you in the first place). If that says "PASSED", I'd do a combination of dd if=/dev/zero of=/dev/hda to blank the drive (that should remap all the bad sectors), and dd if=/dev/hda of=/dev/null to read them all back. Then check for any more errors. If you don't get any, I'd trust the drive for testing purposes.
Those dd commands will probably take several hours.
Um, no actually. Under an hour, 'twas only a 15 gig drive. I did a quick test of seeing what would happen if I did dd to the drive that the computer had booted from. Watched it working, went away, came back to a black screen (about what I expected). Then I took the drive out and put it into another box; results below.
[root@box ~]# dd if=/dev/zero of=/dev/hdc
dd: writing to `/dev/hdc': Input/output error
23953097+0 records in
23953096+0 records out
Above is as I'd expect. Below, seems about right (same output count as input, same number as worked above, and an error). I'm not sure at what stage a bad block gets mapped out of use. In the past, I'd have done that while prepping/formatting a drive.
[root@box ~]# dd if=/dev/hdc of=/dev/null
dd: reading `/dev/hdc': Input/output error
23952864+0 records in
23952864+0 records out
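One way to read those record counts, sketched in Python: with dd's default 512-byte block size, "records out" is also the index of the first sector that was not transferred, so it can be compared directly with the LBA that the SMART logs in this thread report (23953101). The 4 KiB page size and the read-ahead explanation are assumptions about how the kernel buffered the I/O, not something the output proves.

```python
SECTOR_BYTES = 512   # dd's default block size; also this drive's sector size
PAGE_SECTORS = 8     # assumption: buffered I/O in 4 KiB pages (8 sectors)

records_written = 23953096  # "records out" when the write pass failed
records_read = 23952864     # "records out" when the read pass failed
smart_lba = 23953101        # LBA_of_first_error in the SMART logs

# dd counts whole 512-byte records, so the count is also the index of the
# first sector that was NOT transferred.
write_fail_offset = records_written * SECTOR_BYTES

# Which 4 KiB page does each position fall in?
write_fail_page = records_written // PAGE_SECTORS
smart_lba_page = smart_lba // PAGE_SECTORS

print("write stopped at byte", write_fail_offset)
print("same page as SMART's bad LBA:", write_fail_page == smart_lba_page)
print("read stopped", smart_lba - records_read, "sectors before the bad LBA")
```

Under those assumptions the write pass died on the very page that contains the sector SMART complains about, and the read pass gave up a couple of hundred sectors earlier, which would fit the kernel reading ahead in larger chunks. Note that both passes stopped at the first error, so the rest of the drive was never written or read.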
Then I did a "smartctl -t short /dev/hdc", looked at the results, then a "smartctl -t long /dev/hdc"; results after both are further below. The basic health check showed fine:
[root@box ~]# smartctl -H /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
So that looks okay. But the "smartctl -a /dev/hdc" is less inspiring:
[root@box ~]# smartctl -a /dev/hdc
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD153AA-00BAA0
Serial Number:    WD-WMA2L2483801
Firmware Version: 10.09K11
User Capacity:    15,393,079,296 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   4
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 23 19:13:27 2006 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (1040) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  14) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   197   098   051    Pre-fail  Always       -       45
  3 Spin_Up_Time            0x0006   109   104   000    Old_age   Always       -       1150
  4 Start_Stop_Count        0x0012   098   098   040    Old_age   Always       -       2524
  5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       5
  9 Power_On_Hours          0x0012   065   065   000    Old_age   Always       -       26136
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   098   098   000    Old_age   Always       -       2297
196 Reallocated_Event_Count 0x0012   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0012   100   253   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 572 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 572 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 18 cd 7e 6d e1  Error: UNC 24 sectors at LBA = 0x016d7ecd = 23953101
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 18 c8 7e 6d e1 00      00:57:28.650  READ DMA
  c8 00 20 c0 7e 6d e1 00      00:57:22.800  READ DMA
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA

Error 571 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 20 cd 7e 6d e1  Error: UNC 32 sectors at LBA = 0x016d7ecd = 23953101
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 c0 7e 6d e1 00      00:57:22.800  READ DMA
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA

Error 570 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 28 cd 7e 6d e1  Error: UNC 40 sectors at LBA = 0x016d7ecd = 23953101
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 28 b8 7e 6d e1 00      00:57:16.700  READ DMA
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA

Error 569 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 30 cd 7e 6d e1  Error: UNC 48 sectors at LBA = 0x016d7ecd = 23953101
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 30 b0 7e 6d e1 00      00:57:10.750  READ DMA
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA
  c8 00 50 90 7e 6d e1 00      00:56:47.350  READ DMA

Error 568 occurred at disk power-on lifetime: 1013 hours (42 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 38 cd 7e 6d e1  Error: UNC 56 sectors at LBA = 0x016d7ecd = 23953101
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 38 a8 7e 6d e1 00      00:57:04.750  READ DMA
  c8 00 40 a0 7e 6d e1 00      00:56:58.850  READ DMA
  c8 00 48 98 7e 6d e1 00      00:56:53.050  READ DMA
  c8 00 50 90 7e 6d e1 00      00:56:47.350  READ DMA
  c8 00 58 88 7e 6d e1 00      00:56:41.550  READ DMA
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      1014         23953101
# 2  Short offline       Completed: read failure       90%      1013         23953101
# 3  Extended offline    Completed: read failure       30%       990         23953101
# 4  Short offline       Completed without error       00%       990         -
# 5  Short offline       Completed without error       00%       327         -
# 6  Short offline       Completed without error       00%        93         -
# 7  Short captive       Completed without error       00%         0         -
Device does not support Selective Self Tests/Logging
Tests #1 & #2 are after the dd experiment, the rest are from before. A quick perusal of information doesn't give me any clues as to what the remaining and lifetime columns mean. Predicted failure time, uptime?
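On the question of what the two columns mean: as far as the smartctl documentation goes, "Remaining" is the percentage of the test that was still to run when it finished or aborted, and "LifeTime(hours)" is the drive's power-on hours at the time of the test, so neither is a failure prediction. Here is a small Python sketch that parses a few rows of the log above under that reading. The function and field names are invented for illustration, and the rows are re-typed with two or more spaces between columns.

```python
import re

# Rows copied from the self-test log above; columns separated by 2+ spaces.
SELFTEST_LOG = """\
# 1  Extended offline  Completed: read failure  90%  1014  23953101
# 2  Short offline     Completed: read failure  90%  1013  23953101
# 3  Extended offline  Completed: read failure  30%  990   23953101
# 4  Short offline     Completed without error  00%  990   -
"""


def parse_selftest_log(text):
    """Turn self-test log rows into dictionaries with typed fields."""
    rows = []
    for line in text.splitlines():
        num, desc, status, remaining, hours, lba = re.split(r"\s{2,}", line.strip())
        rows.append({
            "num": int(num.lstrip("# ")),
            "test": desc,
            "status": status,
            "remaining_pct": int(remaining.rstrip("%")),
            "power_on_hours": int(hours),
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return rows


rows = parse_selftest_log(SELFTEST_LOG)
for r in rows:
    done = 100 - r["remaining_pct"]
    print(f'#{r["num"]}: {r["test"]}, {done}% done at '
          f'{r["power_on_hours"]} power-on hours, '
          f'first bad LBA {r["first_error_lba"]}')
```

Read that way, tests #1 and #2 both gave up about 10% of the way through, and every failed test stopped at the same LBA.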
Tim wrote:
[root@box ~]# dd if=/dev/zero of=/dev/hdc
dd: writing to `/dev/hdc': Input/output error
23953097+0 records in
23953096+0 records out
Whatever else smart says, *that* means either that the drive isn't capable of remapping bad sectors on write (bad), or that the drive has run out of spare sectors to remap in (worse).
The usual advice here is that the drive *is* failing. I suspect you've spent more time on it than the drive justifies.
James.
Late follow-up to this old thread:
Tim:
[root@box ~]# dd if=/dev/zero of=/dev/hdc
dd: writing to `/dev/hdc': Input/output error
23953097+0 records in
23953096+0 records out
James Wilkinson:
Whatever else smart says, *that* means either that the drive isn't capable of remapping bad sectors on write (bad), or that the drive has run out of spare sectors to remap in (worse).
The usual advice here is that the drive *is* failing. I suspect you've spent more time on it than the drive justifies.
Yes, and no. I count it as valuable information about dealing with a failing system. Thankfully nothing on that system was really valuable, but it's useful knowledge for something else in the future. And I'm not about to spend somewhere around a whole day's potential wages on a new hard drive.
That system was reformatted, FC4 re-installed (no problems noticed), a webserver set up and left running. And with uptime reporting "22:02:12 up 23 days, 1:00" right now, it is still going strong with no errors being logged. It'll stay the experimental test box, though. It's not one I'd use for anything valuable - it's too much of a dog in numerous respects.
Thus far, I think this box has probably had about a maximum of $20 spent on it to build it (it was made from all the left-over parts, with the addition of a few cables, and a bit of RAM swapsies). The time spent on it is another matter, but then that time could have been spent learning on that box, or any other, just the same. ;-)
At 4:32 PM +0930 9/21/06, Tim wrote:
On Wed, 2006-09-20 at 17:57 +0100, James Wilkinson wrote:
You know about the smartctl -l error /dev/hda command, don't you? That will show you if the *disk* thinks it needs replacing.
Output below. Do you think this is an avoidable, or unavoidable error?
...
The output shows bad sectors, so it isn't (just) a filesystem error.
It would be nice to know whether the disk thinks it is failed, or just has some bad sectors. "smartctl -a /dev/hda" will tell everything.
Tim:
Output [snipped]. Do you think this is an avoidable, or unavoidable error?
Tony Nelson:
The output shows bad sectors, so it isn't (just) a filesystem error.
It would be nice to know whether the disk thinks it is failed, or just has some bad sectors. "smartctl -a /dev/hda" will tell everything.
Here you go:
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD153AA-00BAA0
Serial Number:    WD-WMA2L2483801
Firmware Version: 10.09K11
User Capacity:    15,393,079,296 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   4
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Sep 22 08:28:06 2006 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1040) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  14) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   098   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0006   109   104   000    Old_age   Always       -       1150
  4 Start_Stop_Count        0x0012   098   098   040    Old_age   Always       -       2523
  5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       5
  9 Power_On_Hours          0x0012   065   065   000    Old_age   Always       -       26112
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   098   098   000    Old_age   Always       -       2296
196 Reallocated_Event_Count 0x0012   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   100   253   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
SMART Error Log Version: 1
ATA Error Count: 541 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 541 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 bd 34 9b e1  Error: UNC 8 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 bd 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 08 b5 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 08 ad 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 08 a5 34 9b e1 00      00:20:09.950  READ DMA
  c8 00 40 05 36 9b e1 00      00:20:09.950  READ DMA

Error 540 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 67 33 9b e1  Error: UNC 8 sectors at LBA = 0x019b3367 = 26948455
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 65 33 9b e1 00      00:20:01.500  READ DMA
  ca 00 30 1d 33 9b e1 00      00:20:01.500  WRITE DMA
  c8 00 08 5d 33 9b e1 00      00:20:01.500  READ DMA
  c8 00 40 c5 34 9b e1 00      00:20:01.500  READ DMA
  c8 00 48 bd 34 9b e1 00      00:19:55.650  READ DMA

Error 539 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 48 bd 34 9b e1  Error: UNC 72 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 48 bd 34 9b e1 00      00:19:55.650  READ DMA
  c8 00 50 b5 34 9b e1 00      00:19:50.100  READ DMA
  c8 00 58 ad 34 9b e1 00      00:19:44.000  READ DMA
  c8 00 60 a5 34 9b e1 00      00:19:38.300  READ DMA
  c8 00 68 9d 34 9b e1 00      00:19:32.700  READ DMA

Error 538 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 50 bd 34 9b e1  Error: UNC 80 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 50 b5 34 9b e1 00      00:19:50.100  READ DMA
  c8 00 58 ad 34 9b e1 00      00:19:44.000  READ DMA
  c8 00 60 a5 34 9b e1 00      00:19:38.300  READ DMA
  c8 00 68 9d 34 9b e1 00      00:19:32.700  READ DMA
  c8 00 70 95 34 9b e1 00      00:19:26.900  READ DMA

Error 537 occurred at disk power-on lifetime: 960 hours (40 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 58 bd 34 9b e1  Error: UNC 88 sectors at LBA = 0x019b34bd = 26948797
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 58 ad 34 9b e1 00      00:19:44.000  READ DMA
  c8 00 60 a5 34 9b e1 00      00:19:38.300  READ DMA
  c8 00 68 9d 34 9b e1 00      00:19:32.700  READ DMA
  c8 00 70 95 34 9b e1 00      00:19:26.900  READ DMA
  c8 00 78 8d 34 9b e1 00      00:19:22.100  READ DMA
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       327         -
# 2  Short offline       Completed without error       00%        93         -
# 3  Short captive       Completed without error       00%         0         -
Device does not support Selective Self Tests/Logging
Supplemental information after doing a long self test (i.e. smartctl -t long /dev/hda):
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       30%       990         23953101
# 2  Short offline       Completed without error       00%       990         -
# 3  Short offline       Completed without error       00%       327         -
# 4  Short offline       Completed without error       00%        93         -
# 5  Short captive       Completed without error       00%         0         -
At 6:32 PM +0930 9/22/06, Tim wrote:
Supplemental information after doing a long self test (i.e. smartctl -t long /dev/hda):
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       30%       990         23953101
This is expected as long as there are uncorrected bad blocks. You need to write to them somehow in order to fix them.
At 6:03 PM +0930 9/22/06, Tim wrote: ...
Tony Nelson:
The output shows bad sectors, so it isn't (just) a filesystem error.
It would be nice to know whether the disk thinks it is failed, or just has some bad sectors. "smartctl -a /dev/hda" will tell everything.
Here you go:
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
This is the main thing.
...
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
You might want to turn this on with "smartctl -o on /dev/hda". It may make the disk a bit slower. It may scrub bad sectors before they become uncorrectable.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   098   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0006   109   104   000    Old_age   Always       -       1150
  4 Start_Stop_Count        0x0012   098   098   040    Old_age   Always       -       2523
  5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       5
  9 Power_On_Hours          0x0012   065   065   000    Old_age   Always       -       26112
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0012   098   098   000    Old_age   Always       -       2296
196 Reallocated_Event_Count 0x0012   196   196   000    Old_age   Always       -       4
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   100   253   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
...
The drive doesn't think it has failed, as the WHEN_FAILED column is blank.
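To make that check concrete, here is a rough Python sketch of how the table can be read: an attribute counts as failed when its normalized VALUE has dropped to or below THRESH, WORST at or below THRESH means it failed at some point in the past, and WHEN_FAILED would say so. The function name is made up, only four rows from Tim's table are included, and the parser assumes RAW_VALUE contains no spaces, which holds for this output but not for every drive.

```python
# A few rows from the attribute table above (second smartctl -a dump).
ATTRIBUTES = """\
  1 Raw_Read_Error_Rate     0x000b   200   098   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       5
197 Current_Pending_Sector  0x0012   200   199   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   100   253   000    Old_age   Always       -       0
"""


def flagged_attributes(table):
    """Return names of attributes that are failing now or failed in the past."""
    flagged = []
    for line in table.splitlines():
        (_id, name, _flag, value, worst, thresh,
         _type, _updated, when_failed, _raw) = line.split()
        failing_now = int(value) <= int(thresh)
        failed_before = int(worst) <= int(thresh)
        if failing_now or failed_before or when_failed != "-":
            flagged.append(name)
    return flagged


flagged = flagged_attributes(ATTRIBUTES)
print(flagged)  # []
```

An empty list agrees with the blank WHEN_FAILED column. It says nothing about the raw counts, such as the 5 reallocated sectors, which are still worth watching.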