I quite stupidly unplugged my computer today when my foot got tangled in the power strip. On reboot, I was notified:
Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Smartctl tells me the problem is at LBA 2014551336 (this is a 2TB disk).
Following the info here
http://kaivanov.blogspot.com/2010/09/fixing-disk-problems-under-linux-with.h...
I'm not able to determine what file might be sitting on this block.
tune2fs -l /dev/sda5 | grep Block
Block count:              472552448
Block size:               4096
Blocks per group:         32768

[root@sds-desk-2 ~]# debugfs
debugfs 1.42.3 (14-May-2012)
debugfs:  open /dev/sda5
debugfs:  icheck 235992741
Block   Inode number
235992741       8
debugfs:  ncheck 8
Inode   Pathname
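(For the record, the block number fed to icheck came from the usual LBA-to-filesystem-block conversion. The partition start sector shown here is back-computed for illustration, not taken from my fdisk output:)

# SMART reports LBAs in 512-byte sectors; this filesystem uses 4096-byte blocks.
# fs_block = (LBA - partition_start_sector) * 512 / block_size
echo $(( (2014551336 - 126609408) * 512 / 4096 ))    # prints 235992741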
This suggests that the problem is not in a currently extant file.
I'm now trying the suggestion of writing from /dev/zero to a file in /home while in single-user mode. Given that I have 1.5TB to fill, this could take a while.
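(Roughly this, with an arbitrary filename; the idea is that every free sector gets written once, then the file is deleted:)

dd if=/dev/zero of=/home/zerofill bs=1M   # runs until the filesystem is full
sync                                      # make sure everything hits the platters
rm /home/zerofill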
Here's the question: If I just reformat and restore /home from a recent backup, will the disk automatically deal with the sectors pending reallocation?
On 10/26/2012 05:33 PM, Reindl Harald wrote:
Am 27.10.2012 01:29, schrieb Steven Stern:
Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Smartctl tells me the problem is at LBA 2014551336 (this is a 2TB disk)
sounds bad
Here's the question: If I just reformat and restore /home from a recent backup, will the disk automatically deal with the sectors pending reallocation?
REALLY: throw away the disk and do not restore backups to it
be thankful that you got warned instead of losing the disk from one moment to the next, and replace it
+1 Erase the disk thoroughly (using hdparm) and send it in for warranty repair/replacement if it's still under warranty.
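Something along these lines, assuming the drive isn't security-frozen; /dev/sdX and the password are placeholders:

# ATA Secure Erase makes the drive firmware rewrite every sector,
# remapping any it cannot rewrite in place. Check the Security
# section of the identify output first ("not frozen", "not locked"):
hdparm -I /dev/sdX
hdparm --user-master u --security-set-pass Eins /dev/sdX
hdparm --user-master u --security-erase Eins /dev/sdX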
On 10/26/2012 06:29 PM, Steven Stern wrote:
Here's the question: If I just reformat and restore /home from a recent backup, will the disk automatically deal with the sectors pending reallocation?
The pending sectors will be reallocated only when they are written to. Unless invoked with the very slow "-cc" option, mke{2,3,4}fs will write only to the metadata areas, and restoring /home will of course not write to any remaining free space.
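If you'd rather not zero 1.5TB, you can also rewrite just the reported sectors. A sketch, using the LBA from your smartd notification; this destroys the 512 bytes at that address, so do it with the filesystem unmounted or from a rescue environment:

# force the drive to rewrite (and remap, if necessary) one specific sector
hdparm --write-sector 2014551336 --yes-i-know-what-i-am-doing /dev/sda
# equivalent with dd; repeat with seek=LBA for each of the 8 sectors
dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=2014551336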
On 10/26/2012 07:51 PM, Robert Nichols wrote:
The pending sectors will be reallocated only when they are written to. Unless invoked with the very slow "-cc" option, mke{2,3,4}fs will write only to the metadata areas, and restoring /home will of course not write to any remaining free space.
The gods of stupidity have been good to me. I had a backup of everything on /home that wasn't stored in the cloud, so I cleared all the files on /home, restored from my backups, and smartctl reports the sectors have been moved; 0 sectors are pending reallocation.
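(Verified with something like:)

smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'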
I got lucky.
On 10/26/2012 07:33 PM, Reindl Harald wrote:
REALLY: throw away the disk and do not restore backups to it
be thankful that you got warned instead of losing the disk from one moment to the next, and replace it
I would have to agree with this, as I had the same thing happen to me. The final result was a completely new disk in the machine; there was no other alternative. BUT I was able to restore a lot of data, music, files, folders, and pictures because I was always making backups of my system on a weekly basis. Word to the wise: get hold of some external media and use it to hold backups of your system.
EGO II
On Sat, 27 Oct 2012 01:33:55 +0200 Reindl Harald h.reindl@thelounge.net wrote:
REALLY: throw away the disk and do not restore backups to it
Rubbish.
If you powered down a drive which was writing, then in obscure cases some drives will fail to complete the sector. The next time you write to that logical block, it will either rewrite it successfully and fix the problem or it will map another sector to it.
Quite different from physical damage. In almost every case, in fact, a journaled file system will just keep on working fine, because any problem with data still 'live' that was being written will be fixed by the journal replay.
Alan
Once, long ago--actually, on Mon, Oct 29, 2012 at 06:19:59AM CDT--Alan Cox (alan@lxorguk.ukuu.org.uk) said:
Rubbish.
Beg to differ. Remember, drives have reserved storage to remap bad sectors *before* you ever see a bad sector at the interface. So, by the time you think you're seeing only 8 bad sectors, you've actually burned through the reserved sectors--many more have failed than you realize.
Look at the SMART report on the drive; that will tell you how many sectors have been reallocated and how many spare sectors are left in the G-list (user microcode remapping).
G'luck, -- Dave Ihnat dihnat@dminet.com
On 10/29/2012 07:53 AM, Dave Ihnat wrote:
Look at the SMART report on the drive; that will tell you how many sectors have been reallocated and how many spare sectors are left in the G-list (user microcode remapping).
So... am I OK or in deep s**t? Reallocated Sector Count is 0. Good?
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model:     ST2000DM001-9YN164
Serial Number:    W1E0KZKN
LU WWN Device Id: 5 000c50 049dbcb2d
Firmware Version: CC4B
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 29 07:58:59 2012 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 575) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 225) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       168095056
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       15042880
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1689
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       30
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   056   045    Old_age   Always       -       36 (Min/Max 33/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   091   091   000    Old_age   Always       -       19452
194 Temperature_Celsius     0x0022   036   044   000    Old_age   Always       -       36 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       29038273889717
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       17196577503767
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       43502398465569
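(I'll also run a full self-test to be sure; something like:)

smartctl -t long /dev/sda       # starts the extended offline self-test in the drive
# ...come back after the estimated 225 minutes, then:
smartctl -l selftest /dev/sda   # self-test log
smartctl -H /dev/sda            # overall-health self-assessment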
Beg to differ. Remember, drives have reserved storage to remap bad sectors *before* you ever see a bad sector at the interface. So, by the time you think you're seeing only 8 bad sectors, you've actually burned through the reserved sectors--many more have failed than you realize.
Completely wrong.
You see a bad sector on some devices after a sudden power failure because the sector was partially written when the power failed. It's not bad in any permanent sense; it just has incomplete data on it, so it cannot be read back properly until it is rewritten.
Similarly, by the way, a sector reporting as "bad" on a read does not mean you've used up all the spare sectors or anything of the sort; it means you've got a sector which failed to read.
Some drives also (quite validly) report the number of sectors that failed during production of the device, which completely throws off most of the rather weak drive-reporting software.
What really matters is whether the drive itself, on its SMART self-assessment, thinks it is likely to be failing, not some joke heuristic.
Bad sectors *can* be a sign of problems, be they drive failure, mechanical mounting problems (e.g. vibration), poor cooling, poor power and so on, but not in this case. The warnings also often look different: you see a continually rising number of bad sectors.
Alan
Alan Cox alan@lxorguk.ukuu.org.uk writes:
What really matters is whether the drive itself, on its SMART self-assessment, thinks it is likely to be failing, not some joke heuristic.
And even then, if the 5-year-old Google numbers are still valid, you have a 1/3 chance of being surprised.
"The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive."
ref: http://storagemojo.com/2007/02/19/googles-disk-failure-experience/
-wolfgang
Once, long ago--actually, on Mon, Oct 29, 2012 at 08:57:12AM CDT--Alan Cox (alan@lxorguk.ukuu.org.uk) said:
Completely wrong.
With all due respect, no, it isn't. It was simplistic, because I didn't want to go into an entire tutorial on the mailing list. Go read about P-lists, G-lists, and the variation in how SMART information is generated and used. This is generally because there was no industry-wide agreement on exactly what should be reported, and how, for SMART, as well as some proprietary protectionism from drive manufacturers.
You see a bad sector on some devices after a sudden power failure because the sector was partially written when the power failed. It's not bad in any permanent sense; it just has incomplete data on it, so it cannot be read back properly until it is rewritten.
Some devices, and some drivers. It depends on how the driver interpreted the failure, how it was reported by that disk and that controller, etc. This particular scenario accounts for a vanishingly small fraction of the total errors a drive encounters in its life.
Similarly, by the way, a sector reporting as "bad" on a read does not mean you've used up all the spare sectors or anything of the sort; it means you've got a sector which failed to read.
Report of a single bad read depends very much on the driver. Most driver authors will try to re-read a failed sector in the driver (no, I haven't gone to look at the current Linux drivers; I was writing drivers for Unix a long, long time ago, however, and we wouldn't report a bad read until something like three internal retries). However, the firmware will do remapping of sectors _it_ determines are failing, and the probability is that if you are seeing errors consistently, it's because it can't remap.
Some drives also (quite validly) report the number of sectors that failed during production of the device, which completely throws off most of the rather weak drive-reporting software.
Not at all valid. There are two lists in the drive--the P-list, which contains sectors remapped during drive production, and the G-list, which contains remaps instantiated by the drive firmware. IF the firmware reports remapped sectors from both lists, it should most assuredly segregate them. Granted, it's quite possible the author(s) of drive reporting software have conflated or misinterpreted the data.
What really matters is whether the drive itself, on its SMART self-assessment, thinks it is likely to be failing, not some joke heuristic.
*Shrug*. The actual number of sectors in the G-list isn't a joke heuristic. Whether or not you can get that out of the firmware is vendor-specific. The bottom line is that there _is_ firmware in the drive that attempts to prevent sectors going bad from being used by preemptively remapping to spare sectors, and this will, in general, mean that you won't know about them until and unless you look at the stats on sector remapping. Whether you, as an administrator, go out to try to find that, or the driver proactively does it, or whatever, by the time you're seeing real, persistent, and increasing numbers of errors at the OS interface, you've probably been having problems internally for a while. It's not worth it today to go to excessive lengths to repair a disk reporting errors.
Cheers, -- Dave Ihnat dihnat@dminet.com
On 10/29/2012 11:30 AM, Dave Ihnat wrote:
However, the firmware will do remapping of sectors _it_ determines are failing, and the probability is that if you are seeing errors consistently, it's because it can't remap.
A sector that is unreadable even after retries CANNOT be remapped until it is written, and any attempts to read it MUST return an I/O error until that remapping has occurred. If the drive were to go ahead and immediately remap that unreadable sector, what data would you suggest that it return when the sector is read? All-zeros with no indication of error is NOT acceptable.
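You can watch this happen from userspace, for what it's worth; using the LBA from earlier in the thread:

hdparm --read-sector 2014551336 /dev/sda   # fails with an I/O error while the sector is pending
# after the sector has been rewritten (as discussed earlier in the
# thread), the same read succeeds and the pending count drops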
Once, long ago--actually, on Mon, Oct 29, 2012 at 12:59:44PM CDT--Robert Nichols (rnicholsNOSPAM@comcast.net) said:
A sector that is unreadable even after retries CANNOT be remapped until it is written, and any attempts to read it MUST return an I/O error until that remapping has occurred.
On a full failure, yes. The firmware should, however, detect sectors that are failing and remap them on an ongoing basis. You wouldn't get notifications about those, and they should be far more frequent than undetected full read failures.
If the drive were to go ahead and immediately remap that unreadable sector, what data would you suggest that it return when the sector is read? All-zeros with no indication of error is NOT acceptable.
Of course not, and just what the firmware should do with a full read failure of a previously unsuspected bad sector, I'm sure, has been the subject of design meetings at the various disk manufacturers. My suspicion is that such a full failure would have to be exempt from automatic remapping, resulting in reported failures before all of the available spare sectors are allocated.
I would, however, expect such a condition to be either a rare occurrence--due to physical damage/shock, unexpected power failure, etc.--making it a class of general "one-off" events, or part of an increasing cascade of detected predictive failures resulting in automatic remapping in the case of a failing disk.
And I suspect we've gotten much more deeply into the topic than I expect most of the list cares about.
Cheers, -- Dave Ihnat dihnat@dminet.com
On 10/29/2012 01:32 PM, Dave Ihnat wrote:
and just what the firmware should do with a full read failure of a previously unsuspected bad sector, I'm sure, has been the subject of design meetings at the various disk manufacturers.
They show up in the "Current pending sector" count in the SMART report, and are not an unusual occurrence at all.