Hi all, on a new drive, installed about 3 months ago ... fsck shows:
Pass 5: Checking group summary information
[ 117.650425] ata5.00: exception Emask 0x0 SAct 0xc0 SErr 0x0 action 0x0
[ 117.650678] ata5.00: irq_stat 0x40000008
[ 117.650840] ata5.00: failed command: READ FPDMA QUEUED
[ 117.651523] ata5.00: cmd 60/80:30:68:08:40/00:00:be:00:00/40 tag 6 ncq 65536 in
[ 117.651523]          res 41/40:00:b0:08:40/00:00:be:00:00/40 Emask 0x409 (media error) <F>
[ 117.652872] ata5.00: status: { DRDY ERR }
[ 117.653547] ata5.00: error: { UNC }
[ 117.655604] ata5.00: configured for UDMA/133
[ 117.655825] sd 4:0:0:0: [sdb]
[ 117.655991] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 117.656631] sd 4:0:0:0: [sdb]
[ 117.657288] Sense Key : Medium Error [current] [descriptor]
[ 117.657966] Descriptor sense data with sense descriptors (in hex):
[ 117.658652]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 117.659381]         be 40 08 b0
[ 117.660065] sd 4:0:0:0: [sdb]
[ 117.660703] Add. Sense: Unrecovered read error - auto reallocate failed
[ 117.661349] sd 4:0:0:0: [sdb] CDB:
[ 117.661981] Read(10): 28 00 be 40 08 68 00 00 80 00
[ 117.662667] blk_update_request: I/O error, dev sdb, sector 3191867568
[ 117.663323] ata5: EH complete
[ 129.095402] ata5.00: exception Emask 0x0 SAct 0x3000000 SErr 0x0 action 0x0
[ 129.095663] ata5.00: irq_stat 0x40000008
[ 129.095826] ata5.00: failed command: READ FPDMA QUEUED
[ 129.096434] ata5.00: cmd 60/00:c0:e8:08:00/01:00:c2:00:00/40 tag 24 ncq 131072 in
[ 129.096434]          res 41/40:00:90:09:00/00:00:c2:00:00/40 Emask 0x409 (media error) <F>
[ 129.097727] ata5.00: status: { DRDY ERR }
[ 129.098393] ata5.00: error: { UNC }
[ 129.100528] ata5.00: configured for UDMA/133
[ 129.100768] sd 4:0:0:0: [sdb]
[ 129.100935] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 129.101530] sd 4:0:0:0: [sdb]
[ 129.102128] Sense Key : Medium Error [current] [descriptor]
[ 129.102728] Descriptor sense data with sense descriptors (in hex):
[ 129.103349]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 129.103998]         c2 00 09 90
[ 129.104645] sd 4:0:0:0: [sdb]
[ 129.105280] Add. Sense: Unrecovered read error - auto reallocate failed
[ 129.105911] sd 4:0:0:0: [sdb] CDB:
[ 129.106567] Read(10): 28 00 c2 00 08 e8 00 01 00 00
[ 129.107271] blk_update_request: I/O error, dev sdb, sector 3254782352
[ 129.107954] ata5: EH complete
So, I am puzzled: how were the spare sectors consumed so quickly that automatic sector forwarding ran out of spare sectors? So quickly on a brand new drive?? When someone buys a brand new drive, how many spare sectors is it guaranteed to have? ZERO? ONE?? How many?
Run smartctl --all /dev/sdX and see how many sectors are reallocated.
In my experience, how many spare sectors there are depends on the
size and brand.
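For example, to pull just the relevant counters out of that output (assuming the drive is /dev/sdb, as in the log above):

smartctl -A /dev/sdb | grep -E 'Reallocated|Current_Pending|Offline_Uncorrectable'

A raw value above 0 for Reallocated_Sector_Ct means spare sectors have already been consumed; Current_Pending_Sector counts sectors the drive can't read but hasn't remapped yet.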
On Thu, Mar 12, 2015 at 2:29 PM, Joe Zeff joe@zeff.us wrote:
On 03/12/2015 12:20 PM, jd1008 wrote:
So, I am puzzled: how were the spare sectors consumed so quickly that automatic sector forwarding ran out of spare sectors? So quickly on a brand new drive??
This is what warranties are for.
Install the smartmontools. Run smartctl -a on the drive. It may tell you something. Has the disk ever been dropped? Is the power-on time in SMART too high compared to when you bought it (have they sold you a second-hand drive)?
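On Fedora that would be something along these lines (smartmontools is the package name; Power_On_Hours is the attribute to compare against the purchase date):

yum install smartmontools
smartctl -a /dev/sdb | grep -i power_on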
Sent from my Sony Xperia™ smartphone
---- jd1008 wrote ----
Hi all, on a new drive, installed about 3 months ago ... fsck shows:
[...]
But... but now I have to back up what I can and send it in for warranty.
My point was more about how I could run out of spare sectors when I had no reports of hard sector errors prior to this fsck.
On 03/12/2015 01:29 PM, Joe Zeff wrote:
On 03/12/2015 12:20 PM, jd1008 wrote:
So, I am puzzled: how were the spare sectors consumed so quickly that automatic sector forwarding ran out of spare sectors? So quickly on a brand new drive??
This is what warranties are for.
--
On 03/12/2015 01:37 PM, Roger Heflin wrote:
smartctl --all /dev/sdX
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
So, when the raw value is 0 for both of these counts, why were the bad sectors not automatically reallocated?
On 03/12/2015 01:52 PM, jd1008 wrote:
But... but now I have to back up what I can and send it in for warranty.
True, but isn't that better than not having the warranty at all? And, with some of the diagnostics others have suggested, you might not need to send it back at all. Still, if you don't already have a backup, you'd better get one RSN.
On Thu, Mar 12, 2015 at 1:20 PM, jd1008 jd1008@gmail.com wrote:
[ 117.660065] sd 4:0:0:0: [sdb]
[ 117.660703] Add. Sense: Unrecovered read error - auto reallocate failed
What is sdb used for? Single drive or is it in some kind of raid?
This is a URE with failed reallocation, which means the data and checksum for this sector mismatch and the drive's ECC can't recover it. Since the drive can't reconstruct the data, the data can't be moved, so it just stays there. The problem isn't corrected until the sector is overwritten.
What do you get for: smarctl -l scterc /dev/sdb
So, I am puzzled: how were the spare sectors consumed so quickly that automatic sector forwarding ran out of spare sectors?
The data on the sector is just bad. The firmware apparently doesn't think the signal encoding on that sector is weak or otherwise indicates the sector itself may be bad, like a surface defect. I'm guessing it's getting deterministically bad data and can't fix it. If it were a partial read, that might suggest a bad sector, at which point it'd be flagged as pending reallocation. This is conditional: upon write, the drive firmware determines whether the error is transient or persistent; if it's persistent, it'll use a reserve sector (reassigning the same LBA to a different physical sector).
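To illustrate that write-triggers-reallocation behavior: once everything recoverable is backed up, deliberately overwriting the failing LBA is what gives the firmware its chance to remap. A sketch only, using the sector number from the dmesg output above and assuming 512-byte logical sectors (check smartctl -i first); this destroys whatever was in that sector:

hdparm --read-sector 3191867568 /dev/sdb   # confirm the sector still errors
dd if=/dev/zero of=/dev/sdb bs=512 seek=3191867568 count=1 oflag=direct
smartctl -A /dev/sdb | grep -E 'Reallocated|Pending'   # 197 should drop; 5 may rise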
So quickly on a brand new drive?? When someone buys a brand new drive, how many spare sectors is it guaranteed to have? ZERO? ONE?? How many?
Depends on the drive model, use case, and manufacturer. For any enterprise SATA or SAS drive, it's a swap-out for just one of these. For consumer drives, it's stick a wet finger in the air. I think most manufacturers would accommodate you, but arguably the drive is functioning normally - so far. But you're right: it's brand new, and a sector read error on a brand new drive is unexpected.
On Thu, Mar 12, 2015 at 3:28 PM, Chris Murphy lists@colorremedies.com wrote:
What do you get for: smarctl -l scterc /dev/sdb
smartctl -l scterc /dev/sdb
Helps to not have typos, and this command likes that additional t.
On 03/12/2015 04:03 PM, jd1008 wrote:
--
On 03/12/2015 01:37 PM, Roger Heflin wrote:
smartctl --all /dev/sdX
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
You omitted an important parameter, 197 Current_Pending_Sector. The drive cannot reallocate on a read error. Reallocation has to wait until the sector is next written. The sector will remain in the "Pending" state until that occurs.
On 03/12/2015 03:28 PM, Chris Murphy wrote:
smarctl -l scterc /dev/sdb
# smartctl -l scterc /dev/sdb
smartctl 6.2 2014-07-16 r3952 [x86_64-linux-3.18.8-201.fc21.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
On 03/12/2015 03:33 PM, Robert Nichols wrote:
On 03/12/2015 04:03 PM, jd1008 wrote:
[...]
You omitted an important parameter, 197 Current_Pending_Sector. The drive cannot reallocate on a read error. Reallocation has to wait until the sector is next written. The sector will remain in the "Pending" state until that occurs.
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 15
is that bad or not so bad?
On Thu, Mar 12, 2015 at 3:37 PM, jd1008 jd1008@gmail.com wrote:
On 03/12/2015 03:28 PM, Chris Murphy wrote:
smarctl -l scterc /dev/sdb
# smartctl -l scterc /dev/sdb
smartctl 6.2 2014-07-16 r3952 [x86_64-linux-3.18.8-201.fc21.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control command not supported
Ok so it's a drive not designed for fast recoveries, which is fine. I've seen some people use NAS drives as standalone drives, thinking they're better. But those have fast error recoveries: they pop a read error within ~7 seconds, expecting RAID to recover the data from a mirror or rebuild it from parity. So if you had such a drive, there's a way to increase that timer and maybe recover the data on that sector.
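For a drive that does support SCT ERC, a sketch of that recovery attempt would be: remove the drive's error-recovery time limit, and raise the kernel's 30-second command timer so the long retry isn't cut short by a link reset (values illustrative):

smartctl -l scterc,0,0 /dev/sdb            # 0 disables the ERC time limit
echo 180 > /sys/block/sdb/device/timeout   # let the drive try for up to 3 minutes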
Anyway, seeing as this happens on an fsck, that means filesystem metadata is affected, and if e2fsck -f doesn't fix it then the fs is toast. I honestly would immediately remount it ro and back it up before forcing an fsck. An fsck ought to fail gracefully and not make things worse, but...
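For that backup, GNU ddrescue handles read errors far better than plain dd. A sketch, assuming a spare target disk /dev/sdc at least as large as sdb (device names illustrative; double-check them, as the copy overwrites the target):

ddrescue -f -n /dev/sdb /dev/sdc rescue.map      # first pass: copy everything readable, skip bad areas
ddrescue -d -f -r3 /dev/sdb /dev/sdc rescue.map  # then retry the bad areas up to 3 times, direct I/O

The mapfile lets the second run resume exactly at the sectors that failed.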
On Thu, Mar 12, 2015 at 3:58 PM, Chris Murphy lists@colorremedies.com wrote:
Anyway, seeing as this happens on an fsck, that means filesystem metadata is affected, and if e2fsck -f doesn't fix it then the fs is toast. I honestly would immediately remount it ro and back it up before forcing an fsck. An fsck ought to fail gracefully and not make things worse, but...
For what it's worth, in the same scenario: by default, Btrfs on an HDD uses duplicate metadata on a single drive. So the same read error would get reported, but then there'd be something like this:
[48466.824770] BTRFS: checksum error at logical 20971520 on dev /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
[48466.829900] BTRFS: checksum error at logical 20971520 on dev /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
[48466.834944] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[48466.853589] BTRFS: fixed up error at logical 20971520 on dev /dev/sdb
That's actually a corrupt sector rather than a read error, but the result is the same: Btrfs uses the duplicate copy and fixes the bad one automatically. Life continues. The same goes for data if there's a mirror copy (or raid56 parity, since kernel 3.19).
For single-copy data, it'll show a path to the affected file. For a single copy of fs metadata, well, bad things happen too; chances are the fs will abruptly go forced read-only. For the most part Btrfs has been decently graceful lately if it successfully mounts read only.
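To make Btrfs verify everything and do that fix-up proactively, rather than waiting for a read to trip over the bad copy, there's scrub; /mnt/data below is a stand-in for wherever the filesystem is mounted:

btrfs scrub start /mnt/data    # read and verify every data and metadata checksum
btrfs scrub status /mnt/data   # errors found and corrected so far
btrfs device stats /mnt/data   # running per-device error counters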
I still keep multiple backups though. Ultimately I trust nothing but many copies.
Unless you have the drive under raid, that means 15 sectors cannot be read and you have lost at least some data.
A drive normally will not move sectors that it cannot successfully read.
You may be able to copy the data off the disk, but when trying this you may find a lot more bad sectors than the 15 currently pending, and so may find you've lost more data than 15 bad sectors would indicate.
If it's under mdadm raid, the raid sees the bad sector error, reconstructs the data, and writes it back, causing the disk to remap the sector.
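That rewrite can be forced for a whole md array with a check pass; md0 is a stand-in for the real array name:

echo check > /sys/block/md0/md/sync_action   # read every member sector, repair read errors from redundancy
cat /sys/block/md0/md/mismatch_cnt           # inspect the result afterwards

Fedora's mdadm package also ships a raid-check script meant to be run on a schedule for exactly this.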
For the future, my only suggestion is to use raid and/or force a read of the whole disk at least weekly so that the disk will detect "weak" sectors and either rewrite or remap them as needed. On all of my critical disks I read and/or run smartctl -t long at least weekly. From playing with a disk that is going bad, it appears that doing the -t long daily might keep ahead of sectors going bad, but that means the test runs the disk for several hours a day (though it can still be accessed for normal usage).
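One way to schedule those -t long runs is to let smartd do it instead of cron; an /etc/smartd.conf line like this (the schedule field means: long self-test, any month/day, weekday 7 = Sunday, at 03:00) should work:

/dev/sdb -a -s L/../../7/03

then: systemctl enable smartd && systemctl start smartd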
On Thu, Mar 12, 2015 at 4:40 PM, jd1008 jd1008@gmail.com wrote:
[...]
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 15
is that bad or not so bad?
On Thu, Mar 12, 2015 at 7:45 PM, Roger Heflin rogerheflin@gmail.com wrote:
Unless you have the drive under raid, that means 15 sectors cannot be read and you have lost at least some data.
A drive normally will not move sectors that it cannot successfully read.
You may be able to copy the data off the disk, but when trying this you may find a lot more bad sectors than the 15 currently pending, and so may find you've lost more data than 15 bad sectors would indicate.
Yes, could be true.
For a drive with 15 sectors pending reallocation so far at 3 months old: back it up and get it replaced under warranty. This fits the profile of a drive in early failure. Most drives don't do this, but the drives in a batch that do exhibit early failure tend to do so right around 3 months.
Depending on what parts of the fs metadata are affected, 15 sectors could possibly make a good portion of the volume unrecoverable.
For the future, my only suggestion is to use raid and/or force a read of the whole disk at least weekly so that the disk will detect "weak" sectors and either rewrite or remap them as needed. On all of my critical disks I read and/or run smartctl -t long at least weekly. From playing with a disk that is going bad, it appears that doing the -t long daily might keep ahead of sectors going bad, but that means the test runs the disk for several hours a day (though it can still be accessed for normal usage).
I'd say weekly is aggressive, but reasonable for an important array. The scrubs are probably more valuable, because any explicit read errors get fixed up, whereas that's not necessarily the case for smartctl -t long. I have several drives in enclosures with crap bridge chipsets, so smartctl doesn't work; they've never had SMART testing, and none have had read or write errors. I'd say if it takes even weekly extended offline SMART tests to avoid problems, the drive is bad.
Understood this thread's use case isn't raid, but it bears repeating: by default, consumer drives like this one will often attempt much longer recoveries, beyond the SCSI command timer value of 30 seconds. Such recoveries get thwarted by the ensuing link reset, so the problem sector(s) aren't revealed and don't get fixed. This is a problem for md, LVM, ZFS and Btrfs raid, so the configuration has to be correct. This comes up monthly on linux-raid@ (sometimes several times per week), and a high percentage of the time all data on the raid is lost in the ensuing recovery.
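For anyone reading this who does run such drives in raid, the usual correct configuration is a sketch like the following: cap the drive's recovery time below the kernel's timer if it supports SCT ERC, or raise the kernel's timer if (like this WD) it doesn't. The values are the commonly suggested ones, not gospel:

smartctl -l scterc,70,70 /dev/sdX          # drive gives up after 7.0 s; raid repairs the sector
echo 180 > /sys/block/sdX/device/timeout   # non-ERC drive: give it up to 180 s before a link reset

Note that ERC settings on most drives don't survive a power cycle, so they need reapplying at boot (e.g. from a udev rule or boot script).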
Backups!
On 03/12/2015 03:58 PM, Chris Murphy wrote:
[...]
Ok so it's a drive not designed for fast recoveries, which is fine. I've seen some people use NAS drives as standalone drives, thinking they're better. But those have fast error recoveries: they pop a read error within ~7 seconds, expecting RAID to recover the data from a mirror or rebuild it from parity. So if you had such a drive, there's a way to increase that timer and maybe recover the data on that sector.
Anyway, seeing as this happens on an fsck, that means filesystem metadata is affected, and if e2fsck -f doesn't fix it then the fs is toast. I honestly would immediately remount it ro and back it up before forcing an fsck. An fsck ought to fail gracefully and not make things worse, but...
Thanx. The drive is a 2TB WD drive, so it is not a NAS drive. Will back up what I can, clear it, and send it in for RMA.
On 03/12/2015 07:45 PM, Roger Heflin wrote:
[...]
For the future, my only suggestion is to use raid and/or force a read of the whole disk at least weekly so that the disk will detect "weak" sectors and either rewrite or remap them as needed. On all of my critical disks I read and/or run smartctl -t long at least weekly. From playing with a disk that is going bad, it appears that doing the -t long daily might keep ahead of sectors going bad, but that means the test runs the disk for several hours a day (though it can still be accessed for normal usage).
Thanx! Good advice with the smartctl -t long.