FC5 S/W Raid Rebuilding to Infinity (and beyond!)

Tue Nov 21 13:42:49 UTC 2006

Sean Bruno wrote:
> On Mon, 2006-11-20 at 19:17 -0800, Sean Bruno wrote:
>> On Mon, 2006-11-20 at 07:56 -0800, Sean Bruno wrote:
>>> I had a disk failure recently and replaced the drive.  After
>>> partitioning and such I added my new drive into my Raid 1 and waited for
>>> the rebuild to complete.
>>>
>>> It's been running for about 36 hours trying to rebuild a ~140GB Raid 1,
>>> which seems a bit long to me.  
>>>
>>> What's even stranger is that a reboot causes the new disk to be
>>> completely removed from the Raid 1 set.  And I have to rebuild all of my
>>> partitions, not just the ~140GB.
>>>
>>> Here is 'cat /proc/mdstat' as it currently sits:
>>>
>>> ----
>>> [sean at home-desk ~]$ cat /proc/mdstat
>>> Personalities : [raid1]
>>> md0 : active raid1 sdb1[0] sda1[1]
>>>       1052160 blocks [2/2] [UU]
>>>
>>> md1 : active raid1 sdb2[0] sda2[1]
>>>       4192896 blocks [2/2] [UU]
>>>
>>> md2 : active raid1 sdb3[2] sda3[1]
>>>       151043008 blocks [2/1] [_U]
>>>       [==>..................]  recovery = 10.5% (15941760/151043008)
>>> finish=52.3min speed=42986K/sec
>>>
>>> unused devices: <none>
>>> ----
>>>
>>> Where md0 is /boot, md1 is swap and md2 is /  
>>>
>>> sdb is the new disk, sda is the running disk.  If I reboot the machine
>>> sdb disappears completely.
>>>
>>> Any ideas out there?
>>>
>>> Sean
>>>
>>>
>>>
>> I guess that this is being caused by a 'real' failure on /dev/sda:
>>
>> Nov 20 15:58:39 home-desk kernel: RAID1 conf printout:
>> Nov 20 15:58:39 home-desk kernel:  --- wd:1 rd:2
>> Nov 20 15:58:39 home-desk kernel:  disk 0, wo:1, o:1, dev:sdb3
>> Nov 20 15:58:39 home-desk kernel:  disk 1, wo:0, o:1, dev:sda3
>> Nov 20 15:58:39 home-desk kernel: RAID1 conf printout:
>> Nov 20 15:58:39 home-desk kernel:  --- wd:1 rd:2
>> Nov 20 15:58:39 home-desk kernel:  disk 1, wo:0, o:1, dev:sda3
>> Nov 20 15:58:39 home-desk kernel: RAID1 conf printout:
>> Nov 20 15:58:39 home-desk kernel:  --- wd:1 rd:2
>> Nov 20 15:58:39 home-desk kernel:  disk 0, wo:1, o:1, dev:sdb3
>> Nov 20 15:58:39 home-desk kernel:  disk 1, wo:0, o:1, dev:sda3
>> Nov 20 15:58:39 home-desk kernel: md: syncing RAID array md2
>> Nov 20 15:58:39 home-desk kernel: md: minimum _guaranteed_
>> reconstruction speed: 1000 KB/sec/disc.
>> Nov 20 15:58:39 home-desk kernel: md: using maximum available idle IO
>> bandwidth (but not more than 200000 KB/sec) for reconstruction.
>> Nov 20 15:58:39 home-desk kernel: md: using 128k window, over a total of
>> 151043008 blocks.
>>
>> Nov 20 16:58:01 home-desk kernel: ata1.00: exception Emask 0x0 SAct 0x0
>> SErr 0x0 action 0x0
>> Nov 20 16:58:01 home-desk kernel: ata1.00: (BMDMA stat 0x0)
>> Nov 20 16:58:01 home-desk kernel: ata1.00: tag 0 cmd 0x25 Emask 0x9 stat
>> 0x51 err 0x40 (media error)
>> Nov 20 16:58:01 home-desk kernel: ata1: EH complete
>> Nov 20 16:58:02 home-desk kernel: ata1.00: exception Emask 0x0 SAct 0x0
>> SErr 0x0 action 0x0
>> Nov 20 16:58:02 home-desk kernel: ata1.00: (BMDMA stat 0x0)
>> Nov 20 16:58:02 home-desk kernel: ata1.00: tag 0 cmd 0x25 Emask 0x9 stat
>> 0x51 err 0x40 (media error)
>> Nov 20 16:58:02 home-desk kernel: ata1: EH complete
>> Nov 20 16:58:03 home-desk kernel: ata1.00: exception Emask 0x0 SAct 0x0
>> SErr 0x0 action 0x0
>> Nov 20 16:58:03 home-desk kernel: ata1.00: (BMDMA stat 0x0)
>> Nov 20 16:58:03 home-desk kernel: ata1.00: tag 0 cmd 0x25 Emask 0x9 stat
>> 0x51 err 0x40 (media error)
>> Nov 20 16:58:03 home-desk kernel: ata1: EH complete
>> Nov 20 16:58:04 home-desk kernel: ata1.00: exception Emask 0x0 SAct 0x0
>> SErr 0x0 action 0x0
>> ...
>> Nov 20 16:59:01 home-desk kernel: sd 0:0:0:0: SCSI error: return code =
>> 0x08000002
>> Nov 20 16:59:01 home-desk kernel: sda: Current: sense key: Medium Error
>> Nov 20 16:59:01 home-desk kernel:     Additional sense: Unrecovered read
>> error - auto reallocate failed
>> Nov 20 16:59:01 home-desk kernel: end_request: I/O error, dev sda,
>> sector 307380301
>> Nov 20 16:59:01 home-desk kernel: ata1: EH complete
>> Nov 20 16:59:01 home-desk kernel: ata1.00: exception Emask 0x0 SAct 0x0
>> SErr 0x0 action 0x0
>> Nov 20 16:59:01 home-desk kernel: ata1.00: (BMDMA stat 0x0)
>> Nov 20 16:59:01 home-desk kernel: ata1.00: tag 0 cmd 0x25 Emask 0x9 stat
>> 0x51 err 0x40 (media error)
>> Nov 20 16:59:01 home-desk kernel: ata1: EH complete
>> Nov 20 16:59:01 home-desk kernel: SCSI device sda: 312581808 512-byte
>> hdwr sectors (160042 MB)
>> ...
>>
>> This repeats over-and-over-and-over throughout my logs.  How can I get
>> it to rebuild once and then stop? 
>>
>> Sean
>>
>>
> And finally (if responding to my own post wasn't annoying enough!), if I
> rebuild md0 and md1 (skipping md2 for now), then reboot the machine, the
> machine comes back up with all three devices as failed!
> 
> I start the rebuild on /dev/md0 and /dev/md1 thusly:
> 
> [root at home-desk ~]# mdadm --manage --add /dev/md0 /dev/sdb1
> mdadm: re-added /dev/sdb1
> [root at home-desk ~]# mdadm --manage --add /dev/md1 /dev/sdb2
> mdadm: re-added /dev/sdb2
> 
> Before reboot(cat /proc/mdstat):
> [root at home-desk ~]# cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdb1[0] sda1[1]
>       1052160 blocks [2/2] [UU]
> 
> md1 : active raid1 sdb2[0] sda2[1]
>       4192896 blocks [2/2] [UU]
> 
> md2 : active raid1 sda3[1]
>       151043008 blocks [2/1] [_U]
> 
> All is well with md0 and md1 for now.  I will work on recovering md2
> later.  But if I reboot, sdb1 and sdb2 disappear from my raid
> configuration, as if I hadn't added them somewhere?
> 
> Personalities : [raid1]
> md0 : active raid1 sda1[1]
>       1052160 blocks [2/1] [_U]
> 
> md1 : active raid1 sda2[1]
>       4192896 blocks [2/1] [_U]
> 
> md2 : active raid1 sda3[1]
>       151043008 blocks [2/1] [_U]
> 
> unused devices: <none>
> 
> 
> Any ideas on how to make this 're-add' stick?
> 
> Sean
> 

You have found yourself in the same situation I found myself in recently. 
Actually my situation was slightly different, but the resulting problem is the 
same. In my case at re-boot md decided that one partition of a mirror was out of 
sync, and so initiated a re-sync with the other partition. However, the 
partition which was active contained a bad sector, so the re-sync failed, over 
and over and over..., just like yours is doing.

In order to fix my system I used the following steps.

First I took the offending filesystem offline. Then I copied the existing 
partition onto the good disk using dd with conv=noerror, so that it would 
continue past read errors. In my case I knew that the read error was not part 
of the actual filesystem in use because it passed fsck. When the copy was 
complete I ran fsck on the new filesystem just to be sure it had copied OK.
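
A minimal sketch of such a copy, for illustration only. The function name
copy_partition, the 4096-byte block size and the device names in the usage
note are assumptions, not the exact command that was run. Adding sync to
conv=noerror makes dd pad an unreadable block with zeros, so everything after
it stays at its original offset.

```shell
#!/bin/sh
# copy_partition SRC DST: copy SRC to DST, continuing past read errors.
# The function name and block size are illustrative assumptions.
copy_partition() {
    src=$1
    dst=$2
    # conv=noerror: keep going after a read error instead of aborting.
    # conv=sync:    pad short or failed reads with zeros up to bs, so data
    #               after a bad block keeps its original offset.
    # A smaller bs loses less data around a bad sector but copies more slowly.
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync
}
```

It would be called with the failing partition first and the good one second,
for example: copy_partition /dev/sdb3 /dev/sda3 (device names as in my case;
double-check them, because dd overwrites the destination without asking).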

After this I created a new RAID consisting of just the good partition (in my 
case the RAID was md1 and the new partition was sda3):
  # mdadm -C /dev/md1 --force -n 1 -l 1 /dev/sda3

As a temporary fix, until a new disk arrived, I ran
   # e2fsck -c -d -f /dev/sdb3
to mark bad blocks (sdb3 was the failing partition).
Then I ran:
   # mdadm --zero-superblock /dev/sdb3
to remove the md superblock from the partition so it was no longer part of a RAID.

Finally, I used mdadm to add the dodgy partition back into the RAID:

# mdadm -a /dev/md1 /dev/sdb3

and to grow the RAID to 2 partitions:

# mdadm --grow -n 2 /dev/md1
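
To see whether an array still has a missing member after steps like these,
/proc/mdstat can be scanned for a "_" in the status field (for example [_U]).
The following is a rough sketch; the function name degraded_arrays is made up
for illustration, and it assumes the mdstat layout shown earlier in this
thread.

```shell
#!/bin/sh
# degraded_arrays [FILE]: print the name of every md array whose status
# field (e.g. [_U]) shows a missing member. FILE defaults to /proc/mdstat.
# The function name is an illustrative assumption.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember current array
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name } # "_" marks a missing member
    ' "${1:-/proc/mdstat}"
}
```

Run with no argument it reads the live /proc/mdstat; empty output means every
array listed there has all of its members.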

When the replacement disk arrived I failed all the /dev/sdb partitions, and 
added the new disk.

I'm not sure what the problem is with your /dev/sda; it might be a simple block 
read error, but you might also have to replace that one.
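
One way to get a feel for which it is: pull the failing sector numbers out of
the kernel log and count them. If the same sector shows up every time, that
points towards a single unreadable spot rather than a disk failing all over,
though it proves nothing either way. This is only a sketch; the function name
failing_sectors and the default log path /var/log/messages are assumptions,
and it relies on the "end_request: I/O error, dev ..., sector ..." message
format quoted above appearing on one line in the real log.

```shell
#!/bin/sh
# failing_sectors [FILE]: list each distinct "device sector" pair reported
# in end_request I/O error lines, with a count of how often it appears.
# FILE defaults to /var/log/messages (an assumption about the syslog path).
failing_sectors() {
    sed -n 's/.*end_request: I\/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*/\1 \2/p' \
        "${1:-/var/log/messages}" | sort | uniq -c
}
```

Each output line is a count followed by the device and sector number.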

-- 
Nigel Wade, System Administrator, Space Plasma Physics Group,
             University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw at ion.le.ac.uk
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555