Hi, everybody.
I've found a workaround for a problem with a raid 1 array. I'm posting to share the "solution" and to ask why this went wrong in the first place.
On an old test machine, I have two 4-year-old Seagate IDE drives in a raid1 mirror for my home partition. One failed, and Seagate was very pleasant about replacing it. I didn't even need a receipt! They just went by the serial number to confirm I was still under warranty.
I followed "the usual" procedure for failed drives. mdadm marked the drive as a failure and it was removed from the array. Then I tried to add it back into the raid 1.
One of the HOWTOs I relied on was this one: http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array
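For reference, the sequence I followed was roughly this (device names are from my setup; the sfdisk step, which copies the partition table from the surviving disk onto the replacement, comes from the HOWTO and may need adjusting for other layouts):

# mdadm --manage /dev/md0 --fail /dev/sdb1
# mdadm --manage /dev/md0 --remove /dev/sdb1
  (shut down, swap in the replacement disk, boot)
# sfdisk -d /dev/sdc | sfdisk /dev/sdb
# mdadm --manage /dev/md0 --add /dev/sdb1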
Here's what went wrong. After being added, the new drive went through a long "recovery" process (about 2 hours), but when it finished, the new drive was marked as "spare" and the raid1 array continued to show only one active drive.
Every time the system restarts, the new drive resyncs again: it copies for 2 hours, but it never joins the array as an active member. It always ends up as a spare.
In the end, I gave up trying to fix /dev/md0. I "guessed" at a solution: create a new /dev/md1 device and refit the system to use it. I explain that fix below, in case the same problem hits other people.
But I'm still curious to know why it did not work.
Now the details:
The raid1 array was /dev/md0; it used partitions sdb1 and sdc1, and the one that failed was sdb1.
Here's what I saw while the new drive was being added:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[2]
      244195904 blocks [2/1] [_U]
      [==================>..]  recovery = 94.4% (230658240/244195904) finish=6.9min speed=32396K/sec
unused devices: <none>
# mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 37e6e9b6:34cdfcb2:63afba50:8b88d6fc
  Creation Time : Sat Aug 18 19:10:40 2007
     Raid Level : raid1
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu Oct 29 00:35:50 2009
          State : clean
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1
       Checksum : a557d3b3 - correct
         Events : 6874

      Number   Major   Minor   RaidDevice State
this     2       8       17        2      spare   /dev/sdb1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      spare   /dev/sdb1
Near the end of the rebuild (97% complete at this point), here's the situation: the new drive is still a spare:
# mdadm --examine /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 37e6e9b6:34cdfcb2:63afba50:8b88d6fc
  Creation Time : Sat Aug 18 19:10:40 2007
     Raid Level : raid1
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu Oct 29 00:35:50 2009
          State : clean
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1
       Checksum : a557d3c7 - correct
         Events : 6874

      Number   Major   Minor   RaidDevice State
this     1       8       33        1      active sync   /dev/sdc1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      spare   /dev/sdb1
# mdadm --query /dev/md0
/dev/md0: 232.88GiB raid1 2 devices, 1 spare. Use mdadm --detail for more detail.
# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Sat Aug 18 19:10:40 2007
     Raid Level : raid1
     Array Size : 244195904 (232.88 GiB 250.06 GB)
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Oct 29 00:35:50 2009
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 97% complete

           UUID : 37e6e9b6:34cdfcb2:63afba50:8b88d6fc
         Events : 0.6874

    Number   Major   Minor   RaidDevice State
       2       8       17        0      spare rebuilding   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
After that, the rebuild seems to have finished:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[2]
      244195904 blocks [2/1] [_U]
But only one drive is active in the array:
# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Sat Aug 18 19:10:40 2007
     Raid Level : raid1
     Array Size : 244195904 (232.88 GiB 250.06 GB)
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Oct 29 00:43:21 2009
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : 37e6e9b6:34cdfcb2:63afba50:8b88d6fc
         Events : 0.6880

    Number   Major   Minor   RaidDevice State
       2       8       17        0      spare rebuilding   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
# mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 37e6e9b6:34cdfcb2:63afba50:8b88d6fc
  Creation Time : Sat Aug 18 19:10:40 2007
     Raid Level : raid1
  Used Dev Size : 244195904 (232.88 GiB 250.06 GB)
     Array Size : 244195904 (232.88 GiB 250.06 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu Oct 29 00:44:02 2009
          State : clean
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1
       Checksum : a557d5af - correct
         Events : 6882

      Number   Major   Minor   RaidDevice State
this     2       8       17        2      spare   /dev/sdb1

   0     0       0        0        0      removed
   1     1       8       33        1      active sync   /dev/sdc1
   2     2       8       17        2      spare   /dev/sdb1
I tried a lot of ways to set this right: growing the array, setting the number of spares to 0, removing and re-adding the disk, and so forth. No success.
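From memory, the attempts looked something like the following; I did not keep exact notes, so treat these as illustrative rather than verbatim:

# mdadm --grow /dev/md0 --raid-devices=2
# mdadm /dev/md0 --remove /dev/sdb1
# mdadm /dev/md0 --re-add /dev/sdb1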
After many attempts, I gave up trying to get /dev/md0 to work. So I stopped it and then used the "--assume-clean" option to create a new array on md1. I found that suggestion here:
http://neverusethisfont.com/blog/tags/mdadm/
# mdadm -S /dev/md0
# mdadm --create --assume-clean --level=1 --raid-devices=2 /dev/md1 /dev/sdc1 /dev/sdb1
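If anyone repeats this, it's probably worth confirming that both members really come up as "active sync" in the new array before touching any configuration, e.g.:

# cat /proc/mdstat
# mdadm --detail /dev/md1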
That worked! So I just needed to update the configuration to use the new array. First, grab the metadata:
# mdadm --detail --scan
ARRAY /dev/md1 metadata=0.90 UUID=6a408f8b:515f605f:bfe78010:bc810f04
And revise the mdadm.conf file:
# cat /etc/mdadm.conf
DEVICE /dev/sdb1 /dev/sdc1
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=6a408f8b:515f605f:bfe78010:bc810f04 devices=/dev/sdc1,/dev/sdb1
And I changed /etc/fstab to point at md1, not md0.
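For completeness, the fstab change is just swapping the device name. Assuming an ext3 home partition (the filesystem type and options will vary on other systems), the line ends up looking something like:

/dev/md1   /home   ext3   defaults   0   2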
But why did /dev/md0 hate me in the first place?
I wonder if it was personal :(