FOLLOW-UP: RAID-1 (mirroring) disk failed; now what? [failed]
Will Partain
will.partain at verilab.com
Tue Sep 6 10:32:21 UTC 2005
FC3 machine, two disks, one mirrors the other. Software RAID.
Lovely. (Time passes.) The first disk (a Hitachi Deskstar 7K250, if
anyone cares) dies suddenly. The RAID software does the right thing
(more or less: the machine was unusable until after a reboot).
In a previous message, I asked "But now what?", and set out some
specific worries. I have now changed the disk -- no, it didn't go
particularly well :-) -- and this message is to fill in what I've
learned. *Thank you* to Bill Rugolsky for chipping in with some
ideas; had I understood them correctly, I might've done better :-(
(Repeat Hint: a nice Fedora Docs topic; it's SORELY UNDERDOCUMENTED in
general :-) This article [http://mark.foster.cc/articles/raid-rebuild.html]
is an exception.) On to the fun, step by step...
* Get RAID to "stand down" re the dead disk; something like...
("mdadm --query --detail /dev/md<whatever>" to get the facts...)
# mdadm /dev/md0 --set-faulty /dev/sda1 --remove /dev/sda1
# mdadm /dev/md1 --set-faulty /dev/sda2 --remove /dev/sda2
# mdadm /dev/md2 --set-faulty /dev/sda5 --remove /dev/sda5
# mdadm /dev/md3 --set-faulty /dev/sda6 --remove /dev/sda6
[Yep, that worked fine.]
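A quick sanity check at this point (my suggestion, not part of the
original steps): /proc/mdstat shows a degraded RAID-1 with only one
member listed and an underscore in the "[UU]" status field. A small
demo on a canned mdstat fragment, plus the check for the tell-tale
underscore:

```shell
# On the live machine you would simply run:  cat /proc/mdstat
# Below, a sample of what a degraded one-member RAID-1 looks like
# (the numbers are made up for illustration):
sample='md0 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]'
# "[_U]" (or "[U_]") means one of the two mirror halves is missing:
printf '%s\n' "$sample" | grep -o '\[_U\]\|\[U_\]'
```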
* Which physical disk is the guilty party?
(Oooh, shoulda thought about this _much_ earlier...) OK, I've gotta
yank one disk out, now which one is it? They're identical; oops.
Well, it's the first one (/dev/sda, also logged as 'ata1'), so if I
look in the motherboard manual and find where SATA channel 0 is
connected, and follow that wire... that'll be right, won't it?
[Update: yes, that is right.]
A better plan? -- take the cover off, fire the machine up, it's only
using one disk, right? -- so I feel which one is doing something,
and take out the other.
[Update: no, useless idea -- there's too much noise/vibration to be
able to distinguish a doing-something disk from an idle one.]
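One more idea, hedged because I didn't try it here: most drives have
their serial number printed on the label, and smartctl -i (from
smartmontools, assuming it's installed) reports the same serial for a
given device name -- so you can match /dev/sda to a physical drive
without tracing cables. A sketch of pulling the serial out of
smartctl-style output (the serial below is a made-up placeholder):

```shell
# On the live machine:  smartctl -i /dev/sda | grep -i 'serial'
# Demo of the extraction on a canned line of smartctl output:
fake='Serial Number:    Y3XXV0DE'
printf '%s\n' "$fake" | awk -F': *' '/Serial Number/ {print $2}'
```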
* Yank old, put in new disk.
(*Mark* it as '/dev/sda' or whatever, for future reference!)
* Will the machine's GRUBness work when the disk is yanked and replaced?
I just replaced what was /dev/sda with a fresh disk. Has GRUB
got hidden secrets, e.g. in the MBR, such that it won't boot without
the expected first disk?
Yes, this turned out to be exactly the case. Put in the new
disk, and it won't boot at all (because the first GRUB piece is in
the MBR of the disk I just yanked). So I booted with the rescue CD
instead; stay tuned.
* Getting the right partitioning on the replacement disk.
It would seem straightforward:
# sfdisk -d /dev/sdb > /tmp/my-disk-partitioning
# edit /tmp/my-disk-partitioning, replacing 'sdb' with 'sda'
# sfdisk /dev/sda < /tmp/my-disk-partitioning
FAILED: sfdisk off the rescue CD crashed hopelessly. I had to do
the partitioning "by hand" with the ever-reliable 'fdisk'.
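For the record, when sfdisk behaves the whole copy can be done in one
pipe, with sed doing the "edit" step (device names as in the original;
the live pipe itself is untested here):

```shell
# One-pipe version, to be run on the live machine at your own risk:
#   sfdisk -d /dev/sdb | sed 's,/dev/sdb,/dev/sda,g' | sfdisk /dev/sda
# What the sed step does, shown on one line of a typical sfdisk dump:
echo '/dev/sdb1 : start= 63, size= 208782, Id=fd, bootable' \
  | sed 's,/dev/sdb,/dev/sda,g'
```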
* Sorting out GRUB (or "being able to boot the machine")
BIG SURPRISE: the rescue CD DOES NOT INCLUDE 'grub' or
'grub-install'!
What I _think_ I should've done is:
1. Before I did anything (i.e. with the old faltering /dev/sda), I
should've done...
dd if=/dev/sda of=/dev/sdb count=1
... to get GRUB-stage-0 into the MBR of the disk I'm about to
keep.
2. Don't I also need to mangle /boot/grub/grub.conf...
- to get rid of 'hiddenmenu' (so that I will be shown the choices)
- add at least one item to the grub menu so that grub will look
on the second disk (hd1)?
(This is the critical step, and I'm not sure about it.)
3. While rebooting (now with the new /dev/sda), tell the BIOS to
boot from the second disk
4. Upon successful boot from 2nd disk, do whatever I'm going to do... (below)
5. If I wish, reverse the process:
# copy grub-stage-0 onto MBR of new disk:
dd if=/dev/sdb of=/dev/sda count=1
Tell BIOS to go back to booting from first disk.
Note 1: the above is ***UNTESTED***
Note 2: this couldn't have worked if the old /dev/sda had been
*DEAD* rather than just "failing". I'm not at all sure how I
would've gotten grub-stage-0 onto the still-alive disk (/dev/sdb) at
all, were I unwise enough to do a 'shutdown -h'.
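One refinement to the dd step that I'm fairly confident of: the MBR's
first 446 bytes are boot code and the remaining 66 are the partition
table plus the 0x55AA signature, so "bs=446 count=1" copies the GRUB
stage without clobbering the target's partition table. A demo on
scratch files standing in for the two disks (substitute the real
/dev/sdb and /dev/sda on the live machine, with care):

```shell
cd "$(mktemp -d)"    # work somewhere harmless
# Fake 512-byte "MBRs": source has pretend boot code, target is blank.
{ printf 'GRUB'; head -c 508 /dev/zero; } > old-disk.img
head -c 512 /dev/zero > new-disk.img
# Copy only the 446 boot-code bytes; bytes 446-511 (partition table
# and signature) on the target are left alone:
dd if=old-disk.img of=new-disk.img bs=446 count=1 conv=notrunc 2>/dev/null
head -c 4 new-disk.img   # -> GRUB
```

For the truly-dead-disk case in Note 2, I believe the usual answer is
grub's own shell -- 'root (hd1,0)' then 'setup (hd1)' writes stage1 to
the second disk's MBR -- assuming you can lay hands on a grub binary,
which is exactly what the rescue CD annoyingly lacks.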
* A nice thing about this software RAID stuff...
As far as I could tell, you can use a degraded partition
(half-a-RAID-1) as a normal ext3 partition. So, for example,
you can do: mkdir /mnt/foo ; mount /dev/sdb2 /mnt/foo
However, this may just be playing with fire (see next item).
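A gentler alternative, which I believe is the "approved" route (again
untested by me): assemble the degraded array explicitly and mount the
md device, so the RAID superblock stays in charge instead of ext3
seeing the bare partition. A guarded sketch, device names from above:

```shell
# The guard keeps this from touching anything unless you really mean
# it on the live machine:
if [ "$REALLY" = yes ]; then
    # --run: start the array even though a member is missing
    mdadm --assemble --run /dev/md1 /dev/sdb2
    mkdir -p /mnt/foo
    mount /dev/md1 /mnt/foo
else
    echo "dry run: set REALLY=yes on the real machine"
fi
```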
* A _big mistake_ that I made! --
At some point, with my new disk (/dev/sda) in and booted off the
rescue CD, I did something like...
dd if=/dev/sdb2 of=/dev/sda2
... i.e. brute-force copy all of a raid-1 partition onto its
presently-empty cousin. Theory: "/dev/sda is empty and not in play;
what harm can it do?"
Answer (I think): Lots. The RAID software snoops around on these
(type 'fd') partitions and silently decides what to make of the
situation. This is really not what you want in this delicate
state. Information is OK ("I've spotted a degraded array on
/dev/sdb2, which seems odd"), and doing nothing is OK, but "being
helpful" isn't. (Is there a kernel boot parameter to turn off RAID
cleverness?)
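On that closing question: there is such a parameter -- raid=noautodetect
tells the kernel to skip autostarting type-0xfd partitions, leaving
assembly to you via mdadm. On the kernel line in grub.conf it would
look something like this (kernel version and root device here are
illustrative, not from my machine):

```
title Fedora Core (no RAID autodetect)
        root (hd0,0)
        kernel /vmlinuz-<version> ro root=/dev/md1 raid=noautodetect
        initrd /initrd-<version>.img
```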
* Where I ended up: by mounting my degraded /dev/sdb2, I was able to
get access to a copy of GRUB, and run it. What I ended up with: a
/boot/grub/ that had all the grub stuff in it, *but no kernel* (not
really sure why). I told the BIOS to boot from 2nd disk, and it
duly dropped me into a GRUB prompt.
As I was no longer sure I even had a copy of the kernel on my disk,
I decided it was time for an FC4 install/upgrade :-)
* A happy ending: I had originally done an old-fashioned
several-partitions install (/boot, /, /var, /home). As I was forced
into an install, I was going to "lose" /boot and /. Happily,
however, the whole process stayed well away from the /home partition
-- what I was most keen to preserve in the first place.
* Anyway, after the install, all that was left was to tell the RAID
array for /home of its new friend:
# mdadm /dev/md3 --add /dev/sda6
Machine upgraded, no data lost, a little time wasted.
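After the --add, the mirror rebuilds in the background, and
/proc/mdstat grows a progress line while it does. A sample of what to
look for (the numbers below are invented; on the live machine you'd
grep the real /proc/mdstat):

```shell
# Live machine:  grep -A2 md3 /proc/mdstat
# Canned sample of a rebuild in progress, and the line worth watching:
sample='md3 : active raid1 sda6[2] sdb6[1]
      77023232 blocks [2/1] [_U]
      [==>..................]  recovery = 12.4% (9551232/77023232) finish=38.2min'
printf '%s\n' "$sample" | grep -o 'recovery = [0-9.]*%'
```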
Again, I would advise more attention to this topic
(raid-you-set-up-a-year-or-two-ago-and-forgot-about => disk fails =>
getting things back painlessly). With disks as cheap as they are,
*everyone* should consider mirroring. But as things stand, I'd expect
a high failure rate even among the technically-inclined.
Will