FC3 machine, two disks, one mirrors the other. Software RAID. Lovely. (Time passes.) The first disk (a Hitachi Deskstar 7K250, if anyone cares) dies suddenly. The RAID software does the right thing (more or less: the machine was unusable until after a reboot).
But now what? (Hint: a nice Fedora Docs topic; it's sorely underdocumented in general :-) This article [http://mark.foster.cc/articles/raid-rebuild.html] is an exception.) Please advise if I've got any of the following wrong...
* Get RAID to "stand down" re the dead disk; something like...
("mdadm --query --detail /dev/md<whatever>" to get the facts...)
# mdadm /dev/md0 --set-faulty /dev/sda1 --remove /dev/sda1
# mdadm /dev/md1 --set-faulty /dev/sda2 --remove /dev/sda2
# mdadm /dev/md2 --set-faulty /dev/sda5 --remove /dev/sda5
# mdadm /dev/md3 --set-faulty /dev/sda6 --remove /dev/sda6
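(A quick sanity check, sketch only -- the exact /proc/mdstat layout varies by kernel version:)
cat /proc/mdstat
# each mirror should now be running on one leg: a status like
# "[2/1] [_U]" rather than the healthy "[2/2] [UU]"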
* Which physical disk is the guilty party?
(Oooh, shoulda thought about this _much_ earlier...) OK, I've gotta yank one disk out, now which one is it? They're identical; oops.
Well, it's the first one (/dev/sda, also logged as 'ata1'), so if I look in the motherboard manual and find where SATA channel 0 is connected, and follow that wire... that'll be right, won't it?
A better plan? -- take the cover off, fire the machine up, it's only using one disk, right? -- so I feel which one is doing something, and take out the other.
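(Yet another idea, untested: match serial numbers. Assuming the rescue media has smartmontools and the SATA controller passes SMART queries through -- not a given with this vintage of kit -- the following prints each disk's model and serial number, to match against the sticker on the drive itself:)
smartctl -i /dev/sda
smartctl -i /dev/sdb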
* Will the machine's GRUBness work when the disk is yanked and replaced?
I am about to replace what was /dev/sda with a fresh disk. Has GRUB got hidden secrets, e.g. in the MBR, such that it won't boot without the expected first disk?
Even though the machine was RAID-1'd from day one, its grub.conf includes the comment...
#boot=/dev/sda1
... which is worrying.
Or should I expect to boot in with the rescue CD, get the RAID stuff tidied up, and only then get back to normal booting?
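(A related sanity check, a sketch: GRUB's stage1 carries the literal string "GRUB" in its boot code, so you can peek at an MBR to see whether it's there:)
dd if=/dev/sda bs=512 count=1 2>/dev/null | strings | grep GRUB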
* Yank old, put in new disk.
(*Mark* it as '/dev/sda' or whatever, for future reference!)
* Getting the right partitioning on the replacement disk.
Seems straightforward:
# sfdisk -d /dev/sdb > /tmp/my-disk-partitioning
# edit /tmp/my-disk-partitioning, replacing 'sdb' with 'sda'
# sfdisk /dev/sda < /tmp/my-disk-partitioning
* Telling the RAID gubbins of its new friends:
This should be plain sailing...
# mdadm /dev/md0 --add /dev/sda1
# mdadm /dev/md1 --add /dev/sda2
# mdadm /dev/md2 --add /dev/sda5
# mdadm /dev/md3 --add /dev/sda6
Any looming disasters in all of that? Assuming non-trivial feedback, I will summarize back to the list. Thanks!
Will
On Tue, Aug 30, 2005 at 02:21:11PM +0100, Will Partain wrote:
Getting the right partitioning on the replacement disk.
Seems straightforward:
# sfdisk -d /dev/sdb > /tmp/my-disk-partitioning
# edit /tmp/my-disk-partitioning, replacing 'sdb' with 'sda'
# sfdisk /dev/sda < /tmp/my-disk-partitioning
You need to install GRUB stage1 in the first 446 bytes of the MBR (the boot-code area that precedes the partition table).
Also, the device name in the sfdisk input does not seem to matter, since you specify the device on the command line.
This should do the trick:
dd if=/dev/sdb of=/dev/sda count=1
sfdisk -d /dev/sdb | sed -e 's!/dev/sdb!/dev/sda!g' | sfdisk /dev/sda
[The 'sed' is optional.]
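[A variant, untested here, that copies only the 446-byte boot-code area and so leaves the target's partition table alone:]
dd if=/dev/sdb of=/dev/sda bs=446 count=1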
Telling the RAID gubbins of its new friends:
This should be plain sailing...
# mdadm /dev/md0 --add /dev/sda1
# mdadm /dev/md1 --add /dev/sda2
# mdadm /dev/md2 --add /dev/sda5
# mdadm /dev/md3 --add /dev/sda6
Any looming disasters in all of that? Assuming non-trivial feedback, I will summarize back to the list. Thanks!
I'm superstitious, so I "sleep 1" between md adds.
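Spelled out as a sketch, using your array/partition pairs from above:
for p in md0:sda1 md1:sda2 md2:sda5 md3:sda6; do
    mdadm /dev/${p%%:*} --add /dev/${p##*:}
    sleep 1
done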
-Bill
FC3 machine, two disks, one mirrors the other. Software RAID. Lovely. (Time passes.) The first disk (a Hitachi Deskstar 7K250, if anyone cares) dies suddenly. The RAID software does the right thing (more or less: the machine was unusable until after a reboot).
In a previous message, I asked "But now what?", and set out some specific worries. I have now changed the disk -- no, it didn't go particularly well :-) -- and this message is to fill in what I've learned. *Thank you* to Bill Rugolsky for chipping in with some ideas; had I understood them correctly, I might've done better :-(
(Repeat Hint: a nice Fedora Docs topic; it's SORELY UNDERDOCUMENTED in general :-) This article [http://mark.foster.cc/articles/raid-rebuild.html] is an exception.) On to the fun, step by step...
* Get RAID to "stand down" re the dead disk; something like...
("mdadm --query --detail /dev/md<whatever>" to get the facts...)
# mdadm /dev/md0 --set-faulty /dev/sda1 --remove /dev/sda1
# mdadm /dev/md1 --set-faulty /dev/sda2 --remove /dev/sda2
# mdadm /dev/md2 --set-faulty /dev/sda5 --remove /dev/sda5
# mdadm /dev/md3 --set-faulty /dev/sda6 --remove /dev/sda6
[Yep, that worked fine.]
* Which physical disk is the guilty party?
(Oooh, shoulda thought about this _much_ earlier...) OK, I've gotta yank one disk out, now which one is it? They're identical; oops.
Well, it's the first one (/dev/sda, also logged as 'ata1'), so if I look in the motherboard manual and find where SATA channel 0 is connected, and follow that wire... that'll be right, won't it? [Update: yes, that is right.]
A better plan? -- take the cover off, fire the machine up, it's only using one disk, right? -- so I feel which one is doing something, and take out the other. [Update: no, useless idea -- there's too much noise/vibration to be able to distinguish a doing-something disk from an idle one.]
* Yank old, put in new disk.
(*Mark* it as '/dev/sda' or whatever, for future reference!)
* Will the machine's GRUBness work when the disk is yanked and replaced?
I just replaced what was /dev/sda with a fresh disk. Has GRUB got hidden secrets, e.g. in the MBR, such that it won't boot without the expected first disk?
Yes, this turned out to be exactly the case. Put in the new disk, and it won't boot at all (because the first GRUB piece is in the MBR of the disk I just yanked). So I booted with the rescue CD instead; stay tuned.
* Getting the right partitioning on the replacement disk.
It would seem straightforward:
# sfdisk -d /dev/sdb > /tmp/my-disk-partitioning
# edit /tmp/my-disk-partitioning, replacing 'sdb' with 'sda'
# sfdisk /dev/sda < /tmp/my-disk-partitioning
FAILED: sfdisk off the rescue CD crashed hopelessly. I had to do the partitioning "by hand" with the ever-reliable 'fdisk'.
* Sorting out GRUB (or "being able to boot the machine")
BIG SURPRISE: the rescue CD DOES NOT INCLUDE 'grub' or 'grub-install'!
What I _think_ I should've done is:
1. Before I did anything (i.e. with the old faltering /dev/sda), I should've done...
dd if=/dev/sda of=/dev/sdb count=1
... to get GRUB-stage-0 into the MBR of the disk I'm about to keep.
2. Don't I also need to mangle /boot/grub/grub.conf...
- to get rid of 'hiddenmenu' (so that I will be shown the choices)
- add at least one item to the grub menu so that grub will look on the second disk (hd1)?
(This is the critical step, and I'm not sure about it.)
3. While rebooting (now with the new /dev/sda), tell the BIOS to boot from the second disk
4. Upon successful boot from 2nd disk, do whatever I'm going to do... (below)
5. If I wish, reverse the process:
# copy grub-stage-0 onto MBR of new disk:
dd if=/dev/sdb of=/dev/sda count=1
Tell BIOS to go back to booting from first disk.
Note 1: the above is ***UNTESTED***
Note 2: this couldn't have worked if the old /dev/sda had been *DEAD* rather than just "failing". I'm not at all sure how I would've gotten grub-stage-0 onto the still-alive disk (/dev/sdb) at all, were I unwise enough to do a 'shutdown -h'.
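(For completeness: what I now believe is the canonical recipe -- it is what the raid-rebuild article above describes, though I have NOT re-tested it myself, and it assumes /boot is the first partition on the surviving disk:)
grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
('device' tells grub to treat the survivor as the first BIOS disk; 'setup' then writes stage1 into /dev/sdb's MBR.)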
* A nice thing about this software RAID stuff...
As far as I could tell, you can use a degraded partition (half-a-RAID-1) as a normal ext3 partition. So, for example, you can do:
mkdir /mnt/foo ; mount /dev/sdb2 /mnt/foo
However, this may just be playing with fire (see next item).
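(The less fiery route, I believe, is to assemble the degraded mirror properly and mount the md device rather than the bare member; '--run' tells mdadm to start the array even though half of it is missing:)
mdadm --assemble --run /dev/md1 /dev/sdb2
mount /dev/md1 /mnt/foo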
* A _big mistake_ that I made! --
At some point, with my new disk (/dev/sda) in and booted off the rescue CD, I did something like...
dd if=/dev/sdb2 of=/dev/sda2
... i.e. brute-force copy all of a raid-1 partition onto its presently-empty cousin. Theory: "/dev/sda is empty and not in play; what harm can it do?"
Answer (I think): Lots. The RAID software snoops around on these (type 'fd') partitions and silently decides what to make of the situation. This is really not what you want in this delicate state. Information is OK ("I've spotted a degraded array on /dev/sdb2, which seems odd"), and doing nothing is OK, but "being helpful" isn't. (Is there a kernel boot parameter to turn off RAID cleverness?)
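(Partially answering myself, untested: I believe 'raid=noautodetect' is that parameter -- it stops the kernel from auto-assembling type-'fd' partitions. If the rescue CD passes boot-prompt options through to the kernel, something like:)
linux rescue raid=noautodetect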
* Where I ended up: by mounting my degraded /dev/sdb2, I was able to get access to a copy of GRUB, and run it. What I ended up with: a /boot/grub/ that had all the grub stuff in it, *but no kernel* (not really sure why). I told the BIOS to boot from 2nd disk, and it duly dropped me into a GRUB prompt.
As I was no longer sure I even had a copy of the kernel on my disk, I decided it was time for an FC4 install/upgrade :-)
* A happy ending: I had originally done an old-fashioned several-partitions install (/boot, /, /var, /home). As I was forced into an install, I was going to "lose" /boot and /. Happily, however, the whole process stayed well away from the /home partition -- what I was most keen to preserve in the first place.
* Anyway, after the install, all that was left was to tell the RAID array for /home of its new friend:
# mdadm /dev/md3 --add /dev/sda6
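The background rebuild shows up as a progress line in /proc/mdstat, so you can watch it happen:
watch -n 5 cat /proc/mdstat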
Machine upgraded, no data lost, a little time wasted.
Again, I would advise more attention to this topic (raid-you-set-up-a-year-or-two-ago-and-forgot-about => disk fails => getting things back painlessly). With disks as cheap as they are, *everyone* should consider mirroring. But as things stand, I'd expect a high failure rate even among the technically-inclined.
Will
Will Partain wrote:
A _big mistake_ that I made! --
At some point, with my new disk (/dev/sda) in and booted off the rescue CD, I did something like...
dd if=/dev/sdb2 of=/dev/sda2
... i.e. brute-force copy all of a raid-1 partition onto its presently-empty cousin. Theory: "/dev/sda is empty and not in play; what harm can it do?"
Answer (I think): Lots. The RAID software snoops around on these (type 'fd') partitions and silently decides what to make of the situation. This is really not what you want in this delicate state. Information is OK ("I've spotted a degraded array on /dev/sdb2, which seems odd"), and doing nothing is OK, but "being helpful" isn't. (Is there a kernel boot parameter to turn off RAID cleverness?)
Doing this copies the UUIDs that the RAID software uses to identify each partition of the RAID. It's the RAID equivalent of having identical filesystem labels on two partitions so that mount doesn't know which one to use when you use the "LABEL=" syntax in fstab. You're best off just creating the partitions and adding them to the array; the RAID software will then restore the contents from the other mirror in the background whilst the system is running.
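(You can see the UUID in question by examining a member partition's md superblock, e.g.:)
mdadm --examine /dev/sdb2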
Paul.
Paul Howarth paul@city-fan.org writes, quoting me:
A _big mistake_ that I made! --
At some point, with my new disk (/dev/sda) in and booted off the rescue CD, I did something like...
dd if=/dev/sdb2 of=/dev/sda2
... i.e. brute-force copy all of a raid-1 partition onto its presently-empty cousin. Theory: "/dev/sda is empty and not in play; what harm can it do?"
Answer (I think): Lots. The RAID software snoops around on these (type 'fd') partitions and silently decides what to make of the situation. This is really not what you want in this delicate state. Information is OK ("I've spotted a degraded array on /dev/sdb2, which seems odd"), and doing nothing is OK, but "being helpful" isn't. (Is there a kernel boot parameter to turn off RAID cleverness?)
Doing this copies the UUIDs that the RAID software uses to identify each partition of the RAID. It's the RAID equivalent of having identical filesystem labels on two partitions so that mount doesn't know which one to use ...
Yes, I'm painfully aware :-(, and was shortly after I did it. I hadn't counted on the RAID software being clever behind my back...
I'm not sure that's what sunk me, though. I _seem_ to have somehow managed to end up without a /boot/vmlinuz<something> (previously mis-wrote: /boot/grub/vm...). When you're typing grub commands into a pre-boot grub> prompt, you're one step away from a soldering iron, with which I certainly cannot be trusted.
Once more: all this is something well worth robustifying.
Will
On Tue, Sep 06, 2005 at 11:32:21AM +0100, Will Partain wrote:
In a previous message, I asked "But now what?", and set out some specific worries. I have now changed the disk -- no, it didn't go particularly well :-) -- and this message is to fill in what I've learned. *Thank you* to Bill Rugolsky for chipping in with some ideas; had I understood them correctly, I might've done better :-(
For the archive: I recently (last week) had an incident with FC3 where a kernel upgrade rendered the machine unbootable. The problem is apparently a bug in GRUB: it is too stupid to figure out which BIOS drives correspond to /dev/md0, and it ignores attempts to use /dev/sda instead; so rather than leaving the configuration alone, it makes the system completely unusable.
Solution:
http://xdroop.dhs.org/space/Linux/Grub+won%27t+install+to+raid-1
This bug has apparently been with us since 7.3 (snarky observation deleted).