[PATCH] mkinitrd rescue mode

Mon Jun 6 19:56:14 UTC 2005

On Mon, 2005-06-06 at 15:31 -0400, Peter Jones wrote:
> And if we're getting those, then there *is* a real mkinitrd bug, or a
> real module_upgrade bug, or a real new-kernel-pkg bug.  If Engineering
> isn't seeing BZ tickets on them, then Engineering and GPS both lose.  No
> new nash/mkinitrd features will help that at all.
> 

These problems are often configuration issues, or someone has removed
the /initrd directory (not such a problem with initramfs, but very
common with initrd problems).

On occasion this is a kernel issue, where a device is no longer being
detected correctly for some reason. In these cases, we _do_ open BZ
tickets and engineering gets them. Unfortunately, it often takes a
lot of time working with the user before we can determine that this is
the case. Making this easier to troubleshoot would be helpful for users
and us.

It's tough to simply categorize all of the instances where we see
problems with booting. If I could, then we could just eliminate them.
This is an attempt to get better visibility into an area where we are
generally flying blind.

> > The most helpful messages telling us what the problem is usually have 
> > scrolled off the screen. We generally have to fix this based on experience
> > and guesswork.
> 
> One (much more acceptable) solution for that would be a much simpler
> switchroot change -- make it look for a command line option "pause", and
> if it finds it, wait for the user to hit "enter" before executing init.
> For the overwhelmingly vast majority of boxes that'll get you enough
> scrollback in shift+pageup to see everything since kernel started.
> 

I'm open to this. I don't think it would be as helpful as a true rescue
mode, but it's better than nothing.
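
To make the "pause" idea concrete, here is a rough sketch of the check.
This is only an illustration in Python, not nash code: the real change
would live in nash's switchroot, and the names has_boot_option and
maybe_pause are made up for this example. It assumes the option is
passed as a bare word on the kernel command line (as read from
/proc/cmdline).

```python
def has_boot_option(cmdline: str, option: str) -> bool:
    """Return True if `option` appears as a whole word on the command line."""
    # Split on whitespace so "nopause" or "pause=1" do not match "pause".
    return option in cmdline.split()


def maybe_pause(cmdline: str, wait=input) -> bool:
    """Wait for the user if "pause" was given; report whether we paused.

    `wait` is injectable so the function can be exercised without a tty.
    """
    if has_boot_option(cmdline, "pause"):
        wait("Press enter to continue booting: ")
        return True
    return False


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        maybe_pause(f.read())
```

Matching on whole words is deliberate, so an unrelated option that
merely contains "pause" would not stall the boot.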

> > You can have the user set up a serial console, but users who are savvy
> > enough to figure out how to do that are generally able to troubleshoot
> > their own booting problems.  Being able to tell a user to add "rescue" 
> > to the command line and then to walk them through some commands (like 
> > dmesg) to try to determine the problem would be very helpful for a
> > number of different reasons.
> 
> This is something of a contradiction -- you've just said the user isn't
> savvy enough to debug boot problems, and then suggested a step that'd
> make them only very marginally easier.  

True, but this gives us _some_ ability to see what the problem is. Being
able to tell them (for instance), "OK, run 'cat /proc/mounts' and tell
me if /dev/hda2 is listed" would be very helpful. This tool wouldn't be
useful on its own to users with no clue, but it would help us walk such
users through the problem over the phone.
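
For what it's worth, the /proc/mounts check I'd be asking the user to
do by eye amounts to something like the sketch below. It is Python
purely for illustration, the function name is hypothetical, and the
sample text is made up rather than captured from a real machine. It
relies only on the standard /proc/mounts layout, where the first
whitespace-separated field on each line is the device.

```python
def device_is_mounted(mounts_text: str, device: str) -> bool:
    """Return True if `device` is the first field of any /proc/mounts line."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Compare the whole field so "/dev/hda" does not match "/dev/hda2".
        if fields and fields[0] == device:
            return True
    return False


# Illustrative sample only; real output will differ from box to box.
SAMPLE_MOUNTS = """rootfs / rootfs rw 0 0
/proc /proc proc rw 0 0
/dev/hda2 /sysroot ext3 ro 0 0
"""

if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(device_is_mounted(f.read(), "/dev/hda2"))
```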

> > It would also give people the ability to try to rescue corrupted root
> > filesystems without needing special infrastructure (like a PXE server) 
> > and without having to physically be near the machine (with a CD boot).
> 
> This is a strawman -- your scenario is that they've just installed or
> upgraded, in which case they've already set up this infrastructure or
> are already close to the box.
> 

Not necessarily -- serial consoles are very common in datacenters.

> > Since we're discussing this, I posted a proposed patch this morning to
> > nash to clean out the initramfs prior to the switchroot:
> > 
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=159636
> 
> Thank you for doing this.
> 

Gladly :-)

> > If nash is able to clean out the initramfs before switching the root, is
> > there any reason _not_ to have some useful tools in it?
> 
> Added complexity is bad.  Sometimes we have to add some, and that sucks.
> When we don't have to, in general the answer is "no".
> 
> So if you *really* think this is worth doing, I'm more likely to take a
> change to add a command line argument which causes nash to execute
> something on a second initramfs cpio ball in lieu of switchroot/init,
> and then an entirely separate image (unrelated to mkinitrd) to do your
> rescue stuff.
> 

I'm with you -- extra complexity is generally bad, but in this case I
don't see where it's harmful. If you don't want to use it, don't add
"rescue" to the command line (or generate your initrd images without
it).

To make sure I understand what you're proposing as an alternative...

You're proposing having a secondary cpio image containing the "rescue"
tools. We'd then pass this as a secondary initrd image to GRUB. We could
then either use the rescue command line parameter (more or less as-is),
or could key off the presence of something from the rescue image to
enter rescue mode.

If so, that would be acceptable to me, but I'll have to see how multiple
cpio archives work in practice (I've never used them).

-- 
Jeff Layton <jlayton at redhat.com>