Random kernel update breakage

Wed Aug 25 12:06:06 UTC 2010

On Wed, 25 Aug 2010 07:10:11 -0400, Sam wrote:

> > And from those, who are hit by it occasionally, nobody attempts at
> > debugging it or collecting all the info. Dunno whether the root cause of
> > the problem is known by anyone.
> 
> I think that's because nobody really knows how to debug it. At least I 
> don't, for example. The boot process is an obscure bit of knowledge.

Okay, but collecting details that _might_ be important would still be
possible. For example: do you use hibernate, suspend, partition resizing
tools, special backup software to restore from backups, or fs checks that
corrected hard errors? Perhaps you applied a BIOS update some time before
the kernel update?

> If I knew how to save the broken MBR then I would've done it.

dd if=/dev/sda of=mbr.bin bs=512 count=1

for example, saves the entire first sector, including the partition
table. With bs=446, only the boot code would be saved, without the
partition table.
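
A rough sketch of how one might keep a known-good copy and compare it
later. The function name save_mbr and the file names are made up for
illustration, and /dev/sda is only an assumption about which disk you
boot from:

```shell
#!/bin/sh
# Sketch only: save_mbr and the file names below are hypothetical.

save_mbr() {
    # $1 = disk device (or a disk image file), $2 = output file
    # Copies the first 512-byte sector: boot code plus partition table.
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# While the system still boots (as root):
#   save_mbr /dev/sda mbr-good.bin
#
# After the breakage, e.g. from a rescue system:
#   save_mbr /dev/sda mbr-broken.bin
#   cmp -l mbr-good.bin mbr-broken.bin
#
# cmp -l lists every differing byte with its 1-based position.
# Positions 1 to 446 are boot code, 447 to 510 the partition table.
```

If cmp prints nothing, the first sector is unchanged and the problem
lies elsewhere.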

My theory is that if the MBR is unmodified when the problem is encountered,
something may have moved GRUB's stage* files so that GRUB can no longer
find them at their previous location on disk. It would then load and
execute unexpected garbage. The boot record doesn't contain the extfs
support, so it cannot understand filesystems until it has loaded its
missing parts.

If you run into the problem regularly, you could also save the "stat"
output for GRUB's /boot/grub/stage* files while it still works and
compare it with the output once it no longer boots.
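
Something along these lines might do, assuming GNU stat (for the -c
format option). The function name snapshot_stage_files and the output
file names are invented for the example:

```shell
#!/bin/sh
# Sketch only: snapshot_stage_files and the file names are hypothetical.

snapshot_stage_files() {
    # $1 = GRUB directory, $2 = output file
    # Records name, size, inode number and timestamps of each stage* file.
    stat -c '%n size=%s inode=%i mtime=%y ctime=%z' "$1"/stage* > "$2"
}

# While the system still boots:
#   snapshot_stage_files /boot/grub stage-good.txt
#
# After the breakage, with the same filesystem mounted at the same path:
#   snapshot_stage_files /boot/grub stage-broken.txt
#   diff stage-good.txt stage-broken.txt
```

A changed inode number or timestamp would suggest that something
rewrote or replaced the file between the two snapshots.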

> I'm not sure that it's the MBR itself that gets trashed, or at least not 
> always. One time the machine started beeping the console bell constantly. 
> Another time I ended up staring at "GRUB" in white-on-black. This time, I 
> just ended up at an empty, blank, text console screen.

That might be evidence of GRUB executing "random" data found on the hard
disk where it expected its modules.