I have upgraded three VMware images previously running Fedora 29 to Fedora 30 using system-upgrade. All the images are running on the same ESC hardware and, so far as I know, use the same VMware foundation. The first worked flawlessly. Both the second and third failed in exactly the same way: they failed to reboot successfully after the upgrade.
When the reboot occurred, VMware reported an "alloc magic" error immediately and froze. Not being a VMware expert, I referred the issue to someone who is, and he discovered that the MBR (master boot record) on the image was corrupt. After installing a new MBR, the system booted and the upgrade showed no further problems.
While investigating, we were sidetracked by the content of grub.cfg. It appears that grub no longer includes a section for each possible system to be booted. Not seeing any, we thought grub.cfg was corrupt also. But that is apparently not the case. After the MBR fix, the generated grub.cfg works properly.
On Wed, May 29, 2019 at 7:01 PM CLOSE Dave Dave.Close@us.thalesgroup.com wrote:
I have upgraded three VMware images previously running Fedora 29 to Fedora 30 using system-upgrade. All the images are running on the same ESC hardware and, so far as I know, use the same VMware foundation. The first worked flawlessly. Both the second and third failed in exactly the same way: they failed to reboot successfully after the upgrade.
When the reboot occurred, VMware reported an "alloc magic" error immediately and froze. Not being a VMware expert, I referred the issue to someone who is, and he discovered that the MBR (master boot record) on the image was corrupt. After installing a new MBR, the system booted and the upgrade showed no further problems.
Not sure what would corrupt it but there is competition for LBA 0, the MBR, in that there's a bootloader portion in the first ~440 bytes and then a partition table from that point until the 512th byte. So whenever something changes a partition or a boot flag (active bit) or bootloader jump code, there's a risk. This was such a well known problem it directly affected GPT. For one, don't use LBA 0. Two, make two copies in two totally different locations. Three, checksum everything. Four, give the bootloader its own home, no sharing.
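To make the sharing of LBA 0 concrete, here is a minimal Python sketch that splits a 512-byte sector into the regions of a classic MBR. The offsets (boot code up to byte 440, disk signature at 440, partition table at 446, 0x55AA signature at 510) follow the conventional layout; the function name `parse_mbr` and the returned field names are made up for illustration.

```python
import struct

SECTOR_SIZE = 512
BOOT_CODE_END = 440        # bootstrap code occupies bytes 0..439
PART_TABLE_OFFSET = 446    # four 16-byte partition entries start here
BOOT_SIGNATURE = b"\x55\xaa"


def parse_mbr(sector: bytes) -> dict:
    """Split a classic MBR sector into its regions (illustrative helper)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly 512 bytes")
    partitions = []
    for i in range(4):
        off = PART_TABLE_OFFSET + 16 * i
        entry = sector[off:off + 16]
        status, ptype = entry[0], entry[4]
        # Starting LBA and sector count are little-endian 32-bit values.
        first_lba, num_sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0:  # type 0 marks an unused slot
            partitions.append({
                "index": i,
                "active": status == 0x80,
                "type": ptype,
                "first_lba": first_lba,
                "sectors": num_sectors,
            })
    return {
        "boot_code_empty": not any(sector[:BOOT_CODE_END]),
        "disk_signature": sector[440:444].hex(),
        "partitions": partitions,
        "signature_ok": sector[510:512] == BOOT_SIGNATURE,
    }
```

Anything that rewrites the boot code, the active flag, or a partition entry touches this same sector, which is the risk described above.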
While investigating, we were sidetracked by the content of grub.cfg. It appears that grub no longer includes a section for each possible system to be booted. Not seeing any, we thought grub.cfg was corrupt also. But that is apparently not the case. After the MBR fix, the generated grub.cfg works properly.
Normal behavior, new feature. https://fedoraproject.org/wiki/Changes/BootLoaderSpecByDefault
Chris Murphy wrote:
Not sure what would corrupt it but there is competition for LBA 0, the MBR, in that there's a bootloader portion in the first ~440 bytes and then a partition table from that point until the 512th byte. So whenever something changes a partition or a boot flag (active bit) or bootloader jump code, there's a risk. This was such a well known problem it directly affected GPT. For one, don't use LBA 0. Two, make two copies in two totally different locations. Three, checksum everything. Four, give the bootloader its own home, no sharing.
Interesting advice but it leaves me without a course of action. How does one avoid using LBA 0? Doesn't the boot loader already have its own location?
On 5/30/19 10:06 AM, CLOSE Dave wrote:
Interesting advice but it leaves me without a course of action. How does one avoid using LBA 0? Doesn't the boot loader already have its own location?
You could switch to using EFI if VMware supports that. Otherwise you can't. The BIOS loads the initial boot code from sector 0. That initial code loads the rest of the boot loader from elsewhere. Updating that initial boot loader code always carries some risk, since the partition table is also stored in that sector.
On Thu, May 30, 2019 at 11:07 AM CLOSE Dave Dave.Close@us.thalesgroup.com wrote:
Interesting advice but it leaves me without a course of action. How does one avoid using LBA 0? Doesn't the boot loader already have its own location?
It has multiple locations. One of which is the first 440 bytes of LBA 0.
We'd need to look at LBA 0 on a broken system to do an autopsy. Once it's fixed, the evidence of what stepped on it is wiped away.
On Thu, May 30, 2019 at 12:36 PM Samuel Sieb samuel@sieb.net wrote:
You could switch to using EFI if VMware supports that.
Another option is using GPT with a BIOS Boot partition. It's fairly straightforward to just convert MBR to GPT. The MBR ends up as a PMBR (protective MBR), which is a fairly generic structure; the actual partition map is in LBA 2-33, and is also duplicated at the end of the drive. So at least there's some separation of unrelated things.
I'm pretty sure VMware does support both BIOS+GPT and UEFI+GPT. But for all practical purposes, the latter is a reinstall, whereas the former can be done in place, if a little tricky because the creation of GPT has a good chance of stomping on the stage 2 bootloader in what was the MBR gap.
I'm pretty sure (been a while since I checked) that neither the bootloader nor the kernel cares if the primary GPT header or table fails checksum verification; they just use it in the blind anyway. No fallback to the backup. And no fail. That's kind of annoying.
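For reference, the header check itself is small. The following Python sketch verifies the CRC32 of a GPT header sector (normally LBA 1) using the field offsets from the UEFI specification: signature at byte 0, header size at 12, header CRC at 16. It only illustrates what "checksum verification" means here; it is not a description of what GRUB or the kernel actually does, and the function name `gpt_header_crc_ok` is made up.

```python
import struct
import zlib


def gpt_header_crc_ok(header_sector: bytes) -> bool:
    """Return True if the sector holds a GPT header with a valid CRC32."""
    if header_sector[:8] != b"EFI PART":
        return False
    # Header size (offset 12) and stored CRC (offset 16), little-endian.
    header_size, stored_crc = struct.unpack_from("<II", header_sector, 12)
    if not 92 <= header_size <= len(header_sector):
        return False
    header = bytearray(header_sector[:header_size])
    # The CRC is computed with its own field zeroed out.
    header[16:20] = b"\x00\x00\x00\x00"
    return (zlib.crc32(bytes(header)) & 0xFFFFFFFF) == stored_crc
```

A tool that wanted to fall back to the backup header would run the same check on the last LBA of the disk when this one fails.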
On Thu, May 30, 2019 at 4:00 PM Chris Murphy lists@colorremedies.com wrote:
I'm pretty sure (been a while since I checked) that neither the bootloader nor the kernel cares if the primary GPT header or table fails checksum verification; they just use it in the blind anyway. No fallback to the backup. And no fail. That's kind of annoying.
Nice. I'm right and wrong at the same time. https://bugzilla.kernel.org/show_bug.cgi?id=63591
If that behavior is still the case (more than five years after I filed the bug), the kernel will faceplant upon discovery of a corrupt primary GPT. Now, the only two ways I could have gotten that far are if GRUB didn't care about the corruption, or if it used the backup GPT. To force the kernel to use the backup, the 'gpt' boot parameter needs to be used (haha yeah that's not confusing at all, nosiree).
Chris Murphy wrote:
We'd need to look at LBA 0 on a broken system to do an autopsy. Once it's fixed, the evidence of what stepped on it is wiped away.
I have two more systems to upgrade, probably next week. If one of them fails in the same way, I'll capture the MBR and post here.
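One possible way to capture the sector before repairing it is sketched below in Python. It copies the first 512 bytes of a disk or image file to a separate file, returns a SHA-256 digest so copies can be compared later, and can print a hex dump suitable for posting. The helper names and any paths passed to them are assumptions for illustration; reading a real block device typically requires root.

```python
import hashlib


def capture_lba0(source_path: str, dest_path: str, sector_size: int = 512) -> str:
    """Copy the first sector of source_path to dest_path; return its SHA-256."""
    with open(source_path, "rb") as src:
        sector = src.read(sector_size)
    if len(sector) != sector_size:
        raise ValueError("short read: source is smaller than one sector")
    with open(dest_path, "wb") as dst:
        dst.write(sector)
    return hashlib.sha256(sector).hexdigest()


def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset-prefixed hex lines."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        lines.append(f"{off:08x}  {chunk.hex(' ')}")
    return "\n".join(lines)
```

For example, `capture_lba0("/dev/sda", "lba0-broken.bin")` (both paths hypothetical) run from a rescue environment would preserve the evidence before a new MBR is installed.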