I could use some help with ${SUBJECT}. I posted the details in discussion [1], but have yet to receive a response. I thought maybe folks on the list may have an idea.
I'm kinda lost as to where this is going wrong. Feel free to reply either on discussion or here. I'd appreciate any help I can get.
Thank you,
On 12/27/23 15:05, Sandro wrote:
I could use some help with ${SUBJECT}. I posted the details in discussion [1], but have yet to receive a response. I thought maybe folks on the list may have an idea.
I'm kinda lost as to where this is going wrong. Feel free to reply either on discussion or here. I'd appreciate any help I can get.
The missing link:
[1] https://discussion.fedoraproject.org/t/system-fails-to-boot-after-dnf-system...
On 12/27/23 15:14, pgnd wrote:
what's the output of:
mdadm -Es
cat /etc/mdadm.conf
# mdadm -Es
ARRAY /dev/md/5 metadata=1.1 UUID=39295d93:e5a75797:b72287f3:51563755 name=urras.penguinpee.nl:5
ARRAY /dev/md/1 metadata=1.1 UUID=4a2c44b5:25f2a6c9:0e7f6cae:37a8a9cc name=urras.penguinpee.nl:1 spares=1
ARRAY /dev/md/54 metadata=1.2 UUID=fb919273:c6bfb891:ea1ca83c:0a8b3ad7 name=urras.penguinpee.nl:54
# cat /etc/mdadm.conf
ARRAY /dev/md/5 metadata=1.1 name=urras.penguinpee.nl:5 UUID=39295d93:e5a75797:b72287f3:51563755
ARRAY /dev/md/54 metadata=1.2 name=urras.penguinpee.nl:54 UUID=fb919273:c6bfb891:ea1ca83c:0a8b3ad7
ARRAY /dev/md/1 metadata=1.1 spares=1 name=urras.penguinpee.nl:1 UUID=4a2c44b5:25f2a6c9:0e7f6cae:37a8a9cc
MAILADDR root
PROGRAM /usr/share/doc/mdadm/syslog-events
yours,
mdadm -Es
ARRAY /dev/md/5 metadata=1.1 UUID=aa...aa name=urras.penguinpee.nl:5
ARRAY /dev/md/1 metadata=1.1 UUID=bb..bb name=urras.penguinpee.nl:1 spares=1
ARRAY /dev/md/54 metadata=1.2 UUID=cc..cc name=urras.penguinpee.nl:54
cat /etc/mdadm.conf
ARRAY /dev/md/5 metadata=1.1 name=urras.penguinpee.nl:5 UUID=aa...aa
ARRAY /dev/md/54 metadata=1.2 name=urras.penguinpee.nl:54 UUID=cc..cc
ARRAY /dev/md/1 metadata=1.1 spares=1 name=urras.penguinpee.nl:1 UUID=bb..bb
mine,
mdadm -Es
ARRAY /dev/md/0 metadata=1.2 UUID=ee..ee name=svr041:0
ARRAY /dev/md/1 metadata=1.2 UUID=ff..ff name=svr041:1
ARRAY /dev/md/2 metadata=1.2 UUID=dd..dd name=svr041:2
cat /etc/mdadm.conf
ARRAY /dev/md0 level=raid10 num-devices=4 metadata=1.2 UUID=ee..ee name=svr041:0
ARRAY /dev/md1 level=raid10 num-devices=4 metadata=1.2 UUID=ff..ff name=svr041:1
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=1.2 UUID=dd..dd name=svr041:2
i explicitly add the
level= num-devices
i've had issues with mis-assembly long ago that this cured. whether it's STILL a problem, i don't know; all my array specs contain these, and they don't cause harm. it's now SOP here.
i'd at least consider checking ...
On 12/27/23 16:47, pgnd wrote:
i explicitly add the
level= num-devices
i've had issues with mis-assembly long ago that this cured. whether it's STILL a problem, i don't know; all my array specs contain these, and they don't cause harm. it's now SOP here.
i'd at least consider checking ...
Sure. It's worth a try. Although I've read that /etc/mdadm.conf is kinda obsolete.
If it works, I'd still wonder why md54 comes up fine. That entry is also missing `level` and `num-devices`.
If it works, I'd still wonder why md54 comes up fine. That entry is also missing `level` and `num-devices`.
1st WAG ?
/dev/md/54 is 'just' raid1, without an atypical spare
raid5 & raid1+spare are less common. whether autoassembly is smart enough to pick up the diffs without the explicit spec in /etc/mdadm.conf? again, dunno without trying.
On 12/27/23 17:10, pgnd wrote:
If it works, I'd still wonder why md54 comes up fine. That entry is also missing `level` and `num-devices`.
1st WAG ?
WAG?
/dev/md/54 is 'just' raid1, without an atypical spare
It's not. It's a raid5 without any spare, just like md5. The only difference between the two is the metadata version, 1.1 vs. 1.2. Also, the partitions making up md54 have a bunch of additional entries, like UUID_SUB, as reported by `blkid`. More details in the discussion post.
raid5 & raid1+spare are less common. whether autoassembly is smart enough to pick up the diffs without the explicit spec in /etc/mdadm.conf? again, dunno without trying.
I added `level` and `num-devices` to all entries in mdadm.conf and rebooted. It didn't change anything. Manual assembly still freezes the system as well.
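For illustration, an ARRAY entry with those fields added looks something like this (num-devices=3 here is just a placeholder; the real value matches the array):

ARRAY /dev/md/5 level=raid5 num-devices=3 metadata=1.1 name=urras.penguinpee.nl:5 UUID=39295d93:e5a75797:b72287f3:51563755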
WAG?
https://www.urbandictionary.com/define.php?term=wild%20ass%20guess
I added `level` and `num-devices` to all entries in mdadm.conf and rebooted. It didn't change anything. Manual assembly still freezes the system as well.
hm
(1) are the mods explicitly in the initrd?
(2) can you boot from a usb key with an f39 live-iso and see if the arrays assemble? if they do, it's likely your config; if they don't, the prob's in the array itself (bad metadata, out of sync, etc ...)
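for (1), a quick way to check is to print the copy of mdadm.conf that's actually inside the current initramfs, e.g.

lsinitrd -f etc/mdadm.conf

(lsinitrd -f prints a single file from the default initrd of the running kernel.)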
On 12/27/23 17:28, pgnd wrote:
WAG?
https://www.urbandictionary.com/define.php?term=wild%20ass%20guess
;) I should have guessed...
I added `level` and `num-devices` to all entries in mdadm.conf and rebooted. It didn't change anything. Manual assembly still freezes the system as well.
hm
(1) are the mods explicitly in the initrd?
Yes, they are.
(2) can you boot from a usb key with an f39 live-iso and see if the arrays assemble? if they do, it's likely your config; if they don't, the prob's in the array itself (bad metadata, out of sync, etc ...)
Well, I booted from an F38 Everything ISO on a USB stick. It didn't assemble any of the arrays. But I was able to do so manually and subsequently mount one of the logical volumes on it read-only.
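For reference, the manual steps in the live environment were essentially along these lines (VG/LV names are placeholders):

mdadm --assemble --scan --no-degraded
vgchange -ay <vgname>
mount -o ro /dev/<vgname>/<lvname> /mnt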
I'd say that pretty much rules out a failure of the arrays themselves. Two failing at the same time while a third is unaffected seems unlikely as well, since they all share the same spinning metal underneath. It's not impossible. But, if so, I'd better buy myself a lottery ticket. ;)
I'll try with an F39 live ISO as well. If nothing else, it could confirm the issue to be with/in F39.
On 12/27/23 17:36, Sandro wrote:
I'll try with an F39 live ISO as well. If nothing else, it could confirm the issue to be with/in F39.
I did that. Right after boot only one of the arrays was assembled. In /proc/mdstat it was listed as "active (auto-read-only)" and the component devices matched what is known on my system as md54.
No entry in the journal regarding any of the other arrays. No attempted assembly. Nothing.
I stopped the array and ran `mdadm --assemble --verbose --scan --no-degraded` and all arrays were assembled without any issues. In the verbose output of the command `mdadm` told me:
mdadm: No super block found on /dev/sda2 (Expected magic a92b4efc, got 6d746962)
That was repeated for all the component devices of md1 and md5. For drives/partitions that are not component devices it reported:
mdadm: No super block found on /dev/sda1 (Expected magic a92b4efc, got 00000000)
instead. In contrast, about the component devices of md54, `mdadm` had the following to report:
mdadm: /dev/sda4 is identified as a member of /dev/md/urras.penguinpee.nl:54, slot 0.
This smells very much like something is off with the version 1.1 superblock, that being the only notable difference between arrays md1 and md5 on the one hand and array md54 on the other. Yet, I have no clue as to what exactly, or what to do about it. :-\
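If anyone wants to poke at the raw devices: as far as I understand, a 1.1 superblock sits at the very start of the member device and a 1.2 superblock 4 KiB in, with the magic stored little-endian. So something like

hexdump -C -n 4 /dev/sda2           # 1.1 member, expect: fc 4e 2b a9
hexdump -C -s 4096 -n 4 /dev/sda4   # 1.2 member, expect: fc 4e 2b a9

should show whether the magic is on disk where mdadm expects it.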
ugh.
without seeing all the details, unfound superblocks aren't good.
if you were assembling with the wrong metadata format, that'd be unsurprising. but, it sounds like these _were_ working for you at some point.
if you're hoping to explore/identify/repair any damage, there's this for a good start,
https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
this too,
https://wiki.archlinux.org/title/RAID
i'd recommend subscribing and asking at,
https://raid.wiki.kernel.org/index.php/Asking_for_help
before guessing. a much better place to ask than here.
even with good help from the list, i've had mixed luck with superblock recovery -- best, when able to find multiple clean copies of backup superblocks on the array/drives. -- worst, lost it all
given the change in behavior, and the older metadata, i'd consider starting fresh. wiping the array & disks, scanning/mapping bad blocks, reformatting & repartitioning, creating new arrays with latest metadata, and restoring from backup.
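fwiw, recreating with current metadata is the usual mdadm --create, e.g. (hypothetical devices & count, adjust to your layout):

mdadm --create /dev/md/5 --level=5 --raid-devices=3 --metadata=1.2 /dev/sdX2 /dev/sdY2 /dev/sdZ2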
if you've still got good hardware, should be good -- better than the uncertainty. yup, it'll take a while. but, so might the hunt & repair process.
On 12/27/23 22:48, pgnd wrote:
without seeing all the details, unfound superblocks aren't good.
But isn't the information `mdadm --examine` prints coming from the superblock stored on the device? The magic number that this command reports matches what was expected (a92b4efc). I can access that information for each and every one of the component devices.
if you were assembling with the wrong metadata format, that'd be unsurprising. but, it sounds like these _were_ working for you at some point.
Yes, they were working right until I upgraded from f37 to f39, which was mostly (actually only) hammering the SSD, which holds the root volume.
if you're hoping to explore/identify/repair any damage, there's this for a good start,
https://raid.wiki.kernel.org/index.php/Recovering_a_damaged_RAID
this too,
https://wiki.archlinux.org/title/RAID
i'd recommend subscribing and asking at,
https://raid.wiki.kernel.org/index.php/Asking_for_help
before guessing. a much better place to ask than here.
even with good help from the list, i've had mixed luck with superblock recovery -- best, when able to find multiple clean copies of backup superblocks on the array/drives. -- worst, lost it all
Thanks. I will ask for help on the mailing list. I actually have good hope, since I'm able to manually assemble the arrays and access the data. But first I will do some reading up.
given the change in behavior, and the older metadata, i'd consider starting fresh. wiping the array & disks, scanning/mapping bad blocks, reformatting & repartitioning, creating new arrays with latest metadata, and restoring from backup.
if you've still got good hardware, should be good -- better than the uncertainty. yup, it'll take a while. but, so might the hunt & repair process.
Yeah. It's currently still at the bottom of my list. If all else fails, I guess I will have no other option. At the moment I'm a bit torn between wanting to understand what's going on and wanting to be able to use my system again.
Going through all the information, I just discovered that the oldest of the arrays is close to ten years old and one of the disks has been part of the setup right from the start (Power_On_Hours: 80952). I've replaced disks whenever the need arose, never ran into trouble until now...
Thanks for sparring! Whatever the outcome, I'll report here and on discussion.
On 12/27/23 23:41, Sandro wrote:
Thanks for sparring! Whatever the outcome, I'll report here and on discussion.
As promised, here is the outcome of a lengthy investigation. While I was on the right track right from the start, I failed to recognize it (something about moving parts, some trees and a forest).
The culprit turns out to be `blkid`, which returns incomplete information for, in my case, the RAID partitions using a version 1.1 superblock.
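For anyone wanting to compare, the difference should be visible by probing one of the 1.1 members with both tools, e.g.:

blkid -p /dev/sda2
mdadm --examine /dev/sda2

(blkid -p probes the device directly instead of using the cache.)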
For more info see:
https://bugzilla.redhat.com/show_bug.cgi?id=2249392
or, for the journey diary, see:
https://discussion.fedoraproject.org/t/system-fails-to-boot-after-dnf-system...
Cheers,
On 12/27/23 08:20, Sandro wrote:
I added `level` and `num-devices` to all entries in mdadm.conf and rebooted. It didn't change anything. Manual assembly still freezes the system as well.
You need to also update the initramfs using dracut or the modified mdadm.conf won't be available at boot.
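For example, running (as root):

dracut -f

regenerates the initramfs for the currently running kernel. You can also pass an image path and kernel version (e.g. `dracut -f /boot/initramfs-$(uname -r).img $(uname -r)`) to target a specific one.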
On 12/28/23 00:04, Samuel Sieb wrote:
On 12/27/23 08:20, Sandro wrote:
I added `level` and `num-devices` to all entries in mdadm.conf and rebooted. It didn't change anything. Manual assembly still freezes the system as well.
You need to also update the initramfs using dracut or the modified mdadm.conf won't be available at boot.
Ah, good point. I'll give that a shot tomorrow. For now I will have a good rest knowing that there's still something I haven't tried (correctly). ;)
On 12/28/23 00:47, Sandro wrote:
On 12/28/23 00:04, Samuel Sieb wrote:
On 12/27/23 08:20, Sandro wrote:
I added `level` and `num-devices` to all entries in mdadm.conf and rebooted. It didn't change anything. Manual assembly still freezes the system as well.
You need to also update the initramfs using dracut or the modified mdadm.conf won't be available at boot.
Ah, good point. I'll give that a shot tomorrow. For now I will have a good rest knowing that there's still something I haven't tried (correctly). ;)
I tried two things:
1) adding `level` and `num-devices` to the ARRAY definitions
2) adding a DEVICE section and `devices` to the ARRAY lines (see the sketch below)
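For 2), the added lines were roughly of this form (device names and count are placeholders; the UUID is the one from the existing config):

DEVICE /dev/sdX2 /dev/sdY2 /dev/sdZ2
ARRAY /dev/md/5 level=raid5 num-devices=3 metadata=1.1 name=urras.penguinpee.nl:5 UUID=39295d93:e5a75797:b72287f3:51563755 devices=/dev/sdX2,/dev/sdY2,/dev/sdZ2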
With both changes I updated initramfs and rebooted. Unfortunately, the issue remains unsolved.
I have now reached out to the linux-raid mailing list. 🤞
also make sure your drivers are in the initrd
lsinitrd | grep -Ei "kernel/drivers/md/raid"
-rw-r--r-- 1 root root 10228 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid0.ko.xz
-rw-r--r-- 1 root root 35376 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid10.ko.xz
-rw-r--r-- 1 root root 26912 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid1.ko.xz
-rw-r--r-- 1 root root 90144 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid456.ko.xz
where, here,
grep -i raid /etc/dracut.conf.d/99-local.conf
add_dracutmodules+=" mdraid "
add_drivers+=" raid0 raid1 raid10 raid456 "
On 12/27/23 17:03, pgnd wrote:
also make sure your drivers are in the initrd
lsinitrd | grep -Ei "kernel/drivers/md/raid"
-rw-r--r-- 1 root root 10228 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid0.ko.xz
-rw-r--r-- 1 root root 35376 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid10.ko.xz
-rw-r--r-- 1 root root 26912 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid1.ko.xz
-rw-r--r-- 1 root root 90144 Nov 15 19:00 usr/lib/modules/6.6.8-200.fc39.x86_64/kernel/drivers/md/raid456.ko.xz
I have raid0.ko.xz and raid456.ko.xz present in initrd.
where, here,
grep -i raid /etc/dracut.conf.d/99-local.conf
add_dracutmodules+=" mdraid "
add_drivers+=" raid0 raid1 raid10 raid456 "
On my system /etc/dracut.conf.d/ is empty and /etc/dracut.conf only contains comments. Seeing that the required modules are in initrd, I suppose I don't need to spell them out explicitly.
I suppose without the modules loaded I wouldn't be able to perform any MD operations at all.
Thanks for the pointers, anyway. It helps in understanding how all these pieces fit together.
On 12/27/23 06:05, Sandro wrote:
I could use some help with ${SUBJECT}. I posted the details in discussion [1], but have yet to receive a response. I thought maybe folks on the list may have an idea.
I'm kinda lost as to where this is going wrong. Feel free to reply either on discussion or here. I'd appreciate any help I can get.
I have a similar issue that happened when I upgraded to F38. I have a mixed raid10, which is a two-drive raid0 mirrored (raid1) with an NVMe partition. For some reason, only one of the raid0 partitions gets put in the array automatically, so the boot fails. I have to manually re-assemble the raid0 and then the rest gets automatically assembled. There's also another raid1 array that doesn't have any issues.
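The manual workaround is essentially just re-assembling the inner raid0 by hand, something like (array and device names hypothetical):

mdadm --assemble /dev/md/inner0 /dev/sda3 /dev/sdb3

after which the outer raid1 gets assembled automatically.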
I don't want to hijack this thread, so I'm not really asking for a solution at this point. I'm just wondering if there was some change starting in F38 that is causing weird raid issues like this. Maybe some sort of timing or ordering issue?
My metadata is all 1.2, so it's not that.
On 12/27/23 15:12, Samuel Sieb wrote:
On 12/27/23 06:05, Sandro wrote:
I could use some help with ${SUBJECT}. I posted the details in discussion [1], but have yet to receive a response. I thought maybe folks on the list may have an idea.
I'm kinda lost as to where this is going wrong. Feel free to reply either on discussion or here. I'd appreciate any help I can get.
I have a similar issue that happened when I upgraded to F38.
I was wrong about the version. It must have started with F36 or F37 because it's currently on F38 and I was hoping the upgrade to F38 would fix it. So probably not related to the issue of this thread.
On Wed, 27 Dec 2023 15:05:29 +0100 Sandro lists@penguinpee.nl wrote:
I could use some help with ${SUBJECT}. I posted the details in discussion [1], but have yet to receive a response. I thought maybe folks on the list may have an idea.
I'm kinda lost as to where this is going wrong. Feel free to reply either on discussion or here. I'd appreciate any help I can get.
This may be completely unrelated, but I had a similar experience when installing F39 on one of my machines with the OS on SSD and data on spinning rust with MDRAID volumes.
I'm in the habit of just specifying my two SSDs as devices to use in the installer and then configuring everything else (mostly using ansible) post-install. When I booted the new F39 (server) installation and tried to set up the LVM arrays, most of them couldn't be found. This turned out to be because the file /etc/lvm/devices/system.devices was present, and contained only the devices I had specified in the installer. This seemed to be new behaviour in Fedora 39.
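A quick way to check whether the devices file is what's restricting LVM (assuming an lvm2 build with the devices-file feature) is to look at what it currently lists:

cat /etc/lvm/devices/system.devices
lvmdevices

If the MD devices aren't listed there, LVM simply won't use them.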
The fix for this was:

# vgimportdevices -a
# vgchange -ay
I was then able to access my MDRAID volumes.
Hopefully this can help somebody.
Regards, Paul.