Check your /etc/default/grub if you use raid 1.

Sam Varshavchik mrsam at courier-mta.com
Sun Jul 29 14:42:26 UTC 2012


Bruno Wolff III writes:

> On Sun, Jul 29, 2012 at 10:02:00 -0400,
>   Sam Varshavchik <mrsam at courier-mta.com> wrote:
>> There's a long-standing combination of two bugs: the list of rd.md.uuid
>> boot parameters generated by anaconda for /etc/default/grub may not
>> include the raid uuid of non-stock partitions like /home; and although
>> the initramfs init script autodiscovers all raid volumes present,
>> sometimes (not always, I'll estimate 5% of the time) if a uuid is not
>> enumerated in the boot parameters, one of the drives in the raid 1
>> volume may not get assembled at boot.
>
> My raid info is in /etc/mdadm.conf and that is what gets used by dracut when
> building an initramfs as far as I can tell.

All I know is that in F16 I discovered that a raid 1 volume whose uuid does
not get enumerated in the rd.md.uuid kernel boot parameters will come up
with one drive not in the array, maybe 5% of the time. I wasn't the only one
affected: another list member reported the same bug, and confirmed that
putting the uuid back into grub.cfg and /etc/default/grub fixed it.
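
For anyone else hitting this, the workaround is roughly the following; the
uuid and device name below are placeholders, substitute your own:

    # Find the array's uuid:
    mdadm --detail /dev/md0 | grep UUID

    # Append it to GRUB_CMDLINE_LINUX in /etc/default/grub, something like:
    #   GRUB_CMDLINE_LINUX="... rd.md.uuid=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd"

    # Then regenerate grub.cfg (on a BIOS install):
    grub2-mkconfig -o /boot/grub2/grub.cfg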

>> There's probably a third bug in here: mdmonitor should've mailed me when an  
>> array came up degraded at boot (I suspect that because mdmonitor gets  
>> started so early in the boot process, not all the moving pieces are there  
>> for mail delivery to happen). Eventually, you'll boot again with both drives  
>> in the array somehow, except they'll be out of sync, resulting in massive  
>> corruption. If you're lucky, you'll boot just with the other drive, and  
>> discover that your filesystem's contents are weeks/months out of date, and  
>> maybe you'll be lucky enough to figure out what happened, and switch back
>> to the other drive and resync. But not everyone's so lucky.
>
> That doesn't sound right. You might come up using the incorrect raid member,
> but you shouldn't come up with two out-of-sync drives. (Maybe this could happen
> with some non-default setups, where the elements aren't labelled.)

According to mdadm --detail, I have a "Name" label on it.
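
For reference, that's the field I'm looking at in the --detail output (the
device name here is just a placeholder):

    mdadm --detail /dev/md127 | grep -E 'Name|UUID'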

All I know is that I spent half a day wondering why, every time I fscked
this partition, I found more corruption. The other half of the day was spent
resyncing the volume, after I figured out that the drives were not in sync.

And fail/remove/add did not resync the drive, because the volume uses an
internal write-intent bitmap: oh, the newly-added drive has a valid bitmap,
apparently from the same volume, so let's add this drive back without
resyncing it!

That, I think, is a bug. Failing a drive should zero its superblock, to
force a real resync if it gets added back to the array.
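
Roughly, the sequence I have in mind looks like this; device names are
placeholders, and this is a sketch rather than exactly what I ran:

    mdadm /dev/md0 --fail /dev/sdb1
    mdadm /dev/md0 --remove /dev/sdb1
    # Wipe the md metadata, including the internal bitmap, so that re-adding
    # the drive forces a full resync instead of a bitmap-based re-add:
    mdadm --zero-superblock /dev/sdb1
    mdadm /dev/md0 --add /dev/sdb1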

> There was a recent bug with raid arrays that could result in some elements
> failing when shutting down. It doesn't directly corrupt the data though.
> There is information about this bug here:
> http://neil.brown.name/blog/20120615073245

Since F17 is on 3.4, and this seems to indicate that only some 3.2 and 3.3
kernels might have an issue, it doesn't look like this is related.
