On Thu, Nov 28, 2019 at 2:30 AM Kamil Paral <kparal@redhat.com> wrote:
On Wed, Nov 27, 2019 at 9:17 PM Chris Murphy <lists@colorremedies.com> wrote:
A fair point about VM testing is whether the disk cache mode affects the outcome. I use unsafe because it's faster and I embrace misery. I think QA bots are now mostly using unsafe because it's faster, too. So depending on the situation, certain corruptions may be expected if unsafe is used, but I *think* unsafe is only unsafe if the host crashes or loses power. I do forced power-offs of VMs all the time and never lose anything: with ext4 and XFS, journal replay always makes the file system consistent again. And journal replay in that example is expected, not a bug.
By "that example", do you mean the story you just described, or the "bad result example" from the test case?
The story, which is too verbose and also confusing, so just ignore it. :-D
Because in that test case example, if the machine was correctly powered off/rebooted, there should be no reason to replay the journal or see dirty bits set.
That should be true, yes.
How to test, steps 2 and 3: This only applies to FAT and ext4. XFS and Btrfs have no fsck; both depend on log replay after an unclean shutdown. Also, there are error messages for common unclean shutdowns, and error messages for uncommon problems. I think we only care about the former, correct?
I believe so. Is there a tool that could tell us whether the currently mounted drives were mounted cleanly, or whether some error correction had to be performed? Because this is quickly getting into territory where we will need to provide a large number of examples and rely on people parsing the output, comparing, and (mis-)judging. The wording in the journal output can also change at any time. And I don't really like that.
FAT, ext4, and XFS all have a kind of "dirty bit" set upon mount. It's removed when cleanly unmounted. Therefore if the file system isn't mounted, but the "dirty bit" is set, it can be assumed it was not cleanly unmounted. Both kernel code and each file system's fsck can detect this, and the message you see depends on which discovers the problem first. The subsequent messages about how this problem is handled, I think we can ignore. As you say, it will be variable. All we care about is the indicator that it was not properly unmounted. Here are those indicators for each file system:
FAT fsck (since /etc/fstab sets the EFI system partition's fs_passno to 2, this is what's displayed for default installations):
  Nov 28 12:04:21 localhost.localdomain systemd-fsck[681]: 0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.

FAT kernel:
  [  205.317346] FAT-fs (vdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

ext4 fsck (since /etc/fstab sets fs_passno to 1 or 2 for /, /boot, and /home, this is what's displayed for default installations):
  Nov 28 12:07:21 localhost.localdomain systemd-fsck[681]: /dev/vdb2: recovering journal

ext4 kernel:
  [  316.756778] EXT4-fs (vdb2): recovery complete

XFS kernel (since /etc/fstab sets fs_passno to 0 for /, we should only see this message with default installations):
  [  372.027026] XFS (vdb3): Starting recovery (logdev: internal)
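As an aside, the ext4 "dirty" state can also be inspected offline with dumpe2fs from e2fsprogs, without waiting for a boot-time message (fsck.fat -n and xfs_repair -n can play a similar check-only role for FAT and XFS). A minimal sketch, using a scratch image file so it runs without root; on a real system you would point dumpe2fs at the actual partition, e.g. /dev/vdb2:

```shell
#!/bin/sh
# Sketch: check ext4's unmount state with dumpe2fs (e2fsprogs).
# A scratch image stands in for a real partition so no root is needed;
# the size and paths here are illustrative, not from the test case.
export PATH="$PATH:/usr/sbin:/sbin"
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 status=none
mkfs.ext4 -q -F "$img"
# "Filesystem state: clean" = properly unmounted;
# "not clean" corresponds to the dirty bit discussed above.
dumpe2fs -h "$img" 2>/dev/null | grep 'Filesystem state:'
rm -f "$img"
```

A freshly created (never mounted) filesystem reports "clean"; after a forced power-off of a mounted volume, the same command would show "not clean" until the journal is replayed.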
If the test case is constrained to default installations only, the messages to test for are: "0x41: Dirty bit is set", "recovering journal", and "XFS" together with "Starting recovery".
If the test case is broader, to account for non-default additional volumes that may not be in fstab or may not have fs_passno set, also include: "EXT4-fs" together with "recovery complete", and "FAT-fs" together with "Volume was not properly unmounted".
In each case I'm choosing the first message that indicates a previously unclean shutdown. Whether fsck or kernel messages, they should be fairly stable; I'm not expecting them to change multiple times per year. The gotcha is: how would we know? If the messages do change, failing to parse for them will look exactly like a clean shutdown. *shrug*
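To make that concrete, here's a rough sketch of what such a parse could look like: grep the journal for the indicator strings quoted above. The pattern list is my own summary of those messages and would need updating if they ever change (which is exactly the gotcha):

```shell
#!/bin/sh
# Sketch: scan journal text on stdin for the unclean-shutdown indicators
# quoted above. The pattern list is an assumption mirroring those messages.
scan_journal() {
    grep -E -q 'Dirty bit is set|recovering journal|Starting recovery|recovery complete|Volume was not properly unmounted'
}

# On a real system: journalctl -b -1 | scan_journal && echo "unclean shutdown"
# Demo with a canned message line:
if printf 'EXT4-fs (vdb2): recovery complete\n' | scan_journal; then
    echo "unclean shutdown indicators found"
else
    echo "no indicators (or the messages changed under us)"
fi
```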
Steps 4-7: I'm not following the purpose of these steps. What I'd like to see for step 4 is: if we get a bad result (any of the "result 2" messages), we collect the journal for the prior boot (`sudo journalctl -b -1 > journal.log`) and attach it to a bug report. Or we could maybe parse for systemd messages suggesting it didn't get everything unmounted, but offhand I don't know what those messages would be; I'd have to go dig into the systemd code to find them.
I think the purpose is to verify that both reboot and poweroff shut down the system correctly without any filesystem issues (which means fully committed journals and no dirty bits set).
Gotcha. Yeah, I think it's reasonable to test the LiveOS reboot as well as the installed system's reboot, to make sure they are both properly unmounting file systems.