Re: For today's agenda item - Proposed disk unmount test case

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Thu, 28 Nov 2019 12:28:34 -0700

On Thu, Nov 28, 2019 at 2:30 AM Kamil Paral <kparal@xxxxxxxxxx> wrote:
>
> On Wed, Nov 27, 2019 at 9:17 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>>
>> A fair point about VM testing is whether the disk cache mode affects
>> the outcome. I use unsafe because it's faster and I embrace misery. I
>> think QA bots are now mostly using unsafe because it's faster too. So
>> depending on the situation it may be true that certain corruptions are
>> expected if unsafe is used, but I *think* unsafe is only unsafe in the
>> event the host crashes or experiences a power failure. I do forced
>> power offs of VMs all the time and never lose anything, in the case of
>> ext4 and XFS, journal replay always makes the file system consistent
>> again. And journal replay in that example is expected, not a bug.
>
>
> By "that example", do you mean the story you just described, or the "bad result example" from the test case?

The story, which is too verbose and also confusing, so just ignore it. :-D

> Because in that test case example, if the machine was correctly powered off/rebooted, there should be no reason to replay the journal or see dirty bits.

That should be true, yes.

>
>>
>>
>> How to test, step 2 and 3:
>> This only applies to FAT and ext4. XFS and Btrfs have no fsck, both
>> depend on log replay if there was an unclean shutdown. Also, there are
>> error messages for common unclean shutdowns, and error messages for
>> uncommon problems. I think we only care about the former, correct?
>
>
> I believe so. Is there a tool that could tell us whether the currently mounted drives were mounted cleanly, or some error correction had to be performed? Because this is quickly getting into territory where we will need to provide a large amount of examples and rely on people parsing the output, comparing and (mis-)judging. The wording in journal output can change any time as well. And I don't really like that.

FAT, ext4, and XFS all have a kind of "dirty bit" set upon mount. It's
removed when cleanly unmounted. Therefore if the file system isn't
mounted, but the "dirty bit" is set, it can be assumed it was not
cleanly unmounted. Both kernel code and each file system's fsck can
detect this, and the message you see depends on which discovers the
problem first. The subsequent messages about how this problem is
handled, I think we can ignore. As you say, it will be variable. All
we care about is the indicator that it was not properly unmounted.
Here are those indicators for each file system:

FAT fsck (since /etc/fstab sets EFI system partition fs_passno to 2,
this is what's displayed for default installations)
Nov 28 12:04:21 localhost.localdomain systemd-fsck[681]: 0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.

FAT kernel
[  205.317346] FAT-fs (vdb1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

ext4 fsck (since /etc/fstab sets /, /boot, /home fs_passno to 1 or 2,
this is what's displayed for default installations)
Nov 28 12:07:21 localhost.localdomain systemd-fsck[681]: /dev/vdb2: recovering journal

ext4 kernel
[  316.756778] EXT4-fs (vdb2): recovery complete

XFS kernel (since /etc/fstab sets / fs_passno to 0, we should only see
this message with default installations)
[  372.027026] XFS (vdb3): Starting recovery (logdev: internal)

If the test case is constrained only to default installations, the
messages to test for:
"0x41: Dirty bit is set"
"recovering journal"
"XFS" and "Starting recovery"

If the test case is broader, to account for non-default additional
volumes that may not be listed in fstab or may not have fs_passno set,
also include:
"EXT4-fs" and "recovery complete"
"FAT-fs" and "Volume was not properly unmounted"

In each case I'm choosing the first message that indicates a previous
unclean shutdown. Whether fsck or kernel message, they should be
fairly stable; I'm not expecting them to change multiple times per
year. The gotcha is, how would we know? If the wording does change,
automated parsing will miss the message and wrongly report a clean
shutdown. *shrug*
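
To make the matching rules above concrete, here is a minimal sketch in
Python. It assumes the log text is already in a string, one message per
line, as journalctl prints it. The function and dictionary names are
hypothetical, made up for illustration; only the quoted substrings come
from the messages listed above.

```python
# Hypothetical helper: names and structure are illustrative only.
# Each indicator is a tuple of substrings that must ALL appear on one log line.

# Messages expected on default installations.
DEFAULT_INDICATORS = {
    "FAT fsck": ("0x41: Dirty bit is set",),
    "ext4 fsck": ("recovering journal",),
    "XFS kernel": ("XFS", "Starting recovery"),
}

# Extra messages for volumes not in fstab or without fs_passno set.
EXTRA_INDICATORS = {
    "ext4 kernel": ("EXT4-fs", "recovery complete"),
    "FAT kernel": ("FAT-fs", "Volume was not properly unmounted"),
}


def find_unclean_indicators(log_text, include_extra=False):
    """Return (label, line) pairs for every log line matching an indicator."""
    indicators = dict(DEFAULT_INDICATORS)
    if include_extra:
        indicators.update(EXTRA_INDICATORS)
    hits = []
    for line in log_text.splitlines():
        for label, needles in indicators.items():
            if all(needle in line for needle in needles):
                hits.append((label, line.strip()))
    return hits
```

The input could be the output of `journalctl -b` from the boot that
follows the reboot or poweroff under test. An empty result only means
none of these exact strings were found; as noted above, it cannot tell
a clean shutdown apart from a reworded message.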

>> Steps 4-7: I'm not following the purpose of these steps. What I'd like
>> to see for step 4, is, if we get a bad result (any result 2 messages),
>> we need to collect the journal for the prior boot: `sudo journalctl
>> -b-1 > journal.log` and attach to a bug report; or we could maybe
>> parse for systemd messages suggesting it didn't get everything
>> unmounted. But offhand I don't know what those messages would be, I'd
>> have to go dig into systemd code to find them.
>
>
> I think the purpose is to verify that both reboot and poweroff shut down the system correctly without any filesystem issues (which means fully committed journals and no dirty bits set).

Gotcha. Yeah, I think it's reasonable to test the LiveOS reboot as
well as the installed system's reboot, to make sure they are both
properly unmounting file systems.

-- 
Chris Murphy