Re: Help wanted with tricky potential F41 blocker

Eric Sandeen <sandeen@xxxxxxxxxx> · Fri, 18 Oct 2024 17:10:09 -0500

On 10/18/24 4:42 PM, Adam Williamson wrote:
> Hey folks! I'm sending up a flare for help with a potential F41 blocker
> that looks pretty tricky. It is
> https://bugzilla.redhat.com/show_bug.cgi?id=2318710 .
> 
> The problem is fairly easy to reproduce. Install Fedora 40 or 41 Beta
> with an ext4 root partition, take a snapshot (for convenience in
> testing), then do an offline upgrade to current F41 (or offline update
> any one of a specific list of packages that triggers the issue, which
> Kamil Paral worked out - see
> https://bugzilla.redhat.com/show_bug.cgi?id=2318710#c14 ). On the boot
> after the offline upgrade runs, you'll drop to emergency mode, with the
> system complaining about 'ext4 bad orphan inode' issues. But if you
> just reboot from this state, the system will then boot up fine.
> 
> This only seems to happen on ext4, it's not affecting installs to xfs
> or btrfs. But we suspect there are still quite a few people out there
> with their root partition on ext4, so we're worried this might have to
> block the release.
> 
> It's a pretty odd bug. We can't see anything much in common between the
> packages that trigger it - no files in weird places, no odd scripts.
> The failure case itself is pretty weird. Fabio had a good theory that
> it might be caused by the rpm-plugin-ima package, but sadly testing I
> did today indicates that is not the case.
> 
> If anyone has any bright ideas what might be going on here, please do
> reply or add them to the bug! Thanks.

Hm, for starters, from the bug:

> The logs contain:
> 
> systemd-fsck[489]: /dev/vda3: recovering journal
> systemd-fsck[489]: /dev/vda3: Clearing orphaned inode 295083 (uid=0, gid=0, mode=0100755, size=60800)
...

Why does the root filesystem require recovery at all? Why was root not
cleanly unmounted / remounted readonly on the prior reboot? Might be worth
looking at the reboot logs before this boot error.

But then ...

> kernel: EXT4-fs (vda3): orphan cleanup on readonly fs
> kernel: EXT4-fs error (device vda3): ext4_orphan_get:1421: comm mount: bad orphan inode 295083
> kernel: ext4_test_bit(bit=170, block=1048596) = 0

Ok, well, fsck *just said* it had cleared that inode. :/ 
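
For anyone following along, here is a rough sketch of how the numbers in the two log excerpts relate. The inodes-per-group value of 8192 is an assumption, not something from the bug; it is simply a value consistent with the reported bit=170, and the real figure should be read from `dumpe2fs -h` on the affected filesystem. The helper name `locate_inode` is made up for illustration.

```python
import stat

# Values copied from the log lines quoted above.
INODE = 295083
MODE = 0o100755
REPORTED_BIT = 170

# ASSUMPTION: not in the logs. Read the real value from
# `dumpe2fs -h <device>` ("Inodes per group").
INODES_PER_GROUP = 8192


def locate_inode(ino, inodes_per_group):
    """Return (block group, bit index in that group's inode bitmap)."""
    # ext4 inode numbers start at 1, so subtract 1 before dividing.
    return divmod(ino - 1, inodes_per_group)


group, bit = locate_inode(INODE, INODES_PER_GROUP)
print(f"inode {INODE}: group {group}, bitmap bit {bit}")
print(f"mode {oct(MODE)}: {stat.filemode(MODE)}")
print(f"matches kernel's bit={REPORTED_BIT}: {bit == REPORTED_BIT}")
```

Under that assumption the inode lands in group 36 at bit 170, and the mode decodes to a regular file with permissions rwxr-xr-x. The kernel's "= 0" says that bit is clear in the inode bitmap, i.e. the inode is marked free, which fits fsck having cleared it while something still lists it as an orphan.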

> Could the issue lie with pk-offline-update? Seems like it is rebooting too quickly
> after the packages are updated; before the filesystem is stable.

Not sure what that means, but it hints at my question above: why is the journal being
replayed, and why was the root fs not quiesced on the reboot?

> Of course the journal shouldn't need to be recovered in the first place

correct ...

Can anyone get a metadata image (e2image -Q /path/to/root/device image.qcow2) post-upgrade,
before reboot tries to run fsck and fix things?

I can try to reproduce this myself, but if it's easy for anyone else, please make that
e2image, compress it, and attach it to the bug if it fits.
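
A minimal sketch of that capture-and-compress step, in case it helps whoever reproduces this. The function names are hypothetical, and using xz (via Python's lzma module) is my assumption; the request above only says to compress the image. Running e2image needs root and read access to the device.

```python
import lzma
import shutil
import subprocess


def e2image_argv(device, image):
    """Build the metadata-only QCOW2 capture command quoted above."""
    return ["e2image", "-Q", device, image]


def compress_xz(src, dst=None):
    """Write an xz-compressed copy of src and return its path."""
    dst = dst or src + ".xz"
    with open(src, "rb") as fin, lzma.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def capture(device, image):
    """Run e2image (needs root), then compress the result."""
    subprocess.run(e2image_argv(device, image), check=True)
    return compress_xz(image)
```

Something like `capture("/dev/vda3", "image.qcow2")` would then leave `image.qcow2.xz` ready to attach; the device name here is just the one from the logs, so substitute the actual root device.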

The upgrade failing to reboot the system properly (leaving the root fs in a state that
needs recovery) may be the core problem here. That said, fsck and/or log recovery
should still yield a consistent filesystem even in that case, and apparently they do not.

Thanks,
-Eric
