Re: Fedora 27 kernel updates make system unbootable (sort of)

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Fri, 20 Apr 2018 16:05:09 -0600

On Fri, Apr 20, 2018 at 3:32 PM, Lennart Poettering
<mzerqung@xxxxxxxxxxx> wrote:
> On Fr, 20.04.18 12:20, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:
>
>> I'm honestly mystified why the plymouth commit hasn't been reverted in
>> the interim. But I'm also mystified why the bootloader folks don't
>> give a shit to commit their configuration files to disk when they know
>> they can't do journal replay and have known that for 20 years. But
>> then I'm also mystified why systemd developers won't fallback to
>> freeze/thaw if rootfs remount-ro fails three times, instead of just
>> giving up and forcing a reboot, they did discuss doing this a year ago
>> and then poof, no action.
>
> Quite frankly, if you want to put the blame somewhere, I'd probably
> place it with the xfs folks? I mean, there's a well-defined API on
> linux for syncing a file system to disk so that it is in a clean
> state, it's called sync(). Turns out that doesn't work though, it
> doesn't actually do that.

No, we already went over this a year ago. That well-defined API
predates journaled file systems; it's only a guarantee on
non-journaled file systems. Requiring this for journaled file systems
would break them by making them insanely slow for every fsync. It
would totally obviate the whole point of journaling.

> systemd calls that API during shutdown if it is unable to unmount some
> file system because the kernel refuses it. sysvinit did it like
> that. Upstart did it that way. *Everybody* else does it that way, too.

The systemd sync() is sufficient for pretty much everything except the
bootloader stuff. Log replay during boot will make everything
consistent again. So while I appreciate the idea of systemd doing a
freeze/thaw and I'm not totally opposed to the idea, I think the
burden here is really on grubby and GRUB first and foremost because
they are the ones who can't depend on either fsck or log replay for
their changes to be properly on disk.

The alternative, as Eric Sandeen suggests in the bug report, is a
compulsory /boot on ext2 (or maybe ext4 with journal disabled) which
then makes sync() fully flush to stable media, as there is no log.
*shrug* We do this weird thing with persistently mounted /boot and
this exposes that file system to unclean shutdowns too, so now there's
an obligatory fsck which the bootloader can't do either. I'm not
completely certain that a no-log file system on /boot is really the
way forward, not least because it amounts to "we need you to
reinstall your system to fix this problem, which by the way also
requires a new partitioning layout from what you've been using".

> There's FIFREEZE/FITHAW in xfs. It has a very different purpose from
> syncing. It does considerably more: it also stops all further I/O to
> the file system. That is seriously awful, if the process invoking it
> actually runs from that very file system. But sure, we can pretend
> this wasn't an issue and call mlockall() first (which we do), and
> immediately FITHAW after FIFREEZE. Besides obviously being brittle and
> ugly, will that be sufficient?

Nope, because a crash, a hang, or a user-forced power off can happen
well before the presumably last-second freeze/thaw that systemd would
issue.

> I am not sure. Let's not forget the
> actual trigger of the issue: plymouth is running but its binary was
> updated while it already was. The old file is now deleted but remains
> pinned as long as plymouth is running. Basically as long as plymouth
> is running the file system will have operations pending, hence it is
> likely to get dirtied pretty soon again after that FIFREEZE/FITHAW
> dance...

Fair point. But I also note that plymouth is exempting itself from
being clobbered by systemd, contrary to this guidance:

Again: if your code is being run from the root file system, then this
logic suggested above is NOT for you. Sorry. Talk to us, we can
probably help you to find a different solution to your problem.
https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/

And while that is an instigator in this problem, I still don't think
logically this absolves grubby and GRUB from fully committing their
changes before reboot is even called.

> But then there's the other thing: in order to call FIFREEZE/FITHAW we
> need to open() an fd on it. Which means actually doing disk accesses
> (including possibly enqueuing write accesses due to atime) actually,
> and that's something we currently try hard to avoid, because doing
> that on file systems that aren't healthy anymore means deadlocks. In
> particular for network backed file systems this actually matters: for
> them we want to issue an umount() or remount syscall, and only that,
> we never want to actually access the file system, because the network
> is very likely already down or otherwise unavailable. Now you might
> say "but xfs is not a network file system!". That's not true
> unfortunately, iscsi, nbd and all that other stacked jumble means
> everything is a network file system these days.
>
> The sync() syscall doesn't suffer from this issue. Yes, it will of
> course trigger disk I/O too, that's its whole point after
> all. However, it will only do so on file systems that are known dirty, and
> won't generate new I/O on its own.
>
> There was a thread about this a while back on systemd-devel:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038615.html
> (and around there). Back then the open() issue wasn't clear to me,
> this only came up recently when we worked on some NFS-related umount
> work (specifically: current systemd will now implement in userspace a
> time-out around mount() and sync() because of the general flakiness of
> that interface).
>
> Hence I am pretty sure FIFREEZE is really not useful. It has a
> different purpose, and by using it we might make things nicer for some
> but much worse for many others. In that thread I indicated I'd merge a
> patch that adds it. At this point I am convinced that that would just
> be a game of whack-a-mole, and we'd just make things worse on
> networking fs...

OK.
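
For reference, the userspace time-out around sync() mentioned above
could look roughly like the following. This is only a sketch of the
general idea (do the sync in a child, give up on it after a deadline),
not systemd's actual implementation; the function name
sync_with_timeout and its parameters are made up for illustration.

```python
import os
import signal
import time


def sync_with_timeout(timeout_s, sync_fn=os.sync, poll_s=0.01):
    """Run sync_fn in a forked child; return True if it finished in time.

    Hypothetical helper. The parent never calls sync itself, so it cannot
    get stuck on a dead network-backed file system.
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the potentially blocking call, then exit immediately
        # without running any of the parent's cleanup handlers.
        try:
            sync_fn()
        finally:
            os._exit(0)

    deadline = time.monotonic() + timeout_s
    while True:
        reaped, _status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return True  # child exited before the deadline
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)

    # Deadline passed. SIGKILL will not take effect while the child sits in
    # uninterruptible I/O wait, so we deliberately do not block on waitpid().
    os.kill(pid, signal.SIGKILL)
    return False
```

Note that a True result only says the call returned in time. It says
nothing about whether a bootloader that can't replay the log will see
the changes, which is the actual problem here.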

>
> Hence, I am pretty sure that xfs should fix their implementation of
> sync(). I'd also be fine with calling some other API for this if they
> really don't want to fix sync() — as long as it is generic, and not
> some xfs specific hack. The key really here is that the API should
> actually do what is needed here, i.e. no pausing of IO or so. And it
> should not require us to open an fd on the file system in question.

From my conversations with fs devs, this is not an XFS problem. It can
happen on ext4 as well; ext4 just seems to flush its log a lot faster
than XFS. And what I recall about Btrfs is that it manifests
differently: the bootloader ends up loading a stale bootloader
configuration since everything is copy on write, and thus it boots the
previous kernel. During that boot the log tree updates the fs metadata,
so the second boot uses the new bootloader config and the new kernel
is booted. I have no idea how f2fs behaves.

Anyway, I remain unconvinced this is an XFS specific problem, it's
just a lot easier to reproduce on XFS.

> That all said: I figure plymouth should be changed to start in the
> initrd and then stick around for good, and never be updated/replaced
> by any binary from the host system. That way it can't and won't keep
> any files pinned from the host fs, if it's updated, as it will be
> purely backed by the initrd file system. We generally require this
> from storage tools, and plymouth should do the same.

That would fix the problem in the broad case, because it would allow
systemd remount-ro to succeed, which then fully flushes /boot changes
to disk rather than just to the log. But in the narrower case of
crash, hang, or forced poweroff in between /boot modification and
successful remount-ro it actually doesn't help. The user is screwed.
So I still think that the last thing to modify /boot is duty bound to
commit changes to disk - even if it has to hang for 1 minute, which
would allow the file system to commit changes on its own.

I have no idea whether what you've learned about freeze/thaw
limitations applies to grubby, grub-mkconfig, and rpm-ostree doing it.
Right now a part of grubby, new-kernel-pkg, calls freeze/thaw, but only
on PPC64LE. So extending this to happen on all archs whenever /boot is
on XFS, ext4, or Btrfs seems a lot easier than most any other single
thing I've come across so far.
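
To make that concrete, here is a rough sketch of what "freeze/thaw
/boot after modifying it, on any arch" might look like. It is not what
new-kernel-pkg does today; the helper names (fstype_of,
needs_freeze_thaw, freeze_thaw) and the file system list are my own
assumptions. The ioctl numbers are the FIFREEZE/FITHAW values from
<linux/fs.h>. The freeze itself needs root, and it should only be
pointed at a separate /boot mount, never at a file system the caller
is running from.

```python
import fcntl
import os

# FIFREEZE / FITHAW from <linux/fs.h>: _IOWR('X', 119, int), _IOWR('X', 120, int)
FIFREEZE = 0xC0045877
FITHAW = 0xC0045878

# Assumption: the file systems named in this thread as affected.
JOURNALED = {"xfs", "ext4", "btrfs"}


def fstype_of(mount_point, mounts_text):
    """Return the fs type of the last entry mounted at mount_point, or None.

    mounts_text is the content of /proc/self/mounts (device, mount point,
    fs type, options, ...). The last matching line wins because later
    mounts shadow earlier ones.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            found = fields[2]
    return found


def needs_freeze_thaw(mount_point, mounts_text):
    """True if mount_point is a separate mount on a journaled/CoW fs."""
    return fstype_of(mount_point, mounts_text) in JOURNALED


def freeze_thaw(mount_point):
    """Freeze and immediately thaw the file system at mount_point.

    Requires root. If FIFREEZE fails (already frozen, unsupported), the
    OSError propagates and no thaw is attempted.
    """
    fd = os.open(mount_point, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

The caller would read /proc/self/mounts once the bootloader
configuration has been written, and call freeze_thaw("/boot") only if
needs_freeze_thaw("/boot", ...) says so. Note this still has to open()
an fd on the file system, which is exactly the limitation raised
above; that seems tolerable for a local /boot at kernel install time,
but I can't say it holds for every setup.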

-- 
Chris Murphy