Re: Fedora 27 kernel updates make system unbootable (sort of)

Lennart Poettering <mzerqung@xxxxxxxxxxx> · Fri, 20 Apr 2018 23:32:08 +0200

On Fr, 20.04.18 12:20, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:

> I'm honestly mystified why the plymouth commit hasn't been reverted in
> the interim. But I'm also mystified why the bootloader folks don't
> give a shit to commit their configuration files to disk when they know
> they can't do journal replay and have known that for 20 years. But
> then I'm also mystified why systemd developers won't fallback to
> freeze/thaw if rootfs remount-ro fails three times, instead of just
> giving up and forcing a reboot, they did discuss doing this a year ago
> and then poof, no action.

Quite frankly, if you want to put the blame somewhere, I'd probably
place it with the xfs folks? I mean, there's a well-defined API on
linux for syncing a file system to disk so that it is in a clean
state, it's called sync(). Turns out that doesn't work though, it
doesn't actually do that.

systemd calls that API during shutdown if it is unable to unmount some
file system because the kernel refuses it. sysvinit did it like
that. Upstart did it that way. *Everybody* else does it that way, too.
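
As a rough sketch of that shutdown fallback (illustrative Python, not
systemd's actual C code; `unmount_or_sync` and its two callable
parameters are made-up names for this example):

```python
import os


def unmount_or_sync(unmount, sync=os.sync):
    """Try to unmount; if the kernel refuses, fall back to sync().

    `unmount` is a hypothetical callable that raises OSError when the
    kernel refuses the unmount (for example with EBUSY). `sync` defaults
    to os.sync(), which wraps the sync() syscall and needs no fd on the
    file system.
    """
    try:
        unmount()
    except OSError:
        # The file system stays mounted, so at least ask the kernel to
        # write out whatever is dirty.
        sync()
        return "synced"
    return "unmounted"
```

The point of the sketch is only the ordering: unmount first, and sync()
as the generic fallback when that fails.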

There's FIFREEZE/FITHAW in xfs. It has a very different purpose from
syncing. It does considerably more: it also stops all further I/O to
the file system. That is seriously awful, if the process invoking it
actually runs from that very file system. But sure, we can pretend
this wasn't an issue and call mlockall() first (which we do), and
immediately FITHAW after FIFREEZE. Besides obviously being brittle and
ugly, will that be sufficient? I am not sure. Let's not forget the
actual trigger of the issue: plymouth is running, but its binary was
updated while it was running. The old file is now deleted but remains
pinned as long as plymouth is running. Basically as long as plymouth
is running the file system will have operations pending, hence it is
likely to get dirtied pretty soon again after that FIFREEZE/FITHAW
dance...
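
For reference, the FIFREEZE/FITHAW dance would look roughly like the
sketch below. It is illustrative Python, not a proposed patch:
`freeze_then_thaw` is a made-up name, the request numbers are the
<linux/fs.h> values as encoded on x86 and ARM (other architectures may
encode them differently), the mlockall() step is left out, and the real
ioctls need root.

```python
import fcntl
import os

# _IOWR('X', 119, int) and _IOWR('X', 120, int) from <linux/fs.h>,
# assuming the common (x86/ARM) ioctl number encoding.
FIFREEZE = 0xC0045877
FITHAW = 0xC0045878


def freeze_then_thaw(mount_point, ioctl=fcntl.ioctl):
    """Freeze the file system at mount_point, then thaw it right away.

    `ioctl` is injectable only so the sequence can be exercised without
    privileges; by default the real fcntl.ioctl is used.
    """
    # This open() is already an access to the file system in question.
    fd = os.open(mount_point, os.O_RDONLY | os.O_CLOEXEC)
    try:
        # Brings the fs to a clean on-disk state and blocks further writes.
        ioctl(fd, FIFREEZE, 0)
        # Thaw immediately, since the caller may be running from this fs.
        ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

Nothing in this sequence stops a still-running process from dirtying
the file system again right after the FITHAW.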

But then there's the other thing: in order to call FIFREEZE/FITHAW we
need to open() an fd on it. Which means actually doing disk accesses
(possibly including enqueuing write accesses due to atime), and
that's something we currently try hard to avoid, because doing that
on file systems that aren't healthy anymore means deadlocks. In
particular for network backed file systems this actually matters: for
them we want to issue an umount() or remount syscall, and only that,
we never want to actually access the file system, because the network
is very likely already down or otherwise unavailable. Now you might
say "but xfs is not a network file system!". That's not true
unfortunately, iscsi, nbd and all that other stacked jumble means
everything is a network file system these days.

The sync() syscall doesn't suffer from this issue. Yes, it will of
course trigger disk I/O too, that's its whole point after
all. However, it will only do so on file systems that are known to be
dirty, and won't generate new I/O on its own.

There was a thread about this a while back on systemd-devel:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038615.html
(and around there).  Back then the open() issue wasn't clear to me,
this only came up recently when we worked on some NFS-related umount
work (specifically: current systemd will now implement in userspace a
time-out around mount() and sync() because of the general flakiness of
that interface).
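
A userspace time-out of that kind can be sketched as follows: run the
possibly-hanging call in a forked child and stop waiting after a
deadline. This is illustrative Python under stated assumptions, not
systemd's implementation; `run_with_timeout`, its parameters and the
polling interval are hypothetical.

```python
import os
import signal
import time


def run_with_timeout(operation, timeout_sec, poll_interval=0.05):
    """Run operation() in a child process; give up after timeout_sec.

    Returns True if the child finished successfully in time, and False
    if it failed or the deadline passed.
    """
    pid = os.fork()
    if pid == 0:
        # Child: perform the blocking call, then exit without running
        # any of the parent's cleanup handlers.
        try:
            operation()
            os._exit(0)
        except BaseException:
            os._exit(1)

    deadline = time.monotonic() + timeout_sec
    while True:
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_interval)

    # Deadline passed: try to kill the child, but never block on it. A
    # child stuck in uninterruptible I/O may not die promptly, and the
    # caller should still be able to move on.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, os.WNOHANG)
    return False
```

A caller would use it as `run_with_timeout(os.sync, 30.0)`, where the
30 seconds is an arbitrary value picked for the example.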

Hence I am pretty sure FIFREEZE is really not useful. It has a
different purpose, and by using it we might make things nicer for some
but much worse for many others. In that thread I indicated I'd merge a
patch that adds it. At this point I am convinced that that would just
be a game of whack-a-mole, and we'd just make things worse on
networking fs...

Hence, I am pretty sure that xfs should fix their implementation of
sync(). I'd also be fine with calling some other API for this if they
really don't want to fix sync() — as long as it is generic, and not
some xfs specific hack. The key really here is that the API should
actually do what is needed here, i.e. no pausing of IO or so. And it
should not require us to open an fd on the file system in question.

That all said: I figure plymouth should be changed to start in the
initrd and then stick around for good, and never be updated/replaced
by any binary from the host system. That way it can't and won't keep
any files pinned from the host fs, if it's updated, as it will be
purely backed by the initrd file system. We generally require this
from storage tools, and plymouth should do the same.

Lennart
