Re: [PATCH 3/3] xfs: freeze rw filesystems just prior to reboot

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 22 May 2017 12:01:12 +1000

On Fri, May 19, 2017 at 02:00:40PM -0700, Darrick J. Wong wrote:
> On Fri, May 19, 2017 at 01:09:31PM -0600, Chris Murphy wrote:
> > On Thu, May 18, 2017 at 4:30 PM, Darrick J. Wong
> > <darrick.wong@xxxxxxxxxx> wrote:
> > > On Thu, May 18, 2017 at 06:34:05PM +1000, Dave Chinner wrote:
> > >> On Wed, May 17, 2017 at 06:32:42PM -0700, Darrick J. Wong wrote:
> > >> > Apparently there are certain system software configurations that do odd
> > >> > things like update the kernel and reboot without umounting the /boot fs
> > >> > or remounting it readonly, either of which would push all the AIL items
> > >> > out to disk.  As a result, a subsequent invocation of something like
> > >> > grub (which has a frightening willingness to read a fs with a dirty log)
> > >> > can read stale disk contents and/or miss files the metadata for which
> > >> > have been written to the log but not checkpointed into the filesystem.
> > >>
> > >> > Granted, most of the time /boot is a separate partition and
> > >> > systemd/sysvinit/whatever actually /do/ unmount /boot before rebooting.
> > >> > This "fix" is only needed for people who have one giant filesystem.
> > >>
> > >> Let me guess the series of events: grub calls "sync" and says "I'm
> > >
> > > dpkg/rpm/systemd/CABEXTRACT.EXE/whatever, but yes :)
> > 
> > 
> > It has nothing to do with GRUB. The exact same problem would happen
> > regardless of bootloader because the thing writing out bootloader
> > configuration prior to reboot is grubby, which is not at all in any
> > way related to GRUB.

Chris, I've told you previously this is wrong (it's incorrect for
lilo) and too narrow (grubby is not the only update infrastructure
around), but I'll repeat it again because we need to move past your
single-distro focus on grubby and discuss the underlying problems.

> > I've explained this before, and now Dave's continued misstatements are
> > propagating to Darrick. If you guys want to believe in untrue things,
> > have at it. But please stop repeating untrue things.
> 
> I need to clarify here what I meant by
> "dpkg/rpm/systemd/CABEXTRACT.EXE/whatever, but yes". is that the thing
> that writes to the filesystem prior to the reboot -- the package manager
> and the boot configuration file writer.  Not the grub stage1/2/3, not
> the lilo binary, not any of the bootloaders.

> I won't speak for Dave but
> I think you and I are roughly on the same page here.

Mostly. If you s/grub/grub update infrastructure/ in what I wrote,
then the first aspect of the description is the same.

However, IMO problem does indeed lie with the bootloader and not the
distro packaging mechanism/scripts, and so we need to talk a bit out
the architectural differences between bootloaders like lilo and
grub/petitboot and why I consider update durability to be something
the bootloader needs to provide, not thrid party packagers or the
init system.

> > Now originally they were blaming the file systems, saying that sync()
> > is supposed to guarantee everything, data and metadata is completely
> > written to stable media. But I think that definition of sync()
> > predates journaled file systems, so now there's broad understanding in
> > fs circles that journaled file systems only guarantee the journal and
> > data are committed to stable media, not the fs metadata itself.
> 
> Right.  Everything's committed to stable media /somewhere/, but the
> metadata isn't necessarily where an fs driver will find it if that
> driver ignores the dirty log.

Right. ISTR that early FS journalling papers from the late 80s
talked about this in detail, and why they didn't need to write back
metadata to provide sync(1) guarantees.  All journalling filesystems
since then have simply written the data+journal on sync(1).  These
sorts of basic OS concepts should be known by *everyone* writing low
level OS infrastructure...

> > This whole LILO thing is irritating. I don't know how many times I
> > have to say it...
> > 
> > grubby is the sole thing responsible for writing bootloader
> > configuration changes, no matter the bootloader, on Red Hat and Fedora
> > systems. There is absolutely no difference between LILO and GRUB
> > bootloader configuration changes on these distros.
> 
> I'm tired of getting beat over the head about this.  There's no grubby
> on Debian!

I've told Chris that several times, too. He's still shouting at me
(and you!) about grubby, though...

> (No *sync(), huh?)
...
> (we only fsync the grubenv file??)

Yikes! It's worse than I thought.

So, with that all out of the way, lets look at a lilo update.
My laptop (debian) running lilo:

$ sudo vi /etc/lilo.conf
<modify and save new config file to simulate dpkg installing a new
 kernel package, updating /boot and various links and modifying the
 /etc/lilo.conf file>
$ sudo strace -o t.t lilo
Added Linux  *
Added Linux.old
$ grep sync t.t
sync()                                  = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
fdatasync(6)                            = 0
sync()                                  = 0
sync()                                  = 0
$ 

The files decriptors of note are:

open("/dev/disk/by-id/ata-C400-MTFDDAK256MAM_000000001305036641EA", O_RDWR) = 5
open("/boot/map~", O_RDWR)              = 6

And here's the last ~10 operations done:

write(6, "\353N\0\0\0\0LILO\30\2\3360\"Y\2\3\0\0\266\0\0\0\0\0\225;\340V\2\0", 32) = 32
close(6)                                = 0
lseek(5, 0, SEEK_SET)                   = 0
sync()                                  = 0
write(5, "\372\353!\1\264\1LILO\30\2\3360\"Y\0\0\0\0\303\3049Q\224\236\7\0\241\0\200`"..., 512) = 512
close(5)                                = 0
rename("/boot/map~", "/boot/map")       = 0
sync()                                  = 0
close(4)                                = 0
exit_group(0)

This says that lilo uses a safe overwrite proceedure to do the
on-disk bootloader update. i.e. the initial sync ensures all the new
kernel updates and config file mods are written to disk. It then
maps all the files and writes the /boot/map~ file and syncs
everything. At this point, the new boot information is on the block
device. It then writes the new boot sector, renames /boot/map~ to it's
proper name, then runs sync. This atomic update of the boot sector
switches to the new boot map and makes the switch permanent.

Hence when the lilo binary exits, the information required by the
bootloader is *completely on stable storage*. At each stage, right
up to the atomic boot sector overwrite, a crash or failure leaves
the old boot map and boot sector intact. IOWs, lilo has a *fail-safe
update procedure*.

As a result, you do *not* need to call sync or freeze the filesystem
after running the lilo command because it has already guaranteed the
updated bootloader information is on stable storage.  And because
you have to run lilo to update the bootloader, distro package
managers *can't fuck it up* because the bootloader has full control
of the update process.

IOWs, Lilo updates are based on a fool-proof design.  It's also
fail-safe on many levels because not only is the update mechanism
fail-safe, config errors/failures will be caught during update
rather before the system is restarted rather than when the config
files are parsed during bootloader execution when it errors may be
difficult/impossible to fix.

Grub, unfortunately, does not have a fail-safe update procedure, and
it clearly does not have a fool-proof update architecture (as this
thread demonstrates). If we expect distro packaging tools to do
reliable boot loader updates, then the bootloader needs to provide
both fail-safe updates and do it via a fool-proof mechanism. It's no
good having a fail-safe update mechanism if it does not get used to
do updates. Grub currently fails on both accounts....

> > So to be consistent you have to blame RPM for not ensuring its writes
> > are safely on stable storage either, before reboot.
> > 
> > Jesus Christ...
> > 
> > I want Dave to write "this problem has nothing to do with GRUB" 50
> > times on a chalkboard.
> 
> Frankly, all you have to do is tweak the grub config file writer to
> "fsfreeze -f /boot && fsfreeze -u /boot" after it writes the grub.cfg and
> before it exits.  That's both grubby and grub-mkconfig.

Yup, that's the work-around I've previously suggested should be
added to grub and distro update mechanisms.  It doesn't fix the
underlying architectural issues grub has (e.g. updates are still
not fail-safe), but it provides the durability guarantee that is
required if the whole update process completes.

> Or systemd/sysvinit could always ensure that the fs containing the boot
> files is unmounted no matter what.  Or systemd/sysvinit could freeze the
> fs for us before asking for a reboot.  Or the kernel could freeze the fs
> for us before actually doing the reboot.  Any one of those options is
> sufficient, but not one of them is reliably performed by any of the
> pieces.  systemd *almost* always succeed at unmounting /boot, which is
> why few people see this problem.

Design principles of modularity and coupling say it's really up to
the bootloader to provide the durability mechanisms it needs.  IOWs,
the *bootloader update infrastructure* needs to provide the
durability guarantee/mechanism the bootloader infrastructure
requires (as per the lilo example) and not the init system.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html