Re: Is rename(2) atomic on FAT?

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Wed, 23 Oct 2019 16:21:19 +0200

On Wed, Oct 23, 2019 at 1:50 PM Pali Rohár <pali.rohar@xxxxxxxxx> wrote:
>
> Hi!
>
> On Wednesday 23 October 2019 02:10:50 Chris Murphy wrote:
> > a. write bootloader file to a temp location
> > b. fsync
> > c. mv temp final
> > d. fsync
> >
> > if the crash happens anywhere from before a. to just after c. the old
> > configuration file is still present and old kernel+initramfs are used.
> > No problem. If the crash happens well after c. probably the new one is
> > in place, for sure after d. it's in place, and the new kernel+
> > initramfs are used.
>
> I do not think that kernel guarantee for any filesystem that rename
> operation would be atomic on underlying disk storage.
>
> But somebody else should confirm it.

I don't know either or how to confirm it. But, being ignorant about a
great many things, my instinct is literal fsync (flush buffer to disk)
should go away at the application level, and fsync should only be used
to indicate write order and what is part of a "commit" that is to be
atomic (completely succeeds or fails). And of course that can only be
guaranteed as far as the kernel is concerned, it doesn't guarantee
anything about how the hardware block device actually behaves (warts
bugs and all).

Anyway it made me think of this:
https://lwn.net/Articles/789600/

> So if kernel crashes in the middle of c or between c and d you need to
> repair filesystem externally prior trying to boot from such disk.

Nice in theory, but in practice the user simply reboots, and screams
WTF outloud if the system face plants. And people wonder why things
are still broken 20 years later with all the same kinds of problems
and prescriptions to boot off some rescue media instead of it being
fail safe by design. It's definitely not fail safe to have a kernel
update that could possibly result in an unbootable system. I can't
think of any ordinary server, cloud, desktop, mobile user who wants to
have to boot from rescue media to do a simple repair. Of course they
all just want to reboot and have the right thing always happen no
matter what, otherwise they get so nervous about doing updates that
they postpone them longer than they should.

> > I'm not sure how to test the following: write kernel and initramfs to
> > final locations. And bootloader configuration is written to a temp
> > path. Then at the decision moment, rename it so that it goes from temp
> > path to final path doing at most 1 sector change. 1 512 byte sector
> > is a reasonable number to assume can be completely atomic for a
> > system. I have no idea if FAT can do such a 'mv' event with only one
> > sector change
>
> Theoretically it could be possible to implement it for FAT (with more
> restrictions), but I doubt that general purpose implementation of any
> filesystem in kernel can do such thing. So no practically.

Now I'm wondering what the UEFI spec says about this, and whether this
problem was anticipated, and how surprised I should be if it wasn't
anticipated.

>
> > >
> > >
> > > But... if you are asking for consistency and atomicity at filesystem
> > > level (e.g. you turn off disk / power supply during rename operation)
> > > then this is not atomic and probably it cannot be implemented. When FAT
> > > filesystem is mounted (either by Windows or Linux kernel) it is marked
> > > by "dirty" flag and later when doing unmount, "dirty" flag is cleared.
> >
> > Right. And at least on UEFI and arm boards, it's not the linux kernel
> > that needs to read it right after a crash. It's the firmware's FAT
> > driver. I have no idea how they react to the dirty flag.
>
> Those bootloader firmwares which just load & run bootloader practically
> do not write anything to that FAT filesystem. In most cases their
> implementation of FAT is read-only and very stupid. I doubt that there
> is check for dirty flag.
>
> I saw lot of commercial devices of different kind which can read & write
> (backup) data to (FAT) SD card. And lot of time they were not able to
> read FAT filesystem formatted by other tool, only by their (or by
> in-device FAT formatted).
>
> So such firmwares can be full of bugs and it really is not a good idea
> to try booting bootloader from inconsistent FAT filesystem.

Right. I've had quite a bit of experience with this too, but lately I
think my experience is actually chock full of noisy data and what I
thought I was seeing, might not actually be what I was seeing.

Since ancient times in digital photography and video, it's been
considered widely that the camera firmware's FAT driver is crap, and
often corrupts the flash media, in particular when doing things like
individual image file deletes, or exchanging cards between unlike
cameras (make or model). As it turns out, this narrative is mostly
pushed by the flash media vendors.

Fast forward to the advent of cheap ARM boards and even Intel NUC type
computers, and people experiencing various kinds of corruption with
consumer name brand SD cards. The more generic, the more likely the
card goes suddenly read only forever. But even the name brand cards
I've used in an Intel NUC have had this happen, being replaced without
complaint by the manufacturer under warranty, yet it still keeps on
happening. Then found HN threads about this and people saying, yeah
you have to use industrial flash cards for this purpose, totally
solves the problem. And voila, there's enough anecdotal data out there
that really it's consumer flash being super sensitive to power cuts.

It may in fact have never had a thing to do with crap file system drivers.

> > GRUB has an option to blindly overwrite the 1024 byte contents of
> > grubenv (no file system modification), that's pretty close to atomic.
> > Most devices have physical sector bigger than 512 bytes. This write is
> > done in the pre-boot environment for saving state like boot counts.
>
> This depends on grub's FAT implementation. As said I would be very
> careful about such "atomic" writes. There are also some caches, include
> hardware on-disk, etc...

GRUB doesn't use any file system driver for writes, ever. It uses a
file system driver only to find out what two LBAs the "grubenv"
occupies, and then blindly overwrites those two sectors to save state.
There is no file system metadata update at all.

>
> > And add to the mix that I guess some UEFI firmware allow writing to
> > FAT in the pre-boot environment?
>
> Yes, UEFI API allows you to write to disk devices. And UEFI fileystem
> implementation can also supports writing to FAT fs.
>
> > I don't know if that's universally true. How do firmware handle a dirty bit being set?
>
> Bad implementation would ignore it. This is something which you should
> expect.

Maybe a project for someone is to bake xfstests into an EFI program so
we can start testing these firmware FAT drivers and see what we learn
about how bad they are?

-- 
Chris Murphy