Re: Is rename(2) atomic on FAT?

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Wed, 23 Oct 2019 02:10:50 +0200

On Tue, Oct 22, 2019 at 12:54 PM Pali Rohár <pali.rohar@xxxxxxxxx> wrote:
>
> Hi Chris!
>
> The first question is what you mean by "atomic". Either it is
> "atomic" at process level, so that any process which accesses the
> filesystem sees consistent data at any time, or by atomic you mean
> consistency of the filesystem on the underlying block device itself,
> or you mean atomicity at disk storage level.

Yeah, good question. It's a bit more complicated in reality, because
distros do things differently.

In the case of making kernel updates "atomic", it's to ensure only one
of two things happens: the old boot works or the new boot works. No
matter what, including a crash or power fail at any point during the
update. Possibly three or more files make up a "boot": kernel,
initramfs (could be more than one), and bootloader configuration. In
theory, the new kernel is written first, initramfs second, and only
once they are on stable media is the bootloader configuration file
modified, replaced, or newly written.

In the case of one kernel and initramfs, I'd have to believe no one is
doing a literal overwrite of those files (same inode). If there's a
crash or power fail, that kind of update almost certainly means an
unbootable system due to partial write of kernel or initramfs. So the
best practice for single kernel updating should be write out all new
files for kernel + initramfs, fsync, write out bootloader change to a
new file, fsync, then rename, fsync. (?)

For multiple kernels, it doesn't matter if a crash happens anywhere
from the new kernel being written to FAT, through initramfs, because the
old bootloader configuration still points to the old kernel + initramfs.
But in multiple kernel distros, the bootloader configuration needs
modification or a new drop-in scriptlet to point to the new
kernel+initramfs pair. And that needs to be completely atomic: write
new files to a tmp location, that way a crash won't matter. The tricky
part is to write out the bootloader configuration change such that it
can be an atomic operation.

a. write bootloader file to a temp location
b. fsync
c. mv temp final
d. fsync

If the crash happens anywhere from before a. to just after c., the old
configuration file is still present and the old kernel+initramfs are
used. No problem. If the crash happens well after c., the new one is
probably in place; after d. it is in place for sure, and the new
kernel+initramfs are used.
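
As a rough sketch of steps a. through d. (Python used only for
illustration): the function name and the ".tmp" suffix are made up, and
I'm assuming step d. means an fsync on the containing directory. This
shows the sequence only. It says nothing about whether FAT makes the
rename itself crash-safe, which is the open question here.

```python
import os


def replace_config_atomically(final_path: str, data: bytes) -> None:
    """Write-temp, fsync, rename, fsync-directory (steps a. to d.)."""
    directory = os.path.dirname(os.path.abspath(final_path))
    temp_path = final_path + ".tmp"  # hypothetical temp name

    # a. write the bootloader file to a temp location
    with open(temp_path, "wb") as f:
        f.write(data)
        f.flush()
        # b. fsync the new file's data
        os.fsync(f.fileno())

    # c. mv temp final
    os.replace(temp_path, final_path)

    # d. fsync the directory so the rename itself is flushed
    #    (assumption: this is what the second fsync refers to)
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Keeping the temp file in the same directory as the final path means the
rename never crosses file systems.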

> > According to my understanding of FAT rename() is not atomic at all.
> > It can downgrade to a hardlink. i.e. rename("foo", "bar") can result in having
> > both "foo" and "bar."
> > ...or worse.
>
> Generally rename() may really cause that at some period of time both
> "foo" and "bar" may point to the same inode. (But is this really a
> problem for your scenario?)

Probably not. Either the old boot works or the new boot works.

There is a goofy thing that can happen on journaled file systems, where
the journal is updated for a file (kernel, initramfs, journalcdt) but
the normal file system metadata is not, and then a crash happens. In
that case the bootloader's file system code can't do journal replay,
and might fail to find either the old or the new file intact.

>
> But looking at vfat source code (file namei_vfat.c), both rename and
> lookup operation are locked by mutex, so during rename operation there
> should not be access to read directory and therefore race condition
> should not be there (which would cause reading inconsistent directory
> during rename operation).
>
> If you want atomic rename of two files independently of filesystem, you
> can use the RENAME_EXCHANGE flag. It exchanges the two specified files
> atomically, so there would not be that race condition like in rename()
> that in some period of time both "foo" and "bar" would point to same
> inode.

I'm not sure how to test the following: write the kernel and initramfs
to their final locations, and write the bootloader configuration to a
temp path. Then, at the decision moment, rename it so that it goes from
temp path to final path with at most one sector changed. One 512-byte
sector is a reasonable amount to assume can be written completely
atomically on a system. I have no idea if FAT can do such a 'mv' event
with only one sector change.
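
One possible way to test this, as a sketch only: take a raw image of a
small FAT file system before the 'mv' and again after the 'mv' plus
sync, then count which 512-byte sectors differ. The function name is
made up. Note this only shows the net difference between two snapshots,
not how many writes were issued or in which order.

```python
from typing import List


def changed_sectors(before: bytes, after: bytes,
                    sector_size: int = 512) -> List[int]:
    """Return the indices of sectors that differ between two images."""
    if len(before) != len(after):
        raise ValueError("images must be the same size")
    changed = []
    for offset in range(0, len(before), sector_size):
        if (before[offset:offset + sector_size]
                != after[offset:offset + sector_size]):
            changed.append(offset // sector_size)
    return changed
```

If the returned list has more than one entry, the rename was not a
single-sector change for that particular layout.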

>
>
> But... if you are asking for consistency and atomicity at filesystem
> level (e.g. you turn off disk / power supply during rename operation)
> then this is not atomic and probably it cannot be implemented. When FAT
> filesystem is mounted (either by Windows or Linux kernel) it is marked
> by "dirty" flag and later when doing unmount, "dirty" flag is cleared.

Right. And at least on UEFI and arm boards, it's not the Linux kernel
that needs to read it right after a crash. It's the firmware's FAT
driver. I have no idea how they react to the dirty flag. Most distros
set /etc/fstab FS_PASSNO to 2, maybe it should be a 1, but in any case
if we boot something far enough along to get to user space fsck, the
dirty flag is cleaned up.
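
For anyone who wants to look at the flag directly, here is a sketch
that reads it from a copy of the first sector of the volume. The
offsets are my assumption, based on the common FAT boot sector layout:
a state byte at 0x41 for FAT32 and 0x25 for FAT12/16, with bit 0x01
meaning dirty. The function name is made up. It says nothing about what
any firmware does with the flag.

```python
import struct

FAT_STATE_DIRTY = 0x01  # assumed bit value for "dirty"


def fat_dirty_flag(boot_sector: bytes) -> bool:
    """Report whether the dirty bit is set in a FAT boot sector."""
    if len(boot_sector) < 512:
        raise ValueError("need at least the first 512 bytes of the volume")
    # The 16-bit FAT size field at offset 22 is zero on FAT32.
    (fat_size_16,) = struct.unpack_from("<H", boot_sector, 22)
    state_offset = 0x41 if fat_size_16 == 0 else 0x25
    return bool(boot_sector[state_offset] & FAT_STATE_DIRTY)
```

Feed it the first 512 bytes of the partition, read while the file
system is not mounted.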

>
> This is there to ensure that operations like rename were finished and
> were not stopped/killed in between. So in the future, when you read
> from the FAT filesystem, you would know if it is in a consistent state
> or not.

GRUB has an option to blindly overwrite the 1024-byte contents of
grubenv (no file system modification), and that's pretty close to
atomic. Most devices have a physical sector bigger than 512 bytes. This
write is done in the pre-boot environment for saving state like boot
counts.
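
To illustrate the idea of a same-size overwrite with no allocation
change, here is a sketch done from the OS side. It is not GRUB's code:
GRUB does this in the pre-boot environment without going through file
system code, while this goes through the kernel. The header line and
the '#' padding are my simplified assumption of the environment block
layout, and the function names are made up.

```python
import os

ENV_BLOCK_SIZE = 1024
HEADER = b"# GRUB Environment Block\n"  # assumed header line


def build_env_block(variables: dict) -> bytes:
    """Build a fixed-size block of key=value lines padded with '#'."""
    body = HEADER + b"".join(
        f"{key}={value}\n".encode() for key, value in variables.items()
    )
    if len(body) > ENV_BLOCK_SIZE:
        raise ValueError("variables do not fit in the block")
    return body + b"#" * (ENV_BLOCK_SIZE - len(body))


def overwrite_in_place(path: str, block: bytes) -> None:
    """Overwrite an existing block without changing the file size."""
    if len(block) != ENV_BLOCK_SIZE:
        raise ValueError("block must be exactly 1024 bytes")
    fd = os.open(path, os.O_WRONLY)  # no O_CREAT, no O_TRUNC
    try:
        if os.fstat(fd).st_size != ENV_BLOCK_SIZE:
            raise ValueError("existing file is not a 1024-byte block")
        if os.pwrite(fd, block, 0) != ENV_BLOCK_SIZE:
            raise OSError("short write")
        os.fsync(fd)
    finally:
        os.close(fd)
```

Because the file never grows or shrinks, no directory entry or FAT
chain has to change for the update.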

And add to the mix that I guess some UEFI firmware allows writing to
FAT in the pre-boot environment? I don't know if that's universally
true. How does firmware handle a dirty bit being set? It's bad if the
firmware writes to such a file system anyway. But it's also bad if it
can't save state, because then it's not possible to save boot attempts
for fallback purposes.

--
Chris Murphy