Re: [bug report] tune2fs: filesystem inconsistency occurs by concurrent write

"Theodore Ts'o" <tytso@xxxxxxx> · Wed, 12 Jul 2023 11:42:49 -0400

On Wed, Jul 12, 2023 at 05:06:31PM +0800, zhanchengbin wrote:
> > ...at a cost of racing with the mounted fs, which might be updating the
> > superblock at the same time; and prohibiting the kernel devs from
> > closing the "scribble on mounted bdev" attack surface.
> 
> Regardless of whether I am modifying a single byte or the entire
> buffer_head, there will always be a situation of contention with the kernel
> lock, You can take a look at ext4_update_superblocks_fn which calls
> lock_buffer.

Many/most of the fields that tune2fs will need to modify are ones
which the kernel never needs to modify.  So no locking will be
necessary, and so long as you are using the journal, we don't need to
worry about an invalid checksum getting written to the disk.

There might be races with buffered reads to the superblock, but those
races exist today.  E2fsprogs has a way of dealing this where if the
checksum is invalid, it will just sleep and retry.  Another way of
dealing with is to use an O_DIRECT read to the superblock.

Longer-term, we may want to have a EXT4_IOC_GET_SUPERBLOCK ioctl, at
which point we may need to have some kind of new kernel locking ----
or we can just take a snapshot of the superblock, and check to see the
checksum is valid; if not, it can just retry the snapshot.

> What I am more concerned about is that the superblock needs to be
> synchronized to the memory before being saved to the disk. Otherwise, during
> the ext4_commit_super process, outdated data may be saved to the disk.

As I've noted above, this is already not a problem if journalling is
enabled.  If journalling is not enabled, it's possible that an invalid
superblock is written to disk.  This can sort of happen already, if
you have an orphan list update racing with an ext4_error() update to
the superblock.  It's a debateable point how much we should care,
since if you don't have journalling enabled, on a crash the file
system may be corrupted in many situations, and so we will need to use
fsck.ext4 anyway.  If there is only a single superblock, then fsck
might not have a fallback superblock to use, but arguably that's a
corner case.

> The program perceives that the superblock has been modified
> successfully, and the value of the modified superblock is saved on
> disk in ext4_update_primary_sb, but there is no guarantee whether
> the super block is saved in journal on the disk or whether it is
> checkpointed.

So in actual practice, e2fsprogs will replay the journal and then
reread the superblock.  So if you change the max_mounts_count via
tune2fs -c (even though very few systems use that these days, and it's
largely a deprecated feature), so long as the transaction has been
committed, the superblock update will be honored by e2fsck.

I'm not sure we care *all* that much, but if we really want to make
sure things like tune2fs -c will always "take" after a crash, we can
simply have the ioctl force a journal commit before returning.

> If the super block has not been saved in journal on
> the disk and the system crashes, the modification of the super block
> may be overwritten when the journal recover; similarly, this problem
> will also occur for the translation that has not been checkpointed;
> Both of these scenarios are not perceptible to user process unless
> there is a backup mechanism implemented in user mode.

We now *always* update the superblock through the journal.  We only
fall back to a direct update of the superblock if the journal is not
present, or the journal has aborted due to some kind of fundamental
failure (so that the ext4_error can be written out before we reboot
the system or force the file system to be read-only).

> Moreover, the method of rerunning the program cannot resolve the conflicting
> racing condition between the two ioctls.

These ioctls are quite rarely used, and it's a problem we have today
if we have two racing tune2fs commands.  By having the kernel handle
it, so long as the two ioctls are modifying different superblock
fields, both ioctl updates will happen --- where as today, when we
have two racing tune2fs, one of the tune2fs updates could be
completely lost.  This has been true for decades in ext2, ext3, and
ext4, and no one has actually reported this as a problem.

Cheers,