On Wed, Jul 12, 2023 at 05:06:31PM +0800, zhanchengbin wrote: > > ...at a cost of racing with the mounted fs, which might be updating the > > superblock at the same time; and prohibiting the kernel devs from > > closing the "scribble on mounted bdev" attack surface. > > Regardless of whether I am modifying a single byte or the entire > buffer_head, there will always be a situation of contention with the kernel > lock, You can take a look at ext4_update_superblocks_fn which calls > lock_buffer. Many/most of the fields that tune2fs will need to modify are ones which the kernel never needs to modify. So no locking will be necessary, and so long as you are using the journal, we don't need to worry about an invalid checksum getting written to the disk. There might be races with buffered reads to the superblock, but those races exist today. E2fsprogs has a way of dealing this where if the checksum is invalid, it will just sleep and retry. Another way of dealing with is to use an O_DIRECT read to the superblock. Longer-term, we may want to have a EXT4_IOC_GET_SUPERBLOCK ioctl, at which point we may need to have some kind of new kernel locking ---- or we can just take a snapshot of the superblock, and check to see the checksum is valid; if not, it can just retry the snapshot. > What I am more concerned about is that the superblock needs to be > synchronized to the memory before being saved to the disk. Otherwise, during > the ext4_commit_super process, outdated data may be saved to the disk. As I've noted above, this is already not a problem if journalling is enabled. If journalling is not enabled, it's possible that an invalid superblock is written to disk. This can sort of happen already, if you have an orphan list update racing with an ext4_error() update to the superblock. It's a debateable point how much we should care, since if you don't have journalling enabled, on a crash the file system may be corrupted in many situations, and so we will need to use fsck.ext4 anyway. If there is only a single superblock, then fsck might not have a fallback superblock to use, but arguably that's a corner case. > The program perceives that the superblock has been modified > successfully, and the value of the modified superblock is saved on > disk in ext4_update_primary_sb, but there is no guarantee whether > the super block is saved in journal on the disk or whether it is > checkpointed. So in actual practice, e2fsprogs will replay the journal and then reread the superblock. So if you change the max_mounts_count via tune2fs -c (even though very few systems use that these days, and it's largely a deprecated feature), so long as the transaction has been committed, the superblock update will be honored by e2fsck. I'm not sure we care *all* that much, but if we really want to make sure things like tune2fs -c will always "take" after a crash, we can simply have the ioctl force a journal commit before returning. > If the super block has not been saved in journal on > the disk and the system crashes, the modification of the super block > may be overwritten when the journal recover; similarly, this problem > will also occur for the translation that has not been checkpointed; > Both of these scenarios are not perceptible to user process unless > there is a backup mechanism implemented in user mode. We now *always* update the superblock through the journal. We only fall back to a direct update of the superblock if the journal is not present, or the journal has aborted due to some kind of fundamental failure (so that the ext4_error can be written out before we reboot the system or force the file system to be read-only). > Moreover, the method of rerunning the program cannot resolve the conflicting > racing condition between the two ioctls. These ioctls are quite rarely used, and it's a problem we have today if we have two racing tune2fs commands. By having the kernel handle it, so long as the two ioctls are modifying different superblock fields, both ioctl updates will happen --- where as today, when we have two racing tune2fs, one of the tune2fs updates could be completely lost. This has been true for decades in ext2, ext3, and ext4, and no one has actually reported this as a problem. Cheers,