On Tue, Nov 19, 2024 at 09:49:57AM -0800, Theodore Ts'o wrote:
> Yes, this can happen if the file system is mounted. The reason for
> this is that the kernel updates metadata blocks via the block buffer
> cache, with the jbd2 (journaled block layer v2) subsystem managing the
> atomic updates. The jbd2 layer will block buffer cache writebacks
> until the changes are committed in a jbd2 transaction. So the version
> on disk is guaranteed to be consistent.
>
> However, a buffer cache read does not have any consistency guarantees,
> and if the file system is being actively modified, it is possible that
> you could read a superblock where the checksum hasn't yet been updated.
>
> The O_DIRECT read isn't a magic bullet. For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash. So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.

Thanks for the additional detail on jbd2's involvement. When I
originally encountered this, it was on a 5.15 kernel where
ext4_commit_super() was still using mark_buffer_dirty() prior to
submitting the IO for the superblock write. I had managed to convince
myself that ext4_commit_super() holding the BH_lock, combined with
O_DIRECT waiting for the dirty buffers associated with the superblock
to get written, was sufficient to get a consistent read of the
superblock. I missed that this was changed as part of another
bugfix[1].

The version of this fix that you applied for resize2fs has resulted in
no recurrence of the problem in the environments where we had
previously been encountering it.
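As an illustration only, and not the patch discussed in this thread: one generic way for a userspace reader to tolerate a transiently stale checksum is to verify the superblock and re-read it a few times before giving up. The sketch below assumes the usual ext4 layout (a 1024-byte superblock, magic at 0x38, s_feature_ro_compat at 0x64, s_checksum at 0x3FC, crc32c seeded with ~0 and stored without a final inversion). Those offsets and the helper names (superblock_ok, read_superblock, read_fn) are assumptions made for the example and should be checked against the ext4 on-disk format documentation.

```python
import struct
import time

# Assumed ext4 superblock layout constants (verify against the ext4 docs).
SB_SIZE = 1024
MAGIC_OFFSET = 0x38
RO_COMPAT_OFFSET = 0x64
CSUM_OFFSET = 0x3FC
EXT4_MAGIC = 0xEF53
RO_COMPAT_METADATA_CSUM = 0x400


def _make_table():
    # Table for the reflected Castagnoli polynomial used by crc32c.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_raw(seed, data):
    # crc32c without the final bit inversion; the caller supplies the seed.
    crc = seed & 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc


def superblock_ok(sb):
    # True if the buffer looks like an ext4 superblock with a valid checksum.
    (magic,) = struct.unpack_from("<H", sb, MAGIC_OFFSET)
    if magic != EXT4_MAGIC:
        return False
    (ro_compat,) = struct.unpack_from("<I", sb, RO_COMPAT_OFFSET)
    if not ro_compat & RO_COMPAT_METADATA_CSUM:
        return True  # no metadata checksums on this file system
    (stored,) = struct.unpack_from("<I", sb, CSUM_OFFSET)
    return stored == crc32c_raw(0xFFFFFFFF, sb[:CSUM_OFFSET])


def read_superblock(read_fn, retries=3, delay=0.01):
    # read_fn is a hypothetical callable returning the 1024 superblock bytes.
    for _ in range(retries):
        sb = read_fn()
        if len(sb) == SB_SIZE and superblock_ok(sb):
            return sb
        time.sleep(delay)  # give an in-flight update a chance to finish
    return None
```

On a real device, read_fn could be something like `lambda: os.pread(fd, 1024, 1024)` with fd opened on the block device. A bounded retry only narrows the window; it does not make the read atomic with respect to a concurrent update.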
With libblkid, the problem has resulted in systemd-udevd removing the
/dev/disk/by-label and /dev/disk/by-uuid links for devices when the
superblock checksum can't be verified. This in turn has resulted in
/boot failing to mount (when it's on a separate filesystem),
update-grub calls failing because /boot isn't mounted, and, recently,
a mkinitramfs failure because the /dev/disk/by-uuid links were missing
for the root device.

The patch I sent has resolved the problems in our production
environments, and was also run through a battery of synthetic boot
tests. We've seen no recurrence with it applied. I've also run the
change against the util-linux unit tests and observed no regressions.

I included systemd-devel on this in case other users were observing
disappearing /dev/disk/ links. I hoped I might save somebody else from
having to debug this a second time.

-K

[1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@xxxxxxxxxx/