[GIT PULL] super updates

Christian Brauner <brauner@xxxxxxxxxx> · Thu, 24 Aug 2023 16:41:04 +0200

Hey Linus,

/* Summary */
This pull request contains the super rework that was ready for this
cycle. The first part changes the order of how we open block devices and
allocate superblocks, contains various cleanups, simplifications, and a
new mechanism to wait on superblock state changes.

This unblocks work to ultimately limit the number of writers to a block
device. Jan has already scheduled follow-up work that will be ready for
v6.7 and allows us to restrict the number of writers to a given block
device. That series builds on this work right here.

The second part contains filesystem freezing updates.

Tree Organization
=================

The filesystem freezing work was done by Darrick and he provided me with
two tags to pull from. So this pull request right here contains two
subtrees. The generic superblock changes are first. The two filesystem
freezing related tags are merged into it:

* The 'vfs-6.6-merge-2' tag brings in the generic filesystem freezing
  changes. Resulting merge conflicts have been resolved.

* The 'vfs-6.6-merge-3' brings in the xfs changes that make use of the
  new filesystem freezing changes.

Should you decide to not pull this then Darrick can provide you with the
two tags for the filesystem freezing changes separately.

A few changes, including Darrick's pull requests, have been pulled in
only fairly recently but they have all been in -next.

Overview
========

The generic superblock changes are roughly organized as follows (skipping
over additional minor cleanups):

(1) Removal of the bd_super member from struct block_device.
    This was a very odd back pointer to struct super_block with unclear
    rules. For all relevant places we have other means to get the same
    information so just get rid of this.
(2) Simplify rules for superblock cleanup.
    Roughly, everything that is allocated during fs_context
    initialization and that's stored in fs_context->s_fs_info needs to
    be cleaned up by the fs_context->free() implementation before the
    superblock allocation function has been called successfully.

    After sget_fc() has returned, fs_context->s_fs_info has been
    transferred to sb->s_fs_info, at which point sb->kill_sb() is fully
    responsible for cleanup. Adhering to these rules means that cleanup
    of sb->s_fs_info in fill_super() is to be avoided as it's brittle
    and inconsistent. Cleanup shouldn't be duplicated in
    sb->put_super() either, as sb->put_super() is only called if
    sb->s_root has been set, aka when the filesystem has been
    successfully born (SB_BORN). That complexity should be avoided.

    This also means that block devices are to be closed in sb->kill_sb()
    instead of sb->put_super(). More details in the lower section.
(3) Make it possible to look up or create a superblock before opening
    block devices.

    There's a subtle dependency on (2) as some filesystems did rely on
    fill_super() to be called in order to correctly clean up
    sb->s_fs_info. All these filesystems have been fixed.
(4) Switch most filesystems to follow the same logic as the generic mount
    code now does as outlined in (3).
(5) Use the superblock as the holder of the block device.
    We can now easily go back from block device to owning superblock.
(6) Export and extend the generic fs_holder_ops and use them as holder
    ops everywhere and remove the filesystem specific holder ops.
(7) Call from the block layer up into the filesystem layer when the
    block device is removed, allowing the filesystem to be shut down
    without risk of deadlocks.
(8) Get rid of get_super().
    We can now easily go back from the block device to owning superblock
    and can call up from the block layer into the filesystem layer when
    the device is removed. So no need to wade through all registered
    superblocks to find the owning superblock anymore.
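
The ownership rule in (2) can be sketched as a small userspace model.
This is only an illustration under simplifying assumptions, not kernel
code: the toy_* types and functions are hypothetical stand-ins for
struct fs_context, struct super_block, fs_context->free(), sget_fc(),
and sb->kill_sb(). It shows that s_fs_info is freed exactly once no
matter where the mount attempt stops, provided fill_super() never frees
it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for struct fs_context and struct super_block. */
struct toy_fc { void *s_fs_info; };
struct toy_sb { void *s_fs_info; bool born; };

static int toy_frees; /* how often the private data was freed */

/* Models fs_context->free(): only frees what was never handed over. */
static void toy_fc_free(struct toy_fc *fc)
{
	if (fc->s_fs_info) {
		free(fc->s_fs_info);
		fc->s_fs_info = NULL;
		toy_frees++;
	}
}

/* Models a successful sget_fc(): ownership moves to the superblock. */
static void toy_sget_fc(struct toy_fc *fc, struct toy_sb *sb)
{
	sb->s_fs_info = fc->s_fs_info;
	fc->s_fs_info = NULL;
}

/* Models sb->kill_sb(): the only owner after the transfer, born or not. */
static void toy_kill_sb(struct toy_sb *sb)
{
	if (sb->s_fs_info) {
		free(sb->s_fs_info);
		sb->s_fs_info = NULL;
		toy_frees++;
	}
}

/* Runs one mount attempt and returns how often s_fs_info was freed. */
static int toy_mount_attempt(bool reach_sget_fc, bool fill_super_ok)
{
	struct toy_fc fc = { .s_fs_info = malloc(16) };
	struct toy_sb sb = { .s_fs_info = NULL, .born = false };

	if (!fc.s_fs_info)
		return -1; /* allocation failed, nothing to model */

	toy_frees = 0;
	if (reach_sget_fc) {
		toy_sget_fc(&fc, &sb);
		/* fill_super() deliberately does not free s_fs_info on failure. */
		sb.born = fill_super_ok;
		toy_kill_sb(&sb);
	}
	toy_fc_free(&fc);
	return toy_frees;
}
```

Calling toy_mount_attempt() for a failure before sget_fc(), a failure in
fill_super(), and a successful mount followed by teardown yields a
single free in each case.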

Optional Details
================

These outlined changes solve a long-standing deadlock and also interact
with general locking requirements between the block and fs layer. That's
probably worth describing a little bit.

The locking rules are such that sb->s_umount nests in
gendisk->open_mutex which is acquired when block devices are opened.
When a new superblock is allocated or an existing superblock is found
sb->s_umount is acquired and held until the superblock is fully
initialized. Since block devices were opened before superblock lookup
or allocation happened the locking order was guaranteed.

But now that we first allocate a new superblock we return with
sb->s_umount of the newly created superblock held. So calling into the
block layer to open block devices would cause us to violate
aforementioned locking order.

In order to preserve locking order sb->s_umount is now dropped before
opening block devices and reacquired before the filesystem provided
fill_super() method is called. This is safe because the superblock isn't
yet SB_BORN and is ignored by all iterators.
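
The ordering problem and its fix can be illustrated with a tiny
single-threaded lock-order checker. This is a sketch under assumptions,
not the kernel's lockdep: all toy_* names are hypothetical, and opening
a block device is reduced to taking and releasing open_mutex. It counts
how often an outer lock is taken while an inner one is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Lower value = outer lock. s_umount nests inside open_mutex. */
enum toy_lock { TOY_OPEN_MUTEX, TOY_S_UMOUNT, TOY_NR_LOCKS };

struct toy_ctx {
	bool held[TOY_NR_LOCKS];
	int violations;
};

static void toy_lock(struct toy_ctx *c, enum toy_lock l)
{
	/* Taking an outer lock while holding an inner one inverts the order. */
	for (int i = l + 1; i < TOY_NR_LOCKS; i++)
		if (c->held[i])
			c->violations++;
	c->held[l] = true;
}

static void toy_unlock(struct toy_ctx *c, enum toy_lock l)
{
	c->held[l] = false;
}

/* Opening a block device takes and releases open_mutex. */
static void toy_open_bdev(struct toy_ctx *c)
{
	toy_lock(c, TOY_OPEN_MUTEX);
	toy_unlock(c, TOY_OPEN_MUTEX);
}

/* Old order: open the device first, then allocate the superblock. */
static int toy_mount_old(void)
{
	struct toy_ctx c = { .violations = 0 };

	toy_open_bdev(&c);
	toy_lock(&c, TOY_S_UMOUNT);   /* superblock returned with s_umount held */
	toy_unlock(&c, TOY_S_UMOUNT); /* after fill_super() */
	return c.violations;
}

/* New order without the fix: open the device with s_umount still held. */
static int toy_mount_naive(void)
{
	struct toy_ctx c = { .violations = 0 };

	toy_lock(&c, TOY_S_UMOUNT);
	toy_open_bdev(&c);            /* open_mutex taken inside s_umount */
	toy_unlock(&c, TOY_S_UMOUNT);
	return c.violations;
}

/* New order with the fix: drop s_umount, open, then reacquire. */
static int toy_mount_fixed(void)
{
	struct toy_ctx c = { .violations = 0 };

	toy_lock(&c, TOY_S_UMOUNT);
	toy_unlock(&c, TOY_S_UMOUNT);
	toy_open_bdev(&c);
	toy_lock(&c, TOY_S_UMOUNT);   /* reacquired before fill_super() */
	toy_unlock(&c, TOY_S_UMOUNT);
	return c.violations;
}
```

Only toy_mount_naive() records a violation. The model says nothing about
what other tasks can observe while s_umount is dropped, which is what
the following paragraphs deal with.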

This is straightforward but has consequences. Iterators over
super_blocks (global list of superblocks) and fs_supers (list of
superblocks for a given filesystem type) that grab a temporary reference
to the superblock can now also grab sb->s_umount while the creator of
the superblock is opening block devices before they have managed to
reacquire sb->s_umount to call fill_super().

So whereas before such iterators or concurrent mounters would have
simply slept on s_umount until SB_BORN was set or the superblock was
discarded due to initialization failure, they would now spin.

Especially since the task that created the new superblock could be
sleeping on bdev_lock or open_mutex, one iterator or concurrent mounter
waiting on SB_BORN will always spin somewhere.

This is all caused by requiring sb->s_umount to be held to check whether
the superblock is still alive or has been SB_BORN yet. To fix this
properly a method to wait on nascent superblocks to either become born
(SB_BORN) or dying (SB_DYING) without requiring s_umount to be held is
added using a wait_var_event() mechanism. This allows concurrent
iterators and mounters to sleep and be woken when the superblock is
SB_BORN or SB_DYING. This allows for other simplifications as well. A
few of them are already included with more to come next cycle.

A caller realizing that a superblock isn't SB_BORN yet adds itself to a
waitqueue and will be woken if the superblock is SB_BORN or SB_DYING.
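
The waiting scheme can be approximated in userspace with a condition
variable. This is a sketch only: the kernel side uses wait_var_event()
as described above, while the model below substitutes a pthread mutex
and condition variable, and every toy_* name is hypothetical. The
internal mutex belongs to the wait primitive and does not represent
s_umount; the point is that the waiter sleeps on the flags alone.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum { TOY_SB_BORN = 1 << 0, TOY_SB_DYING = 1 << 1 };

struct toy_sb {
	pthread_mutex_t lock; /* protects flags, not a model of s_umount */
	pthread_cond_t wq;    /* waiters sleep here */
	unsigned int flags;
};

/* Creator side: publish a state change and wake every waiter. */
static void toy_sb_wake(struct toy_sb *sb, unsigned int flag)
{
	pthread_mutex_lock(&sb->lock);
	sb->flags |= flag;
	pthread_cond_broadcast(&sb->wq);
	pthread_mutex_unlock(&sb->lock);
}

/*
 * Waiter side: sleep until the superblock is born or dying.
 * Returns true only if it is born and not dying.
 */
static bool toy_sb_wait_born(struct toy_sb *sb)
{
	unsigned int flags;

	pthread_mutex_lock(&sb->lock);
	while (!(sb->flags & (TOY_SB_BORN | TOY_SB_DYING)))
		pthread_cond_wait(&sb->wq, &sb->lock);
	flags = sb->flags;
	pthread_mutex_unlock(&sb->lock);

	return (flags & TOY_SB_BORN) && !(flags & TOY_SB_DYING);
}

struct toy_creator {
	struct toy_sb *sb;
	unsigned int flag;
};

/* Stands in for the task that finishes or aborts superblock setup. */
static void *toy_creator_thread(void *arg)
{
	struct toy_creator *c = arg;

	toy_sb_wake(c->sb, c->flag);
	return NULL;
}

/* One scenario: the creator sets 'flag', the caller waits for it. */
static bool toy_scenario(unsigned int flag)
{
	struct toy_sb sb = { .flags = 0 };
	struct toy_creator c = { .sb = &sb, .flag = flag };
	pthread_t t;
	bool usable;

	pthread_mutex_init(&sb.lock, NULL);
	pthread_cond_init(&sb.wq, NULL);
	if (pthread_create(&t, NULL, toy_creator_thread, &c) != 0)
		return false;

	usable = toy_sb_wait_born(&sb);

	pthread_join(t, NULL);
	pthread_cond_destroy(&sb.wq);
	pthread_mutex_destroy(&sb.lock);
	return usable;
}
```

toy_scenario(TOY_SB_BORN) reports a usable superblock, while
toy_scenario(TOY_SB_DYING) reports that the waiter was woken only to
find the superblock going away.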

This also allows us to fix another long-standing issue properly. As
mentioned in the overview this work changes where block devices are
closed. Before this series block devices were closed in sb->put_super()
which is called with sb->s_umount held from generic_shutdown_super()
which itself is called from deactivate_locked_super().

To close a block device, blkdev_put() must be used, which can cause
sb->s_umount to be acquired when device changes are triggered that would
cause the block device to be invalidated. But since blkdev_put() was
called from sb->put_super() with sb->s_umount held this would deadlock.

To fix this closing block devices has been moved from sb->put_super()
into sb->kill_sb() which is called from deactivate_locked_super() after
generic_shutdown_super() has removed the superblock from the superblocks
list of the filesystem type and given up sb->s_umount.

This brings another problem to the table. Before, closing block devices
with sb->s_umount held from sb->put_super() guaranteed that a concurrent
mounter slept on sb->s_umount until the block device was closed.

However, sb->kill_sb() doesn't hold sb->s_umount anymore (otherwise
the aforementioned deadlock would just occur earlier so nothing would be
fixed). This may cause a concurrent mounter to fail with EBUSY in case
blkdev_put() hadn't finished yet. While that's probably not a big deal
it is something that can be avoided with the new mechanism.

To fix this, the removal of the superblock from the list of superblocks
of the filesystem type is moved from generic_shutdown_super() into
deactivate_locked_super() after sb->s_umount has been given up and
sb->kill_sb() has been called.

This is fine since generic_shutdown_super() wakes anyone waiting on
SB_DYING. This includes all iterators that don't need to wait for the
devices to be closed. They just care about whether the superblock is
still alive.

Any concurrent mounter on the other hand is made to wait for SB_DEAD.
This gets sent after block devices have been closed and the superblock
has been removed from the list of superblocks for the filesystem type.
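
The resulting teardown order can be sketched as a single-threaded state
model. This is one simplified reading of the description above, not the
actual fs/super.c code: the toy_* names are hypothetical, wakeups are
reduced to setting a flag, and the block device is a boolean. The model
records what is true at the moment the device is closed.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_SB_BORN = 1 << 0, TOY_SB_DYING = 1 << 1, TOY_SB_DEAD = 1 << 2 };

struct toy_sb {
	unsigned int flags;
	bool s_umount_held;
	bool on_fs_supers; /* still on the filesystem type's list */
	bool bdev_open;
	/* Snapshot taken at the moment the block device is closed. */
	bool umount_held_at_close;
	bool listed_at_close;
	bool dying_at_close;
	bool dead_at_close;
};

/* Models generic_shutdown_super(): runs with s_umount held, then drops it. */
static void toy_generic_shutdown_super(struct toy_sb *sb)
{
	/* sb->put_super() would run here and no longer closes devices. */
	sb->flags |= TOY_SB_DYING; /* iterators waiting on the sb are woken */
	sb->s_umount_held = false;
}

/* Models sb->kill_sb(): closes the device after s_umount is gone. */
static void toy_kill_sb(struct toy_sb *sb)
{
	toy_generic_shutdown_super(sb);

	sb->umount_held_at_close = sb->s_umount_held;
	sb->listed_at_close = sb->on_fs_supers;
	sb->dying_at_close = (sb->flags & TOY_SB_DYING) != 0;
	sb->dead_at_close = (sb->flags & TOY_SB_DEAD) != 0;
	sb->bdev_open = false;
}

/* Models deactivate_locked_super(): unlist and signal DEAD last. */
static void toy_deactivate_locked_super(struct toy_sb *sb)
{
	toy_kill_sb(sb);
	sb->on_fs_supers = false;
	sb->flags |= TOY_SB_DEAD; /* concurrent mounters are woken */
}

/* A mounter that found this superblock may only proceed once it is dead. */
static bool toy_mounter_may_retry(struct toy_sb sb)
{
	return (sb.flags & TOY_SB_DEAD) != 0;
}

/* Tears down a born, mounted superblock and returns the final state. */
static struct toy_sb toy_run_teardown(void)
{
	struct toy_sb sb = {
		.flags = TOY_SB_BORN,
		.s_umount_held = true,
		.on_fs_supers = true,
		.bdev_open = true,
	};

	toy_deactivate_locked_super(&sb);
	return sb;
}
```

In this model the device is closed without s_umount held, while the
superblock is already marked dying but still listed and not yet dead, so
a mounter finds it and waits instead of racing with the close.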

This should leave us in a much better state overall.

/* Testing */
clang: Ubuntu clang version 15.0.7
gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0

All generic super patches are based on v6.5-rc1 and have been sitting in
linux-next. Darrick's trees bring in v6.5-rc2. No build failures or
warnings were observed. All old and new tests in selftests, and LTP pass
without regressions.

/* Conflicts */
This will have conflicts with the following trees:

(1) linux-next: manual merge of the vfs-brauner tree with the xfs tree
    https://lore.kernel.org/lkml/20230823093852.7bf03b7e@xxxxxxxxxxxxxxxx

(2) Re: linux-next: manual merge of the vfs-brauner tree with the ext4 tree
    https://lore.kernel.org/lkml/20230821102559.35c8ef51@xxxxxxxxxxxxxxxx

    The link to the "Re:" is intentional as it contains the correct
    conflict resolution.

(3) linux-next: manual merge of the block tree with the djw-vfs, vfs-brauner trees
    https://lore.kernel.org/lkml/20230822131541.7667f165@xxxxxxxxxxxxxxxx

(4) This will also cause a minor merge conflict with the v6.6-vfs.ctime
    tag which I would recommend to merge first should you decide to
    pull this. My proposed conflict resolution is below:

diff --cc fs/ext4/super.c
index cb1ff47af156,60d2815a0b7e..73547d2334fd

--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@@ -7278,8 -7271,8 +7271,8 @@@ static struct file_system_type ext4_fs_
        .name                   = "ext4",
        .init_fs_context        = ext4_init_fs_context,
        .parameters             = ext4_param_specs,
-       .kill_sb                = kill_block_super,
+       .kill_sb                = ext4_kill_sb,
 -      .fs_flags               = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
 +      .fs_flags               = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
  };
  MODULE_ALIAS_FS("ext4");

diff --cc fs/xfs/xfs_super.c
index 4b10edb2c972,8fee15292499..c79eac048456
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@@ -2008,8 -2032,8 +2032,8 @@@ static struct file_system_type xfs_fs_t
        .name                   = "xfs",
        .init_fs_context        = xfs_init_fs_context,
        .parameters             = xfs_fs_parameters,
-       .kill_sb                = kill_block_super,
+       .kill_sb                = xfs_kill_sb,
 -      .fs_flags               = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
 +      .fs_flags               = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
  };
  MODULE_ALIAS_FS("xfs");

The following changes since commit fdf0eaf11452d72945af31804e2a1048ee1b574c:

  Linux 6.5-rc2 (2023-07-16 15:10:37 -0700)

are available in the Git repository at:

  git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/vfs/vfs tags/v6.6-vfs.super

for you to fetch changes up to cd4284cfd3e11c7a49e4808f76f53284d47d04dd:

  Merge tag 'vfs-6.6-merge-3' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux (2023-08-23 13:09:22 +0200)

Please consider pulling these changes from the signed v6.6-vfs.super tag.

Thanks!
Christian

----------------------------------------------------------------
v6.6-vfs.super

----------------------------------------------------------------
Christian Brauner (7):
      super: use locking helpers
      super: make locking naming consistent
      super: wait for nascent superblocks
      super: wait until we passed kill super
      super: use higher-level helper for {freeze,thaw}
      Merge tag 'vfs-6.6-merge-2' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux
      Merge tag 'vfs-6.6-merge-3' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux

Christoph Hellwig (38):
      fs: stop using bdev->bd_super in mark_buffer_write_io_error
      ext4: don't use bdev->bd_super in __ext4_journal_get_write_access
      ocfs2: stop using bdev->bd_super for journal error logging
      fs, block: remove bdev->bd_super
      xfs: reformat the xfs_fs_free prototype
      xfs: remove a superfluous s_fs_info NULL check in xfs_fs_put_super
      xfs: free the xfs_mount in ->kill_sb
      xfs: remove xfs_blkdev_put
      xfs: close the RT and log block devices in xfs_free_buftarg
      xfs: close the external block devices in xfs_mount_free
      xfs: document the invalidate_bdev call in invalidate_bdev
      ext4: close the external journal device in ->kill_sb
      exfat: don't RCU-free the sbi
      exfat: free the sbi and iocharset in ->kill_sb
      ntfs3: rename put_ntfs ntfs3_free_sbi
      ntfs3: don't call sync_blockdev in ntfs_put_super
      ntfs3: free the sbi in ->kill_sb
      fs: export setup_bdev_super
      nilfs2: use setup_bdev_super to de-duplicate the mount code
      ext4: make the IS_EXT2_SB/IS_EXT3_SB checks more robust
      fs: use the super_block as holder when mounting file systems
      fs: stop using get_super in fs_mark_dead
      fs: export fs_holder_ops
      ext4: drop s_umount over opening the log device
      ext4: use fs_holder_ops for the log device
      xfs: drop s_umount over opening the log and RT devices
      xfs use fs_holder_ops for the log and RT devices
      nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
      block: simplify the disk_force_media_change interface
      floppy: call disk_force_media_change when changing the format
      amiflop: don't call fsync_bdev in FDFMTBEG
      dasd: also call __invalidate_device when setting the device offline
      block: drop the "busy inodes on changed media" log message
      block: consolidate __invalidate_device and fsync_bdev
      block: call into the file system for bdev_mark_dead
      block: call into the file system for ioctl BLKFLSBUF
      fs: remove get_super
      fs: simplify invalidate_inodes

Darrick J. Wong (3):
      fs: distinguish between user initiated freeze and kernel initiated freeze
      fs: wait for partially frozen filesystems
      xfs: stabilize fs summary counters for online fsck

Jan Kara (1):
      fs: open block device after superblock creation

 Documentation/filesystems/vfs.rst |   6 +-
 block/bdev.c                      |  69 ++--
 block/disk-events.c               |  23 +-
 block/genhd.c                     |  45 +--
 block/ioctl.c                     |   9 +-
 block/partitions/core.c           |   5 +-
 drivers/block/amiflop.c           |   1 -
 drivers/block/floppy.c            |   2 +-
 drivers/block/loop.c              |   6 +-
 drivers/block/nbd.c               |   8 +-
 drivers/s390/block/dasd.c         |   7 +-
 fs/buffer.c                       |  11 +-
 fs/cramfs/inode.c                 |   8 +-
 fs/exfat/exfat_fs.h               |   2 -
 fs/exfat/super.c                  |  39 +-
 fs/ext4/ext4_jbd2.c               |   3 +-
 fs/ext4/super.c                   |  69 ++--
 fs/f2fs/gc.c                      |   8 +-
 fs/f2fs/super.c                   |   7 +-
 fs/fs-writeback.c                 |   4 +-
 fs/gfs2/super.c                   |  12 +-
 fs/gfs2/sys.c                     |   4 +-
 fs/inode.c                        |  17 +-
 fs/internal.h                     |   4 +-
 fs/ioctl.c                        |   8 +-
 fs/nilfs2/super.c                 |  81 ++--
 fs/ntfs3/super.c                  |  33 +-
 fs/ocfs2/journal.c                |   6 +-
 fs/romfs/super.c                  |  10 +-
 fs/super.c                        | 765 ++++++++++++++++++++++++++------------
 fs/xfs/scrub/fscounters.c         | 188 ++++++++--
 fs/xfs/scrub/scrub.c              |   6 +-
 fs/xfs/scrub/scrub.h              |   1 +
 fs/xfs/scrub/trace.h              |  26 ++
 fs/xfs/xfs_buf.c                  |   7 +-
 fs/xfs/xfs_super.c                | 136 ++++---
 include/linux/blk_types.h         |   1 -
 include/linux/blkdev.h            |  15 +-
 include/linux/fs.h                |  18 +-
 include/linux/fs_context.h        |   2 +
 40 files changed, 1043 insertions(+), 629 deletions(-)