On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote: > This is my latest attempt at removing the requirement for an exclusive > lock on a directory which performing updates in this. This version, > inspired by Dave Chinner, goes a step further and allow async updates. > > The inode operation still requires the inode lock, at least a shared > lock, but may return -EINPROGRES and then continue asynchronously > without needing any ongoing lock on the directory. > > An exclusive lock on the dentry is held across the entire operation. > > This change requires various extra checks. rmdir must ensure there is > no async creation still happening. rename between directories must > ensure non of the relevant ancestors are undergoing async rename. There > may be or checks that I need to consider - mounting? Mounting takes an exclusive lock on the target inode in do_lock_mount() and finish_automount(). As long as dont_mount() can't happen asynchronously in vfs_rmdir(), vfs_unlink() or vfs_rename() it should be fine. > > One other important change since my previous posting is that I've > dropped the idea of taking a separate exclusive lock on the directory > when the fs doesn't support shared locking. This cannot work as it > doeesn't prevent lookups and filesystems don't expect a lookup while > they are changing a directory. So instead we need to choose between > exclusive or shared for the inode on a case-by-case basis. Which is possibly fine if we do it similar to what I suggested in the series. As it stands with the separate methods it's a no-go for me. But that's a solvable problem, I think. > To make this choice we divide all ops into four groups: create, remove, > rename, open/create. If an inode has no operations in the group that > require an exclusive lock, then a flag is set on the inode so that > various code knows that a shared lock is sufficient. If the flag is not > set, an exclusive lock is obtained. > > I've also added rename handling and converted NFS to use all _async ops. > > The motivation for this comes from the general increase in scale of > systems. We can support very large directories and many-core systems > and applications that choose to use large directories can hit > unnecessary contention. > > NFS can easily hit this when used over a high-latency link. > Lustre already has code to allow concurrent directory updates in the > back-end filesystem (ldiskfs - a slightly modified ext4). > Lustre developers believe this would also benefit the client-side > filesystem with large core counts. > > The idea behind the async support is to eventually connect this to > io_uring so that one process can launch several concurrent directory > operations. I have not looked deeply into io_uring and cannot be > certain that the interface I've provided will be able to be used. I > would welcome any advice on that matter, though I hope to find time to > explore myself. For now if any _async op returns -EINPROGRESS we simply > wait for the callback to indicate completion. > > Test status: only light testing. It doesn't easily blow up, but lockdep > complains that repeated calls to d_update_wait() are bad, even though > it has balanced acquire and release calls. Weird? > > Thanks, > NeilBrown > > [PATCH 01/19] VFS: introduce vfs_mkdir_return() > [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel() > [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() > [PATCH 04/19] VFS: change kern_path_locked() and > [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() > [PATCH 06/19] VFS: repack DENTRY_ flags. > [PATCH 07/19] VFS: repack LOOKUP_ bit flags. > [PATCH 08/19] VFS: introduce lookup_and_lock() and friends > [PATCH 09/19] VFS: add _async versions of the various directory > [PATCH 10/19] VFS: introduce inode flags to report locking needs for > [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use > [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock > [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with > [PATCH 14/19] VFS: Ensure no async updates happening in directory > [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when > [PATCH 16/19] VFS: add lookup_and_lock_rename() > [PATCH 17/19] nfsd: use lookup_and_lock_one() and > [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async > [PATCH 19/19] nfs: switch to _async for all directory ops.