On Fri, 07 Feb 2025, John Stoffel wrote: > >>>>> "NeilBrown" == NeilBrown <neilb@xxxxxxx> writes: > > > This is my latest attempt at removing the requirement for an exclusive > > lock on a directory which performing updates in this. This version, > > inspired by Dave Chinner, goes a step further and allow async updates. > > This initial sentence reads poorly to me. I think you maybe are > trying to say: > > This is my latest attempt to removing the requirement for writers to > have an exclusive lock on a directory when performing updates on > entries in that directory. This allows for parallel updates by > multiple processes (connections? hosts? clients?) to improve scaling > of large filesystems. > > I get what you're trying to do here, and I applaud it! I just > struggled over the intro here. Yes, my intro was rather poorly worded. I think your version is much better. Thanks. NeilBrown > > > > The inode operation still requires the inode lock, at least a shared > > lock, but may return -EINPROGRES and then continue asynchronously > > without needing any ongoing lock on the directory. > > > An exclusive lock on the dentry is held across the entire operation. > > > This change requires various extra checks. rmdir must ensure there is > > no async creation still happening. rename between directories must > > ensure non of the relevant ancestors are undergoing async rename. There > > may be or checks that I need to consider - mounting? > > > One other important change since my previous posting is that I've > > dropped the idea of taking a separate exclusive lock on the directory > > when the fs doesn't support shared locking. This cannot work as it > > doeesn't prevent lookups and filesystems don't expect a lookup while > > they are changing a directory. So instead we need to choose between > > exclusive or shared for the inode on a case-by-case basis. > > > To make this choice we divide all ops into four groups: create, remove, > > rename, open/create. If an inode has no operations in the group that > > require an exclusive lock, then a flag is set on the inode so that > > various code knows that a shared lock is sufficient. If the flag is not > > set, an exclusive lock is obtained. > > > I've also added rename handling and converted NFS to use all _async ops. > > > The motivation for this comes from the general increase in scale of > > systems. We can support very large directories and many-core systems > > and applications that choose to use large directories can hit > > unnecessary contention. > > > NFS can easily hit this when used over a high-latency link. > > Lustre already has code to allow concurrent directory updates in the > > back-end filesystem (ldiskfs - a slightly modified ext4). > > Lustre developers believe this would also benefit the client-side > > filesystem with large core counts. > > > The idea behind the async support is to eventually connect this to > > io_uring so that one process can launch several concurrent directory > > operations. I have not looked deeply into io_uring and cannot be > > certain that the interface I've provided will be able to be used. I > > would welcome any advice on that matter, though I hope to find time to > > explore myself. For now if any _async op returns -EINPROGRESS we simply > > wait for the callback to indicate completion. > > > Test status: only light testing. It doesn't easily blow up, but lockdep > > complains that repeated calls to d_update_wait() are bad, even though > > it has balanced acquire and release calls. Weird? > > > Thanks, > > NeilBrown > > > [PATCH 01/19] VFS: introduce vfs_mkdir_return() > > [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel() > > [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() > > [PATCH 04/19] VFS: change kern_path_locked() and > > [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() > > [PATCH 06/19] VFS: repack DENTRY_ flags. > > [PATCH 07/19] VFS: repack LOOKUP_ bit flags. > > [PATCH 08/19] VFS: introduce lookup_and_lock() and friends > > [PATCH 09/19] VFS: add _async versions of the various directory > > [PATCH 10/19] VFS: introduce inode flags to report locking needs for > > [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use > > [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock > > [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with > > [PATCH 14/19] VFS: Ensure no async updates happening in directory > > [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when > > [PATCH 16/19] VFS: add lookup_and_lock_rename() > > [PATCH 17/19] nfsd: use lookup_and_lock_one() and > > [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async > > [PATCH 19/19] nfs: switch to _async for all directory ops. > >