Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer

"NeilBrown" <neilb@xxxxxxx> · Wed, 22 Jan 2025 12:04:13 +1100

On Wed, 22 Jan 2025, Dave Chinner wrote:
> 
> Anyone who has been following io_uring development should know all
> these things about async processing already. There's a reason that
> that infrastructure exists: async processing is more efficient and
> faster than the concurrent synchronous processing model being
> proposed here....

I understand that asynchronous is best.  I think we are a long way from
achieving that.  I think shared locking is still a good step in that
direction.

Shared locking allows the exclusion to be pushed down into the
filesystem to whatever extent the filesystem needs.  That will be needed
for an async approach too.

We already have a hint of async in the dcache in that ->lookup() can
complete without a result if an intent flag is set.  The actual lookup
might then happen any time before the intended operation completes.  For
NFS exclusive open, that lookup is combined with the create/open.  For
unlink (which doesn't have an intent flag yet) it could be combined with
the NFS REMOVE operation (if that seemed like a good idea).  Other
filesystems could do other things.  But this is just a hint of async as
yet.

I imagine that in the longer term we could drop the i_rwsem completely
for directories.  The VFS would set up a locked dentry, much like it does
before ->lookup, and then call into the filesystem.  The filesystem
might do the op synchronously, or might take note of what is needed and
schedule the relevant changes, or whatever.  When the op finishes, the
filesystem does clear_and_wake_up_bit() (or similar) after stashing the
result ...
somewhere.

For synchronous operations like syscalls, an on-stack result struct would
be passed which contains an error status and optionally a new dentry (if
e.g. mkdir found it needed to splice in an existing dentry).
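
To make the synchronous case concrete, here is a minimal userspace
sketch, not kernel code.  Every name in it (sk_dentry, sk_dirop_result,
sk_vfs_mkdir and so on) is hypothetical, and a pthread mutex and condvar
stand in for the lock bit and for clear_and_wake_up_bit().  It only
shows the ordering: lock the dentry, call the filesystem with an
on-stack result, wait for the unlock, read the result.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry with a per-dentry lock bit. */
struct sk_dentry {
	pthread_mutex_t mu;
	pthread_cond_t wq;	/* stands in for the bit waitqueue */
	bool locked;		/* stands in for the lock bit */
};

/* Hypothetical result struct: error status plus an optional dentry. */
struct sk_dirop_result {
	int error;			/* 0 on success, negative on failure */
	struct sk_dentry *dentry;	/* dentry to splice in, or NULL */
};

static void sk_dentry_init(struct sk_dentry *d)
{
	pthread_mutex_init(&d->mu, NULL);
	pthread_cond_init(&d->wq, NULL);
	d->locked = false;
}

/* VFS side: take the per-dentry lock, waiting if someone else holds it. */
static void sk_dentry_lock(struct sk_dentry *d)
{
	pthread_mutex_lock(&d->mu);
	while (d->locked)
		pthread_cond_wait(&d->wq, &d->mu);
	d->locked = true;
	pthread_mutex_unlock(&d->mu);
}

/* VFS side: wait until the filesystem has signalled completion. */
static void sk_dentry_wait_unlocked(struct sk_dentry *d)
{
	pthread_mutex_lock(&d->mu);
	while (d->locked)
		pthread_cond_wait(&d->wq, &d->mu);
	pthread_mutex_unlock(&d->mu);
}

/* Filesystem side: stash the result first, then unlock and wake waiters. */
static void sk_dirop_complete(struct sk_dentry *d,
			      struct sk_dirop_result *res, int error)
{
	res->error = error;
	res->dentry = NULL;		/* nothing to splice in here */
	pthread_mutex_lock(&d->mu);
	assert(d->locked);
	d->locked = false;
	pthread_cond_broadcast(&d->wq);
	pthread_mutex_unlock(&d->mu);
}

/* Toy filesystem op that happens to complete before returning. */
static void sk_fs_mkdir(struct sk_dentry *d, struct sk_dirop_result *res)
{
	sk_dirop_complete(d, res, 0);
}

/* Syscall-style caller: the result struct lives on the stack. */
static int sk_vfs_mkdir(struct sk_dentry *d)
{
	struct sk_dirop_result res = { .error = -1, .dentry = NULL };

	sk_dentry_lock(d);
	sk_fs_mkdir(d, &res);
	sk_dentry_wait_unlocked(d);
	return res.error;
}
```

A filesystem that deferred the work would return from sk_fs_mkdir()
with the dentry still locked, and the caller would sleep in
sk_dentry_wait_unlocked() until some other thread called
sk_dirop_complete().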

For async operations io_uring would allocate the result struct and would
store in it a callback function to be called after the
clear_and_wake_up_bit(). 
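
A rough single-threaded sketch of the callback variant follows, again
in userspace and with hypothetical names throughout (ad_dentry,
async_result, toy_fs, submit_unlink).  The submitter owns the result
struct, the toy filesystem just records the pending op, and the
callback runs only after the result is stashed and the lock is dropped.
An atomic store stands in for clear_and_wake_up_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dentry with a lock bit. */
struct ad_dentry {
	atomic_bool locked;
};

/* Hypothetical result struct allocated by the submitter. */
struct async_result {
	int error;
	void (*done)(struct async_result *res);	/* runs after the unlock */
	void *ctx;				/* submitter's private data */
	struct ad_dentry *dentry;
};

/* Toy filesystem with room for one deferred operation. */
struct toy_fs {
	struct async_result *pending;
};

/* Lock the dentry and hand the op to the filesystem without waiting. */
static int submit_unlink(struct toy_fs *fs, struct ad_dentry *d,
			 struct async_result *res)
{
	if (fs->pending)
		return -1;		/* toy limit: one op in flight */
	if (atomic_exchange(&d->locked, true))
		return -1;		/* another op holds this dentry */
	res->dentry = d;
	fs->pending = res;		/* "take note of what is needed" */
	return 0;
}

/* Later: the filesystem finishes the deferred op. */
static void toy_fs_run_pending(struct toy_fs *fs)
{
	struct async_result *res = fs->pending;

	if (!res)
		return;
	assert(res->dentry != NULL);
	fs->pending = NULL;
	res->error = 0;					/* stash the result */
	atomic_store(&res->dentry->locked, false);	/* unlock */
	if (res->done)
		res->done(res);				/* then the callback */
}

/* Example callback: count successful completions. */
static void count_completion(struct async_result *res)
{
	int *count = res->ctx;

	if (res->error == 0)
		(*count)++;
}
```

Calling the callback after the unlock means the callback can start a
new operation on the same dentry without deadlocking against itself.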

Rather than using i_rwsem to block additions to a directory while it is
being removed, we would lock the dentry (so no more locked children can
be added) and wait for any locked children to be unlocked.
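
As an illustration only, that removal protocol could look something
like the userspace sketch below.  The names (dir_dentry,
child_op_begin, dir_lock_for_removal) are hypothetical, a counter
stands in for the set of locked children, and returning -ENOENT to a
late child operation is an assumption rather than a settled choice.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical directory dentry tracking in-flight child operations. */
struct dir_dentry {
	pthread_mutex_t mu;
	pthread_cond_t wq;
	bool locked;			/* directory is being removed */
	unsigned int locked_children;	/* children with an op in flight */
};

static void dir_dentry_init(struct dir_dentry *dir)
{
	pthread_mutex_init(&dir->mu, NULL);
	pthread_cond_init(&dir->wq, NULL);
	dir->locked = false;
	dir->locked_children = 0;
}

/* Start an op on a child; refused once the directory is locked. */
static int child_op_begin(struct dir_dentry *dir)
{
	int ret = 0;

	pthread_mutex_lock(&dir->mu);
	if (dir->locked)
		ret = -ENOENT;		/* assumed error code */
	else
		dir->locked_children++;
	pthread_mutex_unlock(&dir->mu);
	return ret;
}

/* Finish an op on a child; wake a waiting rmdir when the last one ends. */
static void child_op_end(struct dir_dentry *dir)
{
	pthread_mutex_lock(&dir->mu);
	assert(dir->locked_children > 0);
	if (--dir->locked_children == 0)
		pthread_cond_broadcast(&dir->wq);
	pthread_mutex_unlock(&dir->mu);
}

/* rmdir side: block new children, then wait for existing ones to drain. */
static void dir_lock_for_removal(struct dir_dentry *dir)
{
	pthread_mutex_lock(&dir->mu);
	dir->locked = true;
	while (dir->locked_children > 0)
		pthread_cond_wait(&dir->wq, &dir->mu);
	pthread_mutex_unlock(&dir->mu);
}
```

Setting the flag before waiting is what guarantees the wait
terminates: no new child can be locked once removal has started, so the
count can only go down.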

There are doubtless details that I have missed but it is clear that to
allow async dirops we need to remove the need for i_rwsem, and I think
transitioning from exclusive to shared is a useful step in that
direction.

I'm almost tempted to add the result struct to the new _shared
inode_operations that I want to add, but that would likely be premature.

Thanks,
NeilBrown