Neil,

Firstly, thank you for your work on this. I'm probably the main beneficiary of this (NFSD) effort atm, so I feel extra special and lucky!

I have done some quick artificial tests similar to before, where I am using an NFS server and client separated by an (extreme) 200ms of latency (great for testing parallelism). I am only using NFSv3 due to the NFSD_CACHE_SIZE_SLOTS_PER_SESSION parallelism limitations for NFSv4.

First, a client direct to server (VFS) with 10 simultaneous create processes hitting the same directory:

client1 # for x in {1..1000}; do echo /srv/server1/data/touch.$x; done | xargs -n1 -P 10 -iX -t touch X 2>&1 | pv -l -a >|/dev/null

Without the patch (on the client), this reaches a steady state of 2.4 creates/s, and increasing the number of parallel create processes does not change this aggregate performance. With the patch, the creation rate increases to 15 creates/s, and with 100 processes it scales further to 121 creates/s.

Now for the re-export case (NFSD), where an intermediary server re-exports the originating server (200ms away) to clients on its local LAN: there is no noticeable improvement for a single (unpatched) client, but we do see an aggregate improvement when we use multiple clients at once.

# pdsh -Rssh -w 'client[1-10]' 'for x in {1..1000}; do echo /srv/reexport1/data/$(hostname -s).$x; done | xargs -n1 -P 10 -iX -t touch X 2>&1' | pv -l -a >|/dev/null

Without the patch applied to the re-export server, the aggregate is around 2.2 creates/s, which is similar to doing it directly to the originating server from a single client (above). With the patch, the aggregate increases to 15 creates/s for 10 clients, which again matches the results of a single patched client. Not quite a x10 increase, but a healthy improvement nonetheless.

However, it is at this point that I started to experience some stability issues with the re-export server that are not present with the vanilla unpatched v5.19-rc2 kernel. In particular, the knfsd threads start to lock up with stack traces like this:

[ 1234.460696] INFO: task nfsd:5514 blocked for more than 123 seconds.
[ 1234.461481]       Tainted: G        W   E     5.19.0-1.dneg.x86_64 #1
[ 1234.462289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1234.463227] task:nfsd            state:D stack:    0 pid: 5514 ppid:     2 flags:0x00004000
[ 1234.464212] Call Trace:
[ 1234.464677]  <TASK>
[ 1234.465104]  __schedule+0x2a9/0x8a0
[ 1234.465663]  schedule+0x55/0xc0
[ 1234.466183]  ? nfs_lookup_revalidate_dentry+0x3a0/0x3a0 [nfs]
[ 1234.466995]  __nfs_lookup_revalidate+0xdf/0x120 [nfs]
[ 1234.467732]  ? put_prev_task_stop+0x170/0x170
[ 1234.468374]  nfs_lookup_revalidate+0x15/0x20 [nfs]
[ 1234.469073]  lookup_dcache+0x5a/0x80
[ 1234.469639]  lookup_one_unlocked+0x59/0xa0
[ 1234.470244]  lookup_one_len_unlocked+0x1d/0x20
[ 1234.470951]  nfsd_lookup_dentry+0x190/0x470 [nfsd]
[ 1234.471663]  nfsd_lookup+0x88/0x1b0 [nfsd]
[ 1234.472294]  nfsd3_proc_lookup+0xb4/0x100 [nfsd]
[ 1234.473012]  nfsd_dispatch+0x161/0x290 [nfsd]
[ 1234.473689]  svc_process_common+0x48a/0x620 [sunrpc]
[ 1234.474402]  ? nfsd_svc+0x330/0x330 [nfsd]
[ 1234.475038]  ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
[ 1234.475772]  svc_process+0xbc/0xf0 [sunrpc]
[ 1234.476408]  nfsd+0xda/0x190 [nfsd]
[ 1234.477011]  kthread+0xf0/0x120
[ 1234.477522]  ? kthread_complete_and_exit+0x20/0x20
[ 1234.478199]  ret_from_fork+0x22/0x30
[ 1234.478755]  </TASK>

For whatever reason, they seem to affect our Netapp mounts and re-exports rather than our originating Linux NFS servers (against which all tests were done).
This may be related to the fact that those Netapps serve our home directories, so there could be some unique locking patterns going on there. This issue made things a bit too unstable to test at larger scales or with our production workloads.

So all in all, the performance improvements in the knfsd re-export case are looking great, and we have real-world use cases that this helps with (batch processing workloads with latencies >10ms). If we can figure out the hanging knfsd threads, then I can test it more heavily.

Many thanks,

Daire

On Tue, 14 Jun 2022 at 00:19, NeilBrown <neilb@xxxxxxx> wrote:
>
> VFS currently holds an exclusive lock on a directory during create,
> unlink, rename.  This imposes serialisation on all filesystems though
> some may not benefit from it, and some may be able to provide finer
> grained locking internally, thus reducing contention.
>
> This series allows the filesystem to request that the inode lock be
> shared rather than exclusive.  In that case an exclusive lock will be
> held on the dentry instead, much as is done for parallel lookup.
>
> The NFS filesystem can easily support concurrent updates (server does
> any needed serialisation) so it is converted.
>
> This series also converts nfsd to use the new interfaces so concurrent
> incoming NFS requests in the one directory can be handled concurrently.
>
> As a net result, if an NFS mounted filesystem is re-exported over NFS,
> then multiple clients can create files in a single directory and all
> synchronisation will be handled on the final server.  This helps hide
> latency on the link from client to server.
>
> I include a few nfsd patches that aren't strictly needed for this work,
> but seem to be a logical consequence of the changes that I did have to
> make.
>
> I have only tested this lightly.  In particular the rename support is
> quite new and I haven't tried to break it yet.
>
> I post this for general review, and hopefully extra testing... Daire
> Byrne has expressed interest in the NFS re-export parallelism.
>
> NeilBrown
>
>
> ---
>
> NeilBrown (12):
>       VFS: support parallel updates in the one directory.
>       VFS: move EEXIST and ENOENT tests into lookup_hash_update()
>       VFS: move want_write checks into lookup_hash_update()
>       VFS: move dput() and mnt_drop_write() into done_path_update()
>       VFS: export done_path_update()
>       VFS: support concurrent renames.
>       NFS: support parallel updates in the one directory.
>       nfsd: allow parallel creates from nfsd
>       nfsd: support concurrent renames.
>       nfsd: reduce locking in nfsd_lookup()
>       nfsd: use (un)lock_inode instead of fh_(un)lock
>       nfsd: discard fh_locked flag and fh_lock/fh_unlock
>
>
>  fs/dcache.c            |  59 ++++-
>  fs/namei.c             | 578 ++++++++++++++++++++++++++++++++---------
>  fs/nfs/dir.c           |  29 ++-
>  fs/nfs/inode.c         |   2 +
>  fs/nfs/unlink.c        |   5 +-
>  fs/nfsd/nfs2acl.c      |   6 +-
>  fs/nfsd/nfs3acl.c      |   4 +-
>  fs/nfsd/nfs3proc.c     |  37 +--
>  fs/nfsd/nfs4acl.c      |   7 +-
>  fs/nfsd/nfs4proc.c     |  61 ++---
>  fs/nfsd/nfs4state.c    |   8 +-
>  fs/nfsd/nfsfh.c        |  10 +-
>  fs/nfsd/nfsfh.h        |  58 +----
>  fs/nfsd/nfsproc.c      |  31 +--
>  fs/nfsd/vfs.c          | 243 ++++++++---------
>  fs/nfsd/vfs.h          |   8 +-
>  include/linux/dcache.h |  27 ++
>  include/linux/fs.h     |   1 +
>  include/linux/namei.h  |  30 ++-
>  19 files changed, 791 insertions(+), 413 deletions(-)
>
> --
> Signature
>
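For anyone skimming the cover letter above, the shape of the change is roughly: instead of always taking the directory's i_rwsem exclusively for create/unlink/rename, a filesystem that opts in gets a shared directory lock, and per-name exclusion is taken on the dentry itself, much as parallel lookup already does. A minimal conceptual sketch of that idea, not the actual patch API: the opt-in flag and lock_dentry_for_update() helper below are made up for illustration, while inode_lock_shared_nested()/inode_lock_nested() and I_MUTEX_PARENT are existing VFS primitives.

#include <linux/fs.h>
#include <linux/dcache.h>

/* Conceptual sketch only -- not the interface the series adds. */
static void start_dir_update(struct inode *dir, struct dentry *dentry,
			     bool fs_allows_parallel_updates /* hypothetical opt-in */)
{
	if (fs_allows_parallel_updates) {
		/* Shared lock: other updates in this directory may proceed
		 * concurrently; the filesystem (or, for NFS, the far server)
		 * does whatever serialisation it actually needs. */
		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
		/* Exclusion for this particular name is then taken on the
		 * dentry itself, in the spirit of parallel lookup. */
		lock_dentry_for_update(dentry);	/* hypothetical helper */
	} else {
		/* Current behaviour: one update at a time per directory. */
		inode_lock_nested(dir, I_MUTEX_PARENT);
	}
}

The interesting property for the re-export case is that the shared directory lock lets many nfsd threads issue creates in the same directory at once, so the 200ms round trips overlap instead of queueing behind a single exclusive lock.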