On Tue, Apr 25, 2023 at 1:32 PM Chris Chilvers <chilversc@xxxxxxxxx> wrote:
>
> We've been using NFS with FS-Cache to act as a caching proxy (re-export).
> When under load we've encountered an issue where all the nfsd processes
> seem to get stuck in I/O wait.
>
> The proxy is running an older version of the nfs-fscache-netfs branch taken
> from 17th Nov 2022:
> https://github.com/DaveWysochanskiRH/kernel/commit/52acbd4584d1b83c844371e48de1a1e39d255a6d
>
> The proxy mounts the source NFS server using NFS v3 with FS-Cache enabled.
> The clients mount the proxy using NFS v4.2 to avoid issues with file handle
> sizes.
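> Roughly, the setup looks like this (hostnames and export paths below are
> illustrative, not the real ones):
>
>   # On the proxy: mount the source server over NFSv3 with FS-Cache ("fsc")
>   mount -t nfs -o vers=3,fsc source-server:/export /srv/export
>
>   # /srv/export is then re-exported via /etc/exports; on the clients:
>   mount -t nfs -o vers=4.2 proxy-server:/srv/export /mnt/export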
>
> dmesg suggests that most of the blocked tasks are nfsd threads stuck in
> folio_wait_bit_common:
>
> INFO: task nfsd:180059 blocked for more than 120 seconds.
>       Not tainted 6.1.0-rc5+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:nfsd  state:D  stack:0  pid:180059  ppid:2  flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x31e/0x14a0
>  ? _raw_spin_unlock_irqrestore+0x27/0x50
>  schedule+0x6b/0x110
>  io_schedule+0x46/0x80
>  folio_wait_bit_common+0x124/0x340
>  ? xas_find+0x7c/0x1e0
>  ? xas_find_marked+0x1f7/0x370
>  ? filemap_invalidate_unlock_two+0x50/0x50
>  folio_wait_private_2+0x2c/0x50
>  nfs_release_folio+0x5e/0xb0 [nfs]
>  filemap_release_folio+0x66/0x80
>  invalidate_inode_pages2_range+0x240/0x400
>  invalidate_inode_pages2+0x17/0x20
>  nfs_clear_invalid_mapping+0x1d8/0x2d0 [nfs]
>  nfs_revalidate_mapping+0x55/0x70 [nfs]
>  nfs_file_read+0x4c/0xc0 [nfs]
>  generic_file_splice_read+0x8f/0x160
>  do_splice_to+0x7d/0xc0
>  splice_direct_to_actor+0xad/0x210
>  ? fsid_source+0x60/0x60 [nfsd]
>  ? nfsd_file_do_acquire+0xacf/0xbd0 [nfsd]
>  nfsd_splice_read+0x7c/0x120 [nfsd]
>  nfsd_read+0x147/0x1b0 [nfsd]
>  nfsd3_proc_read+0x1b5/0x2d0 [nfsd]
>  ? svcxdr_decode_nfs_fh3+0x4e/0x130 [nfsd]
>  nfsd_dispatch+0x173/0x2b0 [nfsd]
>  svc_process_common+0x3c8/0x620 [sunrpc]
>  ? nfsd_svc+0x3e0/0x3e0 [nfsd]
>  ? nfsd_shutdown_threads+0xb0/0xb0 [nfsd]
>  svc_process+0xb2/0x100 [sunrpc]
>  nfsd+0xda/0x190 [nfsd]
>  kthread+0xfa/0x130
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x1f/0x30
>  </TASK>
>
> Checking /proc/PID/stack for all the nfsd processes showed the following
> counts:
>
> * 72 - nfsd_file_do_acquire+0x20b/0xbd0
> * 36 - folio_wait_bit_common+0x124/0x340
> * 10 - fscache_begin_operation.part.0+0x288/0x2b0 [fscache]
> *  8 - __fscache_use_cookie+0x2e5/0x320 [fscache]
> *  2 - ext4_llseek+0x91/0x110
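> (Roughly how the counts were collected -- an approximation, modulo how the
> real command filtered the common scheduler frames at the top of each stack:)
>
>   for pid in $(pgrep nfsd); do
>       # print the first frame that isn't __schedule/schedule/io_schedule
>       awk '!/schedule/ {print $2; exit}' /proc/$pid/stack
>   done | sort | uniq -c | sort -rn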
>
> dmesg also contains a lot of errors about timeouts waiting for the cookie
> state to change:
>
> FS-Cache: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> FS-Cache: O-cookie c=0026b915 [fl=602c na=1 nA=2 s=L]
> FS-Cache: O-cookie V=00000001 [Infs,3.0,2,,8335540a,6465ebb4,,,d0,100000,100000,249f0,249f0,249f0,249f0,1]
> FS-Cache: O-key=[32] 'b4eb6564010000005509a81a02000000ffffffff000000000400fd0301000000'
> FS-Cache: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> FS-Cache: O-cookie c=0024fd88 [fl=602c na=1 nA=2 s=L]
> FS-Cache: O-cookie V=00000001 [Infs,3.0,2,,8335540a,6465ebb4,,,d0,100000,100000,249f0,249f0,249f0,249f0,1]
> FS-Cache: O-key=[32] 'b4eb6564010000007df7aa1702000000ffffffff000000000400fd0301000000'
> FS-Cache: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> FS-Cache: O-cookie c=0026d61a [fl=4024 na=1 nA=2 s=L]
> FS-Cache: O-cookie V=00000001 [Infs,3.0,2,,8335540a,6465ebb4,,,d0,100000,100000,249f0,249f0,249f0,249f0,1]
> FS-Cache: O-key=[32] 'b4eb656401000000b79947ce01000000ffffffff000000000400fd0301000000'
>
> The clients were shut down, but the proxy instance was kept for further
> diagnosis. The nfsd sockets remained stuck in CLOSE_WAIT, and the nfsd
> processes remained stuck on various tasks for at least 4 days. It seems
> at some point over the weekend the issue resolved itself, and now all the
> nfsd threads are idle.

Let me think about this a bit more, though at first glance it does not ring
a bell.

> I'm going to try to see if I can reproduce this on the latest versions of
> the patches with lockdep enabled. Though so far we've only seen this issue
> while the system is under a heavy production workload (rendering) after
> several days.

May I suggest using anna's linux-next branch for further testing? The v11
NFS netfs patches are in there, along with other patches for the next merge
window.

git://git.linux-nfs.org/projects/anna/linux-nfs.git

$ git log --oneline remotes/nfs-client-anna/linux-next | head --lines=20
e025f0a73f6a NFS: Cleanup unused rpc_clnt variable
c5733ae6dc89 NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
691d0b782066 SUNRPC: remove the maximum number of retries in call_bind_status
ec108d3cc766 NFS: Convert readdir page array functions to use a folio
61f02e0ab81e NFS: Convert the readdir array-of-pages into an array-of-folios
3db63daabe21 NFSv3: handle out-of-order write replies.
03f5bd75a4c1 NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
0631d5e02a1c NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
000dbe0bec05 NFS: Convert buffered read paths to use netfs when fscache is enabled
88a4d7bdeec9 NFS: Configure support for netfs when NFS fscache is configured
01c3a40084a4 NFS: Rename readpage_async_filler to nfs_read_add_folio
703c6d03f165 sunrpc: simplify one-level sysctl registration for debug_table
32e356be32b6 sunrpc: move sunrpc_table and proc routines above
c946cb69f238 sunrpc: simplify one-level sysctl registration for xs_tunables_table
17c6d0ce8340 sunrpc: simplify one-level sysctl registration for xr_tunables_table
39724217447f nfs: simplify two-level sysctl registration for nfs_cb_sysctls
a2183160ca7e nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
c1d889cf99b8 lockd: simplify two-level sysctl registration for nlm_sysctls
40882deb83c2 NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
09a9639e56c0 Linux 6.3-rc6
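For reference, something like this should land you on that branch for
testing (the local branch name is arbitrary):

$ git remote add nfs-client-anna git://git.linux-nfs.org/projects/anna/linux-nfs.git
$ git fetch nfs-client-anna
$ git checkout -b netfs-test nfs-client-anna/linux-next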
> On Thu, 23 Mar 2023 at 17:50, David Wysochanski <dwysocha@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 20, 2023 at 8:43 AM Dave Wysochanski <dwysocha@xxxxxxxxxx> wrote:
> > >
> > > Trond, this v11 patchset addresses your latest feedback on patch #2,
> > > and I did a little bit of cleanup to patch 3 (see Changes notes below).
> > > I'm not sure what further changes to make to patch #3 without a more
> > > in-depth review with specifics, if you feel the current approach is
> > > unacceptable [1].
> >
> > Trond and/or Anna,
> >
> > Have you had a chance to review this patchset, and do you have further
> > concerns?
> >
> > Note that patch #3 will require a small fixup to apply after v6.3-rc1
> > due to this commit:
> > 8bb7cd842c44 nfs: use bvec_set_page to initialize bvecs
> >
> > There is also still the small open issue of netfs counting read_bytes
> > in its unlock page path, but I view that as a netfs issue. I'll follow
> > up with David Howells on that.
> >
> > > Ben and Daire, if you could test this set and provide your feedback
> > > and a Tested-by: that would be appreciated. This set addresses the
> > > existing NFS + fscache performance concerns seen by a few users [5],
> > > which are due to the use of the deprecated single-page function,
> > > fscache_fallback_read_page(). However, until "known issue #1" below
> > > is also resolved, even with these patches, performance of NFS+fscache
> > > will still be a problem in some scenarios.
> > >
> > > This patchset converts the NFS with fscache buffered read IO paths to
> > > use the netfs API with a non-invasive approach. The existing NFS pgio
> > > layer does not need extensive changes, and this is the best way I've
> > > found so far to address Trond's previous concerns about modifying the
> > > IO path [2] as well as only enabling netfs when fscache is configured
> > > and enabled [3]. I have not attempted performance comparisons to
> > > address Chuck Lever's concern [4] because we are not converting the
> > > non-fscache enabled NFS IO paths to netfs.
> > >
> > > The patchset is based on Trond's latest 'testing' branch, which
> > > includes his folio patchset, and is based on 6.2-rc5. It has been
> > > pushed to github at:
> > > https://github.com/DaveWysochanskiRH/kernel/commits/nfs-fscache-netfs
> > > https://github.com/DaveWysochanskiRH/kernel/commit/6424e4f139652b7552eff26eb5da1f2282d35616
> > >
> > > Changes since v10 [6]
> > > =====================
> > > PATCH6: Dropped
> > > PATCH1: Rename nfs_pageio_add_page to nfs_read_add_folio
> > > PATCH2: Use anonymous union to add struct netfs_inode to nfs_inode (Trond) [7]
> > > PATCH3: Change nfs_netfs_readpage_release() to nfs_netfs_folio_unlock()
> > >
> > > Testing
> > > =======
> > > I did a full round of testing on this because it was rebased on top of
> > > Trond's testing branch, which included his folio series.
> > > All of my unit tests pass except the one affected by known issue #1
> > > below. Multiple runs of xfstests generic tests (applicable to NFS)
> > > were run against various servers, both with and without fscache
> > > enabled, and compared to baseline (Trond's testing branch). No new
> > > failures were observed with these patches, and in some xfstest
> > > instances, this patchset improves the results (some tests that were
> > > failing now pass).
> > > - hammerspace (pNFS flexfiles): NFS4.1, NFS4.2
> > > - NetApp (pNFS filelayout): NFS4.1, NFS4.0, NFS3
> > > - RHEL9: NFS4.2, NFS4.1, NFS4.0, NFS3
> > >
> > > Known issues
> > > ============
> > > 1. A unit test setting rsize < readahead does not properly read from
> > >    fscache but re-reads data from the NFS server.
> > >    * This will be fixed with another dhowells patch [8]:
> > >      "[PATCH v6 2/2] mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"
> > >    * Daire Byrne verified the patch fixes his issue as well.
> > >
> > > 2. "Cache volume key already in use" after xfstest runs involving
> > >    multiple mounts.
> > >    * A simple reproducer requires just two mounts, as follows:
> > >      mount -overs=4.1,fsc,nosharecache -o context=system_u:object_r:root_t:s0 nfs-server:/exp1 /mnt1
> > >      mount -overs=4.1,fsc,nosharecache -o context=system_u:object_r:root_t:s0 nfs-server:/exp2 /mnt2
> > >    * This should be fixed with dhowells patch [9]:
> > >      "[PATCH v5] vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing"
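> > > (To check whether a run hit issue #2, grepping the kernel log for the
> > > message above should be enough, e.g.:)
> > >
> > >   dmesg | grep 'Cache volume key already in use'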
> > >
> > > References
> > > ==========
> > > [1] https://lore.kernel.org/linux-nfs/0676ecb2bb708e6fc29dbbe6b44551d6a0d021dc.camel@xxxxxxxxxx/
> > > [2] https://lore.kernel.org/linux-nfs/9cfd5bc3cfc6abc2d3316b0387222e708d67f595.camel@xxxxxxxxxxxxxxx/
> > > [3] https://lore.kernel.org/linux-nfs/da9200f1bded9b8b078a7aef227fd6b92eb028fb.camel@xxxxxxxxxxxxxxx/
> > > [4] https://lore.kernel.org/linux-nfs/0A640C47-5F51-47E8-864D-E0E980F8B310@xxxxxxxxxx/
> > > [5] https://lore.kernel.org/linux-nfs/CA+QRt4tPqH87NVkoETLjxieGjZ_f7XxRj+xS3NVxcJ+b8AAKQg@xxxxxxxxxxxxxx/
> > > [6] https://lore.kernel.org/linux-nfs/20221103161637.1725471-1-dwysocha@xxxxxxxxxx/
> > > [7] https://lore.kernel.org/linux-nfs/4d60636f62df4f5c200666ed2d1a5f2414c18e1f.camel@xxxxxxxxxx/
> > > [8] https://lore.kernel.org/linux-nfs/20230216150701.3654894-1-dhowells@xxxxxxxxxx/T/#mf3807fa68fb6d495b87dde0d76b5237833a0cc81
> > > [9] https://lore.kernel.org/linux-kernel/217595.1662033775@xxxxxxxxxxxxxxxxxxxxxx/
> > >
> > > Dave Wysochanski (5):
> > >   NFS: Rename readpage_async_filler to nfs_read_add_folio
> > >   NFS: Configure support for netfs when NFS fscache is configured
> > >   NFS: Convert buffered read paths to use netfs when fscache is enabled
> > >   NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
> > >   NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
> > >
> > >  fs/nfs/Kconfig             |   1 +
> > >  fs/nfs/fscache.c           | 242 ++++++++++++++++++++++---------------
> > >  fs/nfs/fscache.h           | 131 ++++++++++++++------
> > >  fs/nfs/inode.c             |   2 +
> > >  fs/nfs/internal.h          |   9 ++
> > >  fs/nfs/iostat.h            |  17 ---
> > >  fs/nfs/nfstrace.h          |  91 --------------
> > >  fs/nfs/pagelist.c          |   4 +
> > >  fs/nfs/read.c              | 105 ++++++++--------
> > >  fs/nfs/super.c             |  11 --
> > >  include/linux/nfs_fs.h     |  25 ++--
> > >  include/linux/nfs_iostat.h |  12 --
> > >  include/linux/nfs_page.h   |   3 +
> > >  include/linux/nfs_xdr.h    |   3 +
> > >  14 files changed, 317 insertions(+), 339 deletions(-)
> > >
> > > --
> > > 2.31.1
> >
> > --
> > Linux-cachefs mailing list
> > Linux-cachefs@xxxxxxxxxx
> > https://listman.redhat.com/mailman/listinfo/linux-cachefs