Re: [PATCH v4 2/2] NFSD: add shrinker to reap courtesy clients on low memory condition

On 9/3/22 10:29 AM, Chuck Lever III wrote:

On Sep 3, 2022, at 1:03 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:


On 9/2/22 6:26 PM, Chuck Lever III wrote:
On Sep 2, 2022, at 3:34 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:

On 9/2/22 10:58 AM, Chuck Lever III wrote:
Or, nfsd_courtesy_client_count() could return
nfsd_courtesy_client_count. nfsd_courtesy_client_scan() could
then look something like this:

	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
		return SHRINK_STOP;

	nfsd_get_client_reaplist(nn, reaplist, lt);
	list_for_each_safe(pos, next, &reaplist) {
		clp = list_entry(pos, struct nfs4_client, cl_lru);
		trace_nfsd_clid_purged(&clp->cl_clientid);
		list_del_init(&clp->cl_lru);
		expire_client(clp);
		count++;
	}
	return count;
This does not work; we cannot expire clients in the context of the
scan callback due to a deadlock problem.
Correct, we don't want to start shrinker laundromat activity if
the allocation request specified that it cannot wait. But are
you sure it doesn't work if sc->gfp_mask is set to GFP_KERNEL?
It can deadlock due to a lock ordering problem:

======================================================
WARNING: possible circular locking dependency detected
5.19.0-rc2_sk+ #1 Not tainted
------------------------------------------------------
lck/31847 is trying to acquire lock:
ffff88811d268850 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}, at: btrfs_inode_lock+0x38/0x70
but task is already holding lock:
ffffffffb41848c0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x506/0x1db0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire+0xc0/0x100
       __kmalloc+0x51/0x320
       btrfs_buffered_write+0x2eb/0xd90
       btrfs_do_write_iter+0x6bf/0x11c0
       do_iter_readv_writev+0x2bb/0x5a0
       do_iter_write+0x131/0x630
       nfsd_vfs_write+0x4da/0x1900 [nfsd]
       nfsd4_write+0x2ac/0x760 [nfsd]
       nfsd4_proc_compound+0xce8/0x23e0 [nfsd]
       nfsd_dispatch+0x4ed/0xc10 [nfsd]
       svc_process_common+0xd3f/0x1b00 [sunrpc]
       svc_process+0x361/0x4f0 [sunrpc]
       nfsd+0x2d6/0x570 [nfsd]
       kthread+0x2a1/0x340
       ret_from_fork+0x22/0x30

-> #0 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}:
       __lock_acquire+0x318d/0x7830
       lock_acquire+0x1bb/0x500
       down_write+0x82/0x130
       btrfs_inode_lock+0x38/0x70
       btrfs_sync_file+0x280/0x1010
       nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
       nfsd_file_put+0xd4/0x110 [nfsd]
       release_all_access+0x13a/0x220 [nfsd]
       nfs4_free_ol_stateid+0x40/0x90 [nfsd]
       free_ol_stateid_reaplist+0x131/0x210 [nfsd]
       release_openowner+0xf7/0x160 [nfsd]
       __destroy_client+0x3cc/0x740 [nfsd]
       nfsd_cc_lru_scan+0x271/0x410 [nfsd]
       shrink_slab.constprop.0+0x31e/0x7d0
       shrink_node+0x54b/0xe50
       try_to_free_pages+0x394/0xba0
       __alloc_pages_slowpath.constprop.0+0x5d2/0x1db0
       __alloc_pages+0x4d6/0x580
       __handle_mm_fault+0xc25/0x2810
       handle_mm_fault+0x136/0x480
       do_user_addr_fault+0x3d8/0xec0
       exc_page_fault+0x5d/0xc0
       asm_exc_page_fault+0x27/0x30
Is this deadlock possible only via the filecache close
paths?
I'm not sure, but there is another backtrace below that shows
a different problem.

  I can't think of a reason the laundromat has to
wait for each file to be flushed and closed -- the
laundromat should be able to "put" these files and allow
the filecache LRU to flush and close in the background.
OK, so what should we do here? Enhance the laundromat to behave
as you describe?
What I was suggesting was a longer term strategy for improving the
laundromat. In order to scale well in the number of clients, it
needs to schedule client expiry and deletion without serializing.

(i.e., the laundromat itself can identify a set of clients to clean,
but then it should pass that list to other workers so it can run
again as soon as it needs to -- and that also means it can use more
than just one CPU at a time to do its work).
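
Something along those lines, just as a sketch to make the hand-off
concrete (the cl_destroy_work member and the nfsd_client_destroy_wq
workqueue below are hypothetical; nothing like them exists in the
current code):

	/*
	 * Sketch only: the laundromat hands each expired client to a
	 * worker so it never blocks in __destroy_client() itself and
	 * several clients can be torn down in parallel.
	 */
	static void nfsd_destroy_client_work(struct work_struct *work)
	{
		struct nfs4_client *clp = container_of(work, struct nfs4_client,
						       cl_destroy_work);

		expire_client(clp);
	}

	/* in nfs4_laundromat(), after building the reaplist: */
	list_for_each_safe(pos, next, &reaplist) {
		clp = list_entry(pos, struct nfs4_client, cl_lru);
		list_del_init(&clp->cl_lru);
		INIT_WORK(&clp->cl_destroy_work, nfsd_destroy_client_work);
		queue_work(nfsd_client_destroy_wq, &clp->cl_destroy_work);
	}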

I see. Currently, on my lowly 1-CPU VM, it takes about 35 seconds to
destroy 128 clients, each with only a few states (generated by pynfs's
CID5 test). We can improve on this.


In the meantime, it is still valuable to consider a shrinker built
around using ->nfsd_courtesy_client_count only. That was one of
the two alternatives I suggested before, so I feel we have some
consensus there. Please continue to look into that, and we can
continue poking at laundromat improvements later.
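
Roughly, that count-only approach could look like this (sketch only;
the nfsd_courtesy_client_count field and nfsd_client_shrinker member
names are assumptions, and the scan callback only nudges the
laundromat instead of expiring anything in shrinker context):

	static unsigned long
	nfsd_courtesy_client_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		struct nfsd_net *nn = container_of(shrink, struct nfsd_net,
						   nfsd_client_shrinker);

		return atomic_read(&nn->nfsd_courtesy_client_count);
	}

	static unsigned long
	nfsd_courtesy_client_scan(struct shrinker *shrink, struct shrink_control *sc)
	{
		struct nfsd_net *nn = container_of(shrink, struct nfsd_net,
						   nfsd_client_shrinker);

		if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
			return SHRINK_STOP;

		/* Let the laundromat do the actual reaping. */
		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
		return SHRINK_STOP;
	}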

Okay,

Thanks,
-Dai



Here is another stack trace of the problem with calling expire_client
from the 'scan' callback:
Sep  3 09:07:35 nfsvmf24 kernel: ------------[ cut here ]------------
Sep  3 09:07:35 nfsvmf24 kernel: workqueue: PF_MEMALLOC task 3525(gmain) is flushing !WQ_MEM_RECLAIM nfsd4_callbacks:0x0
Sep  3 09:07:35 nfsvmf24 kernel: WARNING: CPU: 0 PID: 3525 at kernel/workqueue.c:2625 check_flush_dependency+0x17a/0x350
Sep  3 09:07:35 nfsvmf24 kernel: Modules linked in: rpcsec_gss_krb5 nfsd nfs_acl lockd grace auth_rpcgss sunrpc
Sep  3 09:07:35 nfsvmf24 kernel: CPU: 0 PID: 3525 Comm: gmain Kdump: loaded Not tainted 6.0.0-rc3+ #1
Sep  3 09:07:35 nfsvmf24 kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Sep  3 09:07:35 nfsvmf24 kernel: RIP: 0010:check_flush_dependency+0x17a/0x350
Sep  3 09:07:35 nfsvmf24 kernel: Code: 48 c1 ee 03 0f b6 04 06 84 c0 74 08 3c 03 0f 8e b8 01 00 00 41 8b b5 90 05 00 00 49 89 e8 48 c7 c7 c0 4d 06 9a e8 c6 a4 7f 02 <0f> 0b eb 4d 65 4c 8b 2c 25 c0 6e 02 00 4c 89 ef e8 71 65 01 00 49
Sep  3 09:07:35 nfsvmf24 kernel: RSP: 0018:ffff88810c73f4e8 EFLAGS: 00010282
Sep  3 09:07:35 nfsvmf24 kernel: RAX: 0000000000000000 RBX: ffff88811129a800 RCX: 0000000000000000
Sep  3 09:07:35 nfsvmf24 kernel: RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffed10218e7e93
Sep  3 09:07:35 nfsvmf24 kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed103afc6ed2
Sep  3 09:07:35 nfsvmf24 kernel: R10: ffff8881d7e3768b R11: ffffed103afc6ed1 R12: 0000000000000000
Sep  3 09:07:35 nfsvmf24 kernel: R13: ffff88810d14cac0 R14: 000000000000000d R15: 000000000000000c
Sep  3 09:07:35 nfsvmf24 kernel: FS:  00007fa9a696c700(0000) GS:ffff8881d7e00000(0000) knlGS:0000000000000000
Sep  3 09:07:35 nfsvmf24 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  3 09:07:35 nfsvmf24 kernel: CR2: 00007fe922689c50 CR3: 000000010bdd8000 CR4: 00000000000406f0
Sep  3 09:07:35 nfsvmf24 kernel: Call Trace:
Sep  3 09:07:35 nfsvmf24 kernel: <TASK>
Sep  3 09:07:35 nfsvmf24 kernel: __flush_workqueue+0x32c/0x1350
Sep  3 09:07:35 nfsvmf24 kernel: ? lock_downgrade+0x680/0x680
Sep  3 09:07:35 nfsvmf24 kernel: ? rcu_read_lock_held_common+0xe/0xa0
Sep  3 09:07:35 nfsvmf24 kernel: ? check_flush_dependency+0x350/0x350
Sep  3 09:07:35 nfsvmf24 kernel: ? lock_release+0x485/0x720
Sep  3 09:07:35 nfsvmf24 kernel: ? __queue_work+0x3cc/0xe10
Sep  3 09:07:35 nfsvmf24 kernel: ? trace_hardirqs_on+0x2d/0x110
Sep  3 09:07:35 nfsvmf24 kernel: ? nfsd4_shutdown_callback+0xc5/0x3d0 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: nfsd4_shutdown_callback+0xc5/0x3d0 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: ? lock_release+0x485/0x720
Sep  3 09:07:35 nfsvmf24 kernel: ? nfsd4_probe_callback_sync+0x20/0x20 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: ? lock_downgrade+0x680/0x680
Sep  3 09:07:35 nfsvmf24 kernel: __destroy_client+0x5ec/0xa60 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: ? nfsd4_cb_recall_release+0x20/0x20 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: nfsd_courtesy_client_scan+0x39f/0x850 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: ? preempt_schedule_notrace+0x74/0xe0
Sep  3 09:07:35 nfsvmf24 kernel: ? client_ctl_write+0x7c0/0x7c0 [nfsd]
Sep  3 09:07:35 nfsvmf24 kernel: ? preempt_schedule_notrace_thunk+0x16/0x18
Sep  3 09:07:35 nfsvmf24 kernel: shrink_slab.constprop.0+0x30b/0x7b0
Sep  3 09:07:35 nfsvmf24 kernel: ? unregister_shrinker+0x270/0x270
Sep  3 09:07:35 nfsvmf24 kernel: shrink_node+0x54b/0xe50
Sep  3 09:07:35 nfsvmf24 kernel: try_to_free_pages+0x394/0xba0
Sep  3 09:07:35 nfsvmf24 kernel: ? reclaim_pages+0x3b0/0x3b0
Sep  3 09:07:35 nfsvmf24 kernel: __alloc_pages_slowpath.constprop.0+0x636/0x1f30
Sep  3 09:07:35 nfsvmf24 kernel: ? warn_alloc+0x120/0x120
Sep  3 09:07:35 nfsvmf24 kernel: ? __zone_watermark_ok+0x2d0/0x2d0
Sep  3 09:07:35 nfsvmf24 kernel: __alloc_pages+0x4d6/0x580
Sep  3 09:07:35 nfsvmf24 kernel: ? __alloc_pages_slowpath.constprop.0+0x1f30/0x1f30
Sep  3 09:07:35 nfsvmf24 kernel: ? find_held_lock+0x2c/0x110
Sep  3 09:07:35 nfsvmf24 kernel: ? lockdep_hardirqs_on_prepare+0x17b/0x410
Sep  3 09:07:35 nfsvmf24 kernel: ____cache_alloc.part.0+0x586/0x760
Sep  3 09:07:35 nfsvmf24 kernel: ? kmem_cache_alloc+0x193/0x290
Sep  3 09:07:35 nfsvmf24 kernel: kmem_cache_alloc+0x22f/0x290
Sep  3 09:07:35 nfsvmf24 kernel: getname_flags+0xbe/0x4e0
Sep  3 09:07:35 nfsvmf24 kernel: user_path_at_empty+0x23/0x50
Sep  3 09:07:35 nfsvmf24 kernel: inotify_find_inode+0x28/0x120
Sep  3 09:07:35 nfsvmf24 kernel: ? __fget_light+0x19b/0x210
Sep  3 09:07:35 nfsvmf24 kernel: __x64_sys_inotify_add_watch+0x160/0x290
Sep  3 09:07:35 nfsvmf24 kernel: ? __ia32_sys_inotify_rm_watch+0x170/0x170
Sep  3 09:07:35 nfsvmf24 kernel: do_syscall_64+0x3d/0x90
Sep  3 09:07:35 nfsvmf24 kernel: entry_SYSCALL_64_after_hwframe+0x46/0xb0
Sep  3 09:07:35 nfsvmf24 kernel: RIP: 0033:0x7fa9b1a77f37
Sep  3 09:07:35 nfsvmf24 kernel: Code: f0 ff ff 73 01 c3 48 8b 0d 36 7f 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 fe 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 7f 2c 00 f7 d8 64 89 01 48
Sep  3 09:07:35 nfsvmf24 kernel: RSP: 002b:00007fa9a696bb28 EFLAGS: 00000202 ORIG_RAX: 00000000000000fe
Sep  3 09:07:35 nfsvmf24 kernel: RAX: ffffffffffffffda RBX: 00005638abf5adc0 RCX: 00007fa9b1a77f37
Sep  3 09:07:35 nfsvmf24 kernel: RDX: 0000000001002fce RSI: 00005638abf58080 RDI: 000000000000000d
Sep  3 09:07:35 nfsvmf24 kernel: RBP: 00007fa9a696bb54 R08: 00007ffe7f7f1080 R09: 0000000000000000
Sep  3 09:07:35 nfsvmf24 kernel: R10: 00000000002cf948 R11: 0000000000000202 R12: 0000000000000000
Sep  3 09:07:35 nfsvmf24 kernel: R13: 00007fa9b39710e0 R14: 00005638abf5adf0 R15: 00007fa9b317c940

Another behavior of the shrinker worth noting is that the 'count' and
'scan' callbacks are not 1-to-1; 'scan' is not called after every
'count' callback. I'm not sure what criteria the shrinker core uses to
decide when to invoke 'scan', but it happens very late, sometimes only
after dozens of 'count' callbacks. If we rely on the 'scan' callback,
I don't think we can react in time to release the courtesy clients.
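
One possible way around that, sketched here only as an idea (same
assumed field names as in the earlier sketch), would be to react from
the 'count' callback itself, since 'count' is invoked on every shrink
pass:

	static unsigned long
	nfsd_courtesy_client_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		struct nfsd_net *nn = container_of(shrink, struct nfsd_net,
						   nfsd_client_shrinker);
		unsigned long count;

		count = atomic_read(&nn->nfsd_courtesy_client_count);
		if (count)
			/* kick the laundromat without waiting for 'scan' */
			mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
		return count;
	}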

-Dai


--
Chuck Lever





