Re: [PATCH v4 2/2] NFSD: add shrinker to reap courtesy clients on low memory condition

> On Sep 3, 2022, at 1:03 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
> 
> 
> On 9/2/22 6:26 PM, Chuck Lever III wrote:
>>> On Sep 2, 2022, at 3:34 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
>>> 
>>> On 9/2/22 10:58 AM, Chuck Lever III wrote:
>>>>>> Or, nfsd_courtesy_client_count() could return
>>>>>> nfsd_courtesy_client_count. nfsd_courtesy_client_scan() could
>>>>>> then look something like this:
>>>>>> 
>>>>>> 	if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
>>>>>> 		return SHRINK_STOP;
>>>>>> 
>>>>>> 	nfsd_get_client_reaplist(nn, reaplist, lt);
>>>>>> 	list_for_each_safe(pos, next, &reaplist) {
>>>>>> 		clp = list_entry(pos, struct nfs4_client, cl_lru);
>>>>>> 		trace_nfsd_clid_purged(&clp->cl_clientid);
>>>>>> 		list_del_init(&clp->cl_lru);
>>>>>> 		expire_client(clp);
>>>>>> 		count++;
>>>>>> 	}
>>>>>> 	return count;
>>>>> This does not work; we cannot expire clients in the context of
>>>>> the scan callback due to a deadlock problem.
>>>> Correct, we don't want to start shrinker laundromat activity if
>>>> the allocation request specified that it cannot wait. But are
>>>> you sure it doesn't work if sc_gfp_flags is set to GFP_KERNEL?
>>> It can deadlock due to a lock ordering problem:
>>> 
>>> ======================================================
>>> WARNING: possible circular locking dependency detected
>>> 5.19.0-rc2_sk+ #1 Not tainted
>>> ------------------------------------------------------
>>> lck/31847 is trying to acquire lock:
>>> ffff88811d268850 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}, at: btrfs_inode_lock+0x38/0x70
>>> but task is already holding lock:
>>> ffffffffb41848c0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x506/0x1db0
>>> which lock already depends on the new lock.
>>> the existing dependency chain (in reverse order) is:
>>> 
>>> -> #1 (fs_reclaim){+.+.}-{0:0}:
>>>       fs_reclaim_acquire+0xc0/0x100
>>>       __kmalloc+0x51/0x320
>>>       btrfs_buffered_write+0x2eb/0xd90
>>>       btrfs_do_write_iter+0x6bf/0x11c0
>>>       do_iter_readv_writev+0x2bb/0x5a0
>>>       do_iter_write+0x131/0x630
>>>       nfsd_vfs_write+0x4da/0x1900 [nfsd]
>>>       nfsd4_write+0x2ac/0x760 [nfsd]
>>>       nfsd4_proc_compound+0xce8/0x23e0 [nfsd]
>>>       nfsd_dispatch+0x4ed/0xc10 [nfsd]
>>>       svc_process_common+0xd3f/0x1b00 [sunrpc]
>>>       svc_process+0x361/0x4f0 [sunrpc]
>>>       nfsd+0x2d6/0x570 [nfsd]
>>>       kthread+0x2a1/0x340
>>>       ret_from_fork+0x22/0x30
>>> 
>>> -> #0 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}:
>>>       __lock_acquire+0x318d/0x7830
>>>       lock_acquire+0x1bb/0x500
>>>       down_write+0x82/0x130
>>>       btrfs_inode_lock+0x38/0x70
>>>       btrfs_sync_file+0x280/0x1010
>>>       nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
>>>       nfsd_file_put+0xd4/0x110 [nfsd]
>>>       release_all_access+0x13a/0x220 [nfsd]
>>>       nfs4_free_ol_stateid+0x40/0x90 [nfsd]
>>>       free_ol_stateid_reaplist+0x131/0x210 [nfsd]
>>>       release_openowner+0xf7/0x160 [nfsd]
>>>       __destroy_client+0x3cc/0x740 [nfsd]
>>>       nfsd_cc_lru_scan+0x271/0x410 [nfsd]
>>>       shrink_slab.constprop.0+0x31e/0x7d0
>>>       shrink_node+0x54b/0xe50
>>>       try_to_free_pages+0x394/0xba0
>>>       __alloc_pages_slowpath.constprop.0+0x5d2/0x1db0
>>>       __alloc_pages+0x4d6/0x580
>>>       __handle_mm_fault+0xc25/0x2810
>>>       handle_mm_fault+0x136/0x480
>>>       do_user_addr_fault+0x3d8/0xec0
>>>       exc_page_fault+0x5d/0xc0
>>>       asm_exc_page_fault+0x27/0x30
>> Is this deadlock possible only via the filecache close
>> paths?
> 
> I'm not sure, but there is another backtrace below that shows
> another problem.
> 
>>  I can't think of a reason the laundromat has to
>> wait for each file to be flushed and closed -- the
>> laundromat should be able to "put" these files and allow
>> the filecache LRU to flush and close in the background.
> 
> OK, what should we do here: enhance the laundromat to behave
> as you describe?

What I was suggesting was a longer-term strategy for improving the
laundromat. To scale well with the number of clients, it needs to
schedule client expiry and deletion without serializing.

(i.e., the laundromat itself can identify a set of clients to clean,
but it should then pass that list to other workers so that it can
run again as soon as it needs to -- and that also means it can use
more than one CPU at a time to do its work).
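
Purely as an illustration of that hand-off (an untested sketch; the
work struct, the work function, and the nfsd_reap_wq workqueue below
are invented names, not anything in your patch or in the tree today):

	/*
	 * Sketch: a work item that owns a private reaplist, so the
	 * laundromat can splice its expiry candidates onto it and
	 * re-arm immediately.  The workqueue would need to be
	 * WQ_MEM_RECLAIM if this can run on the memory reclaim path.
	 */
	struct nfsd_client_reap_work {
		struct work_struct	work;
		struct list_head	reaplist;
	};

	static void nfsd_reap_clients_workfn(struct work_struct *work)
	{
		struct nfsd_client_reap_work *rw =
			container_of(work, struct nfsd_client_reap_work, work);
		struct nfs4_client *clp, *next;

		list_for_each_entry_safe(clp, next, &rw->reaplist, cl_lru) {
			list_del_init(&clp->cl_lru);
			expire_client(clp);
		}
		kfree(rw);
	}

	/* ... then in the laundromat, instead of expiring inline: */
	rw = kzalloc(sizeof(*rw), GFP_KERNEL);
	if (rw) {
		INIT_WORK(&rw->work, nfsd_reap_clients_workfn);
		INIT_LIST_HEAD(&rw->reaplist);
		list_splice_init(&reaplist, &rw->reaplist);
		queue_work(nfsd_reap_wq, &rw->work);
	}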

In the meantime, it is still valuable to consider a shrinker built
around using ->nfsd_courtesy_client_count only. That was one of
the two alternatives I suggested before, so I feel we have some
consensus there. Please continue to look into that, and we can
continue poking at laundromat improvements later.
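
Something along these lines is what I have in mind -- just a sketch,
untested, and the names used below (nfsd_client_shrinker, the
nfsd_courtesy_client_count atomic, laundromat_work) are placeholders
rather than a claim about what your patch calls them:

	static unsigned long
	nfsd_courtesy_client_count(struct shrinker *shrink,
				   struct shrink_control *sc)
	{
		struct nfsd_net *nn = container_of(shrink,
				struct nfsd_net, nfsd_client_shrinker);

		return (unsigned long)atomic_read(&nn->nfsd_courtesy_client_count);
	}

	static unsigned long
	nfsd_courtesy_client_scan(struct shrinker *shrink,
				  struct shrink_control *sc)
	{
		struct nfsd_net *nn = container_of(shrink,
				struct nfsd_net, nfsd_client_shrinker);

		/*
		 * Never destroy clients here; just nudge the laundromat
		 * so expiry happens in its own context, avoiding the
		 * lock ordering and workqueue-flush problems shown above.
		 */
		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
		return SHRINK_STOP;
	}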


> Here is another stack trace of a problem with calling expire_client
> from the 'scan' callback:
> Sep  3 09:07:35 nfsvmf24 kernel: ------------[ cut here ]------------
> Sep  3 09:07:35 nfsvmf24 kernel: workqueue: PF_MEMALLOC task 3525(gmain) is flushing !WQ_MEM_RECLAIM nfsd4_callbacks:0x0
> Sep  3 09:07:35 nfsvmf24 kernel: WARNING: CPU: 0 PID: 3525 at kernel/workqueue.c:2625 check_flush_dependency+0x17a/0x350
> Sep  3 09:07:35 nfsvmf24 kernel: Modules linked in: rpcsec_gss_krb5 nfsd nfs_acl lockd grace auth_rpcgss sunrpc
> Sep  3 09:07:35 nfsvmf24 kernel: CPU: 0 PID: 3525 Comm: gmain Kdump: loaded Not tainted 6.0.0-rc3+ #1
> Sep  3 09:07:35 nfsvmf24 kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> Sep  3 09:07:35 nfsvmf24 kernel: RIP: 0010:check_flush_dependency+0x17a/0x350
> Sep  3 09:07:35 nfsvmf24 kernel: Code: 48 c1 ee 03 0f b6 04 06 84 c0 74 08 3c 03 0f 8e b8 01 00 00 41 8b b5 90 05 00 00 49 89 e8 48 c7 c7 c0 4d 06 9a e8 c6 a4 7f 02 <0f> 0b eb 4d 65 4c 8b 2c 25 c0 6e 02 00 4c 89 ef e8 71 65 01 00 49
> Sep  3 09:07:35 nfsvmf24 kernel: RSP: 0018:ffff88810c73f4e8 EFLAGS: 00010282
> Sep  3 09:07:35 nfsvmf24 kernel: RAX: 0000000000000000 RBX: ffff88811129a800 RCX: 0000000000000000
> Sep  3 09:07:35 nfsvmf24 kernel: RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffed10218e7e93
> Sep  3 09:07:35 nfsvmf24 kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed103afc6ed2
> Sep  3 09:07:35 nfsvmf24 kernel: R10: ffff8881d7e3768b R11: ffffed103afc6ed1 R12: 0000000000000000
> Sep  3 09:07:35 nfsvmf24 kernel: R13: ffff88810d14cac0 R14: 000000000000000d R15: 000000000000000c
> Sep  3 09:07:35 nfsvmf24 kernel: FS:  00007fa9a696c700(0000) GS:ffff8881d7e00000(0000) knlGS:0000000000000000
> Sep  3 09:07:35 nfsvmf24 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep  3 09:07:35 nfsvmf24 kernel: CR2: 00007fe922689c50 CR3: 000000010bdd8000 CR4: 00000000000406f0
> Sep  3 09:07:35 nfsvmf24 kernel: Call Trace:
> Sep  3 09:07:35 nfsvmf24 kernel: <TASK>
> Sep  3 09:07:35 nfsvmf24 kernel: __flush_workqueue+0x32c/0x1350
> Sep  3 09:07:35 nfsvmf24 kernel: ? lock_downgrade+0x680/0x680
> Sep  3 09:07:35 nfsvmf24 kernel: ? rcu_read_lock_held_common+0xe/0xa0
> Sep  3 09:07:35 nfsvmf24 kernel: ? check_flush_dependency+0x350/0x350
> Sep  3 09:07:35 nfsvmf24 kernel: ? lock_release+0x485/0x720
> Sep  3 09:07:35 nfsvmf24 kernel: ? __queue_work+0x3cc/0xe10
> Sep  3 09:07:35 nfsvmf24 kernel: ? trace_hardirqs_on+0x2d/0x110
> Sep  3 09:07:35 nfsvmf24 kernel: ? nfsd4_shutdown_callback+0xc5/0x3d0 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: nfsd4_shutdown_callback+0xc5/0x3d0 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: ? lock_release+0x485/0x720
> Sep  3 09:07:35 nfsvmf24 kernel: ? nfsd4_probe_callback_sync+0x20/0x20 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: ? lock_downgrade+0x680/0x680
> Sep  3 09:07:35 nfsvmf24 kernel: __destroy_client+0x5ec/0xa60 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: ? nfsd4_cb_recall_release+0x20/0x20 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: nfsd_courtesy_client_scan+0x39f/0x850 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: ? preempt_schedule_notrace+0x74/0xe0
> Sep  3 09:07:35 nfsvmf24 kernel: ? client_ctl_write+0x7c0/0x7c0 [nfsd]
> Sep  3 09:07:35 nfsvmf24 kernel: ? preempt_schedule_notrace_thunk+0x16/0x18
> Sep  3 09:07:35 nfsvmf24 kernel: shrink_slab.constprop.0+0x30b/0x7b0
> Sep  3 09:07:35 nfsvmf24 kernel: ? unregister_shrinker+0x270/0x270
> Sep  3 09:07:35 nfsvmf24 kernel: shrink_node+0x54b/0xe50
> Sep  3 09:07:35 nfsvmf24 kernel: try_to_free_pages+0x394/0xba0
> Sep  3 09:07:35 nfsvmf24 kernel: ? reclaim_pages+0x3b0/0x3b0
> Sep  3 09:07:35 nfsvmf24 kernel: __alloc_pages_slowpath.constprop.0+0x636/0x1f30
> Sep  3 09:07:35 nfsvmf24 kernel: ? warn_alloc+0x120/0x120
> Sep  3 09:07:35 nfsvmf24 kernel: ? __zone_watermark_ok+0x2d0/0x2d0
> Sep  3 09:07:35 nfsvmf24 kernel: __alloc_pages+0x4d6/0x580
> Sep  3 09:07:35 nfsvmf24 kernel: ? __alloc_pages_slowpath.constprop.0+0x1f30/0x1f30
> Sep  3 09:07:35 nfsvmf24 kernel: ? find_held_lock+0x2c/0x110
> Sep  3 09:07:35 nfsvmf24 kernel: ? lockdep_hardirqs_on_prepare+0x17b/0x410
> Sep  3 09:07:35 nfsvmf24 kernel: ____cache_alloc.part.0+0x586/0x760
> Sep  3 09:07:35 nfsvmf24 kernel: ? kmem_cache_alloc+0x193/0x290
> Sep  3 09:07:35 nfsvmf24 kernel: kmem_cache_alloc+0x22f/0x290
> Sep  3 09:07:35 nfsvmf24 kernel: getname_flags+0xbe/0x4e0
> Sep  3 09:07:35 nfsvmf24 kernel: user_path_at_empty+0x23/0x50
> Sep  3 09:07:35 nfsvmf24 kernel: inotify_find_inode+0x28/0x120
> Sep  3 09:07:35 nfsvmf24 kernel: ? __fget_light+0x19b/0x210
> Sep  3 09:07:35 nfsvmf24 kernel: __x64_sys_inotify_add_watch+0x160/0x290
> Sep  3 09:07:35 nfsvmf24 kernel: ? __ia32_sys_inotify_rm_watch+0x170/0x170
> Sep  3 09:07:35 nfsvmf24 kernel: do_syscall_64+0x3d/0x90
> Sep  3 09:07:35 nfsvmf24 kernel: entry_SYSCALL_64_after_hwframe+0x46/0xb0
> Sep  3 09:07:35 nfsvmf24 kernel: RIP: 0033:0x7fa9b1a77f37
> Sep  3 09:07:35 nfsvmf24 kernel: Code: f0 ff ff 73 01 c3 48 8b 0d 36 7f 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 fe 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 7f 2c 00 f7 d8 64 89 01 48
> Sep  3 09:07:35 nfsvmf24 kernel: RSP: 002b:00007fa9a696bb28 EFLAGS: 00000202 ORIG_RAX: 00000000000000fe
> Sep  3 09:07:35 nfsvmf24 kernel: RAX: ffffffffffffffda RBX: 00005638abf5adc0 RCX: 00007fa9b1a77f37
> Sep  3 09:07:35 nfsvmf24 kernel: RDX: 0000000001002fce RSI: 00005638abf58080 RDI: 000000000000000d
> Sep  3 09:07:35 nfsvmf24 kernel: RBP: 00007fa9a696bb54 R08: 00007ffe7f7f1080 R09: 0000000000000000
> Sep  3 09:07:35 nfsvmf24 kernel: R10: 00000000002cf948 R11: 0000000000000202 R12: 0000000000000000
> Sep  3 09:07:35 nfsvmf24 kernel: R13: 00007fa9b39710e0 R14: 00005638abf5adf0 R15: 00007fa9b317c940
> 
> Another behavior of the shrinker is that the 'count' and 'scan'
> callbacks are not 1-to-1; 'scan' is not called after every 'count'
> callback. I'm not sure what criteria the shrinker uses to decide
> when to call 'scan', but it happens very late, sometimes only after
> dozens of 'count' callbacks. If we rely on the 'scan' callback,
> I don't think we can react in time to release the courtesy clients.
> 
> -Dai
> 
>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever