On Mon, Dec 2, 2019 at 8:37 AM Andrey Ryabinin <aryabinin@xxxxxxxxxxxxx> wrote: > > > On 11/30/19 12:45 AM, Pavel Tikhomirov wrote: > > We have a problem that shrinker_rwsem can be held for a long time for > > read in shrink_slab, at the same time any process which is trying to > > manage shrinkers hangs. > > > > The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list. > > It tries to shrink something on nfs (hard) but nfs server is dead at > > these moment already and rpc will never succeed. Generally any shrinker > > can take significant time to do_shrink_slab, so it's a bad idea to hold > > the list lock here. > > > > We have a similar problem in shrink_slab_memcg, except that we are > > traversing shrinker_map+shrinker_idr there. > > > > The idea of the patch is to inc a refcount to the chosen shrinker so it > > won't disappear and release shrinker_rwsem while we are in > > do_shrink_slab, after that we will reacquire shrinker_rwsem, dec > > the refcount and continue the traversal. > > > > We also need a wait_queue so that unregister_shrinker can wait for the > > refcnt to become zero. Only after these we can safely remove the > > shrinker from list and idr, and free the shrinker. > > > > I've reproduced the nfs hang in do_shrink_slab with the patch applied on > > ms kernel, all other mount/unmount pass fine without any hang. > > > > Here is a reproduction on kernel without patch: > > > > 1) Setup nfs on server node with some files in it (e.g. 200) > > > > [server]# cat /etc/exports > > /vz/nfs2 *(ro,no_root_squash,no_subtree_check,async) > > > > 2) Hard mount it on client node > > > > [client]# mount -ohard 10.94.3.40:/vz/nfs2 /mnt > > > > 3) Open some (e.g. 200) files on the mount > > > > [client]# for i in $(find /mnt/ -type f | head -n 200); \ > > do setsid sleep 1000 &>/dev/null <$i & done > > > > 4) Kill all openers > > > > [client]# killall sleep -9 > > > > 5) Put your network cable out on client node > > > > 6) Drop caches on the client, it will hang on nfs while holding > > shrinker_rwsem lock for read > > > > [client]# echo 3 > /proc/sys/vm/drop_caches > > > > crash> bt ... > > PID: 18739 TASK: ... CPU: 3 COMMAND: "bash" > > #0 [...] __schedule at ... > > #1 [...] schedule at ... > > #2 [...] rpc_wait_bit_killable at ... [sunrpc] > > #3 [...] __wait_on_bit at ... > > #4 [...] out_of_line_wait_on_bit at ... > > #5 [...] _nfs4_proc_delegreturn at ... [nfsv4] > > #6 [...] nfs4_proc_delegreturn at ... [nfsv4] > > #7 [...] nfs_do_return_delegation at ... [nfsv4] > > #8 [...] nfs4_evict_inode at ... [nfsv4] > > #9 [...] evict at ... > > #10 [...] dispose_list at ... > > #11 [...] prune_icache_sb at ... > > #12 [...] super_cache_scan at ... > > #13 [...] do_shrink_slab at ... > > #14 [...] shrink_slab at ... > > #15 [...] drop_slab_node at ... > > #16 [...] drop_slab at ... > > #17 [...] drop_caches_sysctl_handler at ... > > #18 [...] proc_sys_call_handler at ... > > #19 [...] vfs_write at ... > > #20 [...] ksys_write at ... > > #21 [...] do_syscall_64 at ... > > #22 [...] entry_SYSCALL_64_after_hwframe at ... > > > > 7) All other mount/umount activity now hangs with no luck to take > > shrinker_rwsem for write. > > > > [client]# mount -t tmpfs tmpfs /tmp > > > > crash> bt ... > > PID: 5464 TASK: ... CPU: 3 COMMAND: "mount" > > #0 [...] __schedule at ... > > #1 [...] schedule at ... > > #2 [...] rwsem_down_write_slowpath at ... > > #3 [...] prealloc_shrinker at ... > > #4 [...] alloc_super at ... > > #5 [...] sget at ... > > #6 [...] mount_nodev at ... > > #7 [...] legacy_get_tree at ... > > #8 [...] vfs_get_tree at ... > > #9 [...] do_mount at ... > > #10 [...] ksys_mount at ... > > #11 [...] __x64_sys_mount at ... > > #12 [...] do_syscall_64 at ... > > #13 [...] entry_SYSCALL_64_after_hwframe at ... > > > > > I don't think this patch solves the problem, it only fixes one minor symptom of it. > The actual problem here the reclaim hang in the nfs. > It means that any process, including kswapd, may go into nfs inode reclaim and stuck there. > > Even mount() itself has GFP_KERNEL allocations in its path, so it still might stuck there even with your patch. > > I think this should be handled on nfs/vfs level by making inode eviction during reclaim more asynchronous. Though I agree that we should be fixing shrinkers to not get stuck (and be more async), I still think the problem this patch is solving is worth fixing. On machines running multiple workloads, one job stuck in slab shrinker and blocking all other unrelated jobs wanting shrinker_rwsem, breaks the isolation and causes DoS. Shakeel