On 12/4/19 11:35 AM, Michal Hocko wrote:
> On Sat 30-11-19 00:45:41, Pavel Tikhomirov wrote:
>> We have a problem that shrinker_rwsem can be held for a long time for
>> read in shrink_slab, and at the same time any process trying to
>> manage shrinkers hangs.
>>
>> The shrinker_rwsem is taken in shrink_slab while traversing
>> shrinker_list. It tries to shrink something on nfs (hard), but the
>> nfs server is already dead at that moment and the rpc will never
>> succeed. Generally any shrinker can take significant time in
>> do_shrink_slab, so it's a bad idea to hold the list lock here.
>
> Yes, this is a known problem and people have already tried to address it
> in the past. Have you checked previous attempts? SRCU based one
> http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain
> but I believe there were others (I only had this one in my notes).
> Please make sure to Cc Dave Chinner when posting a next version because
> he had some concerns about the change of the behavior.

The approach of the patch you are referencing is quite different: it
still holds the global srcu_read_lock(&srcu) while diving into
do_shrink_slab, so a hanging nfs will still block all
[un]register_shrinker.

>> We have a similar problem in shrink_slab_memcg, except that we are
>> traversing shrinker_map+shrinker_idr there.
>>
>> The idea of the patch is to take a reference to the chosen shrinker
>> so it won't disappear, and release shrinker_rwsem while we are in
>> do_shrink_slab; after that we reacquire shrinker_rwsem, drop the
>> reference and continue the traversal.
>
> The reference count part makes sense to me. RCU role needs a better
> explanation.

I have two uses of rcu in the patch: the first protects shrinker_map,
the same as before in memcg_set_shrinker_bit; the second protects the
shrinker struct in put_shrinker from being freed, because
unregister_shrinker can see refcnt==0 without actually going to
schedule().
> Also do you have any reason to not use completion for the final step?
> Opencoding essentially the same concept sounds a bit awkward to me.

Thanks for a good hint; at first glance we can rework the wait_event
part to use wait_for_completion.

>> We also need a wait_queue so that unregister_shrinker can wait for
>> the refcnt to become zero. Only after that can we safely remove the
>> shrinker from the list and idr, and free the shrinker.
> [...]
>> crash> bt ...
>> PID: 18739 TASK: ... CPU: 3 COMMAND: "bash"
>> #0 [...] __schedule at ...
>> #1 [...] schedule at ...
>> #2 [...] rpc_wait_bit_killable at ... [sunrpc]
>> #3 [...] __wait_on_bit at ...
>> #4 [...] out_of_line_wait_on_bit at ...
>> #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
>> #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
>> #7 [...] nfs_do_return_delegation at ... [nfsv4]
>> #8 [...] nfs4_evict_inode at ... [nfsv4]
>> #9 [...] evict at ...
>> #10 [...] dispose_list at ...
>> #11 [...] prune_icache_sb at ...
>> #12 [...] super_cache_scan at ...
>> #13 [...] do_shrink_slab at ...
>
> Are NFS people aware of this? Because this is simply not acceptable
> behavior. Memory reclaim cannot be blocked indefinitely or for a long
> time. There must be a way to simply give up if the underlying inode
> cannot be reclaimed.

Sorry that I didn't cc the nfs people from the beginning.

> I still have to think about the proposed solution. It sounds a bit
> over complicated to me.

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.