On 12/10/19 4:20 AM, Dave Chinner wrote:
> On Fri, Dec 06, 2019 at 09:11:25AM -0800, Shakeel Butt wrote:
>> On Thu, Dec 5, 2019 at 6:10 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> If a shrinker is blocking for a long time, then we need to
>>> work to fix the shrinker implementation because blocking is a much
>>> bigger problem than just register/unregister.
>>>
>>
>> Yes, we should be fixing the implementations of all shrinkers and yes
>> it is a bigger issue, but we can also fix the register/unregister
>> isolation issue in parallel. Fixing all shrinkers would be a tedious
>> and long task, and we should not block fixing the isolation issue on it.
>
> "fixing all shrinkers" is a bit of hyperbole - you've identified
> only one instance where blocking is causing you problems. Indeed,
> most shrinkers are already non-blocking and won't cause you any
> problems at all.
>
>>> IOWs, we already know that cycling a global rwsem on every
>>> individual shrinker invocation is going to cause noticeable
>>> scalability problems. Hence I don't think that this sort of "cycle
>>> the global rwsem faster to reduce [un]register latency" solution is
>>> going to fly because of the runtime performance regressions it will
>>> introduce....
>>>
>>
>> I agree with your scalability concern (though others would argue to
>> first demonstrate the issue before adding more sophisticated scalable
>> code).
>
> Look at the git history. We *know* this is a problem, so anyone
> arguing that we have to prove it can go take a long walk off a short
> plank....
>
>> Most memory reclaim code is written without performance or
>> scalability concerns; maybe we should switch our thinking.
>
> I think there are a lot of core mm and other developers that would
> disagree with you there. With respect to shrinkers, we've been
> directly concerned about the performance and scalability of the
> individual instances as well as the infrastructure for at least the
> last decade....
>
> Cheers,
>
> Dave.

Thanks a lot for your replies. I now see that the core of the problem is
the NFS hang; before that I was unsure whether it is acceptable to have
such a hang in do_shrink_slab() at all.

I have a possibly bad idea for how my patch could still work: what if we
use the unlock/refcount approach only for NFS shrinkers? It would still
impose a performance penalty for those who have many NFS mounts, but for
those with only a few the penalty would be smaller, and it would be a
small isolation improvement for NFS users. A rough sketch of what I mean
is below.
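Something along these lines (completely untested, only to illustrate the
idea; the SHRINKER_NONBLOCK_UNREG flag and the refcnt/done fields in
struct shrinker are made up here and do not exist in the current kernel):

	/* New fields in struct shrinker, set up in register_shrinker():
	 * atomic_set(&shrinker->refcnt, 1); init_completion(&shrinker->done);
	 */
	atomic_t refcnt;		/* pins the shrinker while it runs */
	struct completion done;		/* signalled when the last ref drops */

	/* In shrink_slab(), only for shrinkers that set SHRINKER_NONBLOCK_UNREG
	 * (e.g. the NFS ones). Called with shrinker_rwsem held for read, so the
	 * shrinker is still on shrinker_list and taking a reference is safe.
	 * The caller must restart its list walk afterwards, because the list
	 * may have changed while the lock was dropped.
	 */
	static unsigned long shrink_one_unlocked(struct shrinker *shrinker,
						 struct shrink_control *sc,
						 int priority)
	{
		unsigned long freed;

		atomic_inc(&shrinker->refcnt);
		up_read(&shrinker_rwsem);

		freed = do_shrink_slab(sc, shrinker, priority);

		if (atomic_dec_and_test(&shrinker->refcnt))
			complete(&shrinker->done);

		down_read(&shrinker_rwsem);
		return freed;
	}

	/* In unregister_shrinker(), after the shrinker has been removed from
	 * shrinker_list under the write lock and shrinker_rwsem was released:
	 * drop the initial reference and, if a call is still in flight, wait
	 * for it to finish instead of blocking everyone on shrinker_rwsem.
	 */
	if (!atomic_dec_and_test(&shrinker->refcnt))
		wait_for_completion(&shrinker->done);

That way only the flagged shrinkers pay for the extra atomics and the
restart of the list walk; all other shrinkers keep the current behaviour.

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.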