Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking

Glauber Costa <glommer@xxxxxxxxxxxxx> · Fri, 7 Jun 2013 10:10:16 +0400

On 06/07/2013 02:23 AM, Andrew Morton wrote:
> On Thu, 6 Jun 2013 16:09:16 +0400 Glauber Costa <glommer@xxxxxxxxxxxxx> wrote:
> 
>>>>> then waiting for it to complete is equivalent to calling it directly.
>>>>>
>>>> Not in this case. We are in wait-capable context (we check for this
>>>> right before we reach this), but we are not in fs capable context.
>>>>
>>>> So the reason we do this - which I tried to cover in the changelog, is
>>>> to escape from the GFP_FS limitation that our call chain has, not the
>>>> wait limitation.
>>>
>>> But that's equivalent to calling the code directly.  Look:
>>>
>>> some_fs_function()
>>> {
>>> 	lock(some-fs-lock);
>>> 	...
>>> }
>>>
>>> some_other_fs_function()
>>> {
>>> 	lock(some-fs-lock);
>>> 	alloc_pages(GFP_NOFS);
>>> 	->...
>>> 	  ->schedule_work(some_fs_function);
>>> 	    flush_scheduled_work();
>>>
>>> that flush_scheduled_work() won't complete until some_fs_function() has
>>> completed.  But some_fs_function() won't complete, because we're
>>> holding some-fs-lock.
>>>
>>
>> In my experience during this series, most of the kmem allocation here
> 
> "most"?
> 

Yes, dentrys, inodes, buffer_heads. They constitute the bulk of kmem
allocations. (Please note that I am talking about kmem allocations only)

>> will be filesystem related. This means that we will allocate that with
>> GFP_FS on.
> 
> eh?  filesystems do a tremendous amount of GFP_NOFS allocation.  
> 
> akpm3:/usr/src/25> grep GFP_NOFS fs/*/*.c|wc -l
> 898
> 

My bad, I thought one thing, wrote another. I meant GFP_FS off.

>> If we don't do anything like that, reclaim is almost
>> pointless since it will never free anything (only once here and there
>> when the allocation is not from fs).
> 
> It depends what you mean by "reclaim".  There are a lot of things which
> vmscan can do for a GFP_NOFS allocation.  Scraping clean pagecache,
> clean swapcache, well-behaved (ahem) shrinkable caches.

I mean exclusively shrinkable caches. This code is executed only when we
reach the kernel memory limit. Therefore, we know that depleting user
pages won't help. And now that we have targeted shrinking, we shrink
just the caches.

> 
>> It tend to work just fine like this. It may very well be because fs
>> people just mark everything as NOFS out of safety and we aren't *really*
>> holding any locks in common situations, but it will blow in our faces in
>> a subtle way (which none of us want).
>>
>> That said, suggestions are more than welcome.
> 
> At a minimum we should remove all the schedule_work() stuff, call the
> callback function synchronously and add
> 
> 	/* This code is full of deadlocks */
> 
> 
> Sorry, this part of the patchset is busted and needs a fundamental
> rethink.
> 
Okay, I will go back to it soon.

I am suspecting we may have no choice but to just let the shrinkers run
asynchronously, which will fail this allocation but at least save us up
to the next.

Dave Shrinkers, would you be so kind to look at this problem from the
top of your mighty filesystem knowledge and see if you have a better
suggestion ?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html