Re: [PATCH v10 29/35] memcg: per-memcg kmem shrinking

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Thu, 6 Jun 2013 15:23:15 -0700

On Thu, 6 Jun 2013 16:09:16 +0400 Glauber Costa <glommer@xxxxxxxxxxxxx> wrote:

> >>> then waiting for it to complete is equivalent to calling it directly.
> >>>
> >> Not in this case. We are in wait-capable context (we check for this
> >> right before we reach this), but we are not in fs capable context.
> >>
> >> So the reason we do this, which I tried to cover in the changelog, is
> >> to escape from the GFP_FS limitation that our call chain has, not the
> >> wait limitation.
> > 
> > But that's equivalent to calling the code directly.  Look:
> > 
> > some_fs_function()
> > {
> > 	lock(some-fs-lock);
> > 	...
> > }
> > 
> > some_other_fs_function()
> > {
> > 	lock(some-fs-lock);
> > 	alloc_pages(GFP_NOFS);
> > 	->...
> > 	  ->schedule_work(some_fs_function);
> > 	    flush_scheduled_work();
> > }
> > 
> > that flush_scheduled_work() won't complete until some_fs_function() has
> > completed.  But some_fs_function() won't complete, because we're
> > holding some-fs-lock.
> > 
> 
> In my experience during this series, most of the kmem allocation here

"most"?

> will be filesystem related. This means that we will allocate that with
> GFP_FS on.

eh?  filesystems do a tremendous amount of GFP_NOFS allocation.  

akpm3:/usr/src/25> grep GFP_NOFS fs/*/*.c|wc -l
898

> If we don't do anything like that, reclaim is almost
> pointless since it will never free anything (only once here and there
> when the allocation is not from fs).

It depends what you mean by "reclaim".  There are a lot of things which
vmscan can do for a GFP_NOFS allocation.  Scraping clean pagecache,
clean swapcache, well-behaved (ahem) shrinkable caches.

> It tends to work just fine like this. It may very well be because fs
> people just mark everything as NOFS out of safety and we aren't *really*
> holding any locks in common situations, but it will blow up in our faces
> in a subtle way (which none of us want).
> 
> That said, suggestions are more than welcome.

At a minimum we should remove all the schedule_work() stuff, call the
callback function synchronously and add

	/* This code is full of deadlocks */

Sorry, this part of the patchset is busted and needs a fundamental
rethink.
