Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path

Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> · Fri, 28 Aug 2015 23:32:31 +0300

On Fri, Aug 28, 2015 at 12:48:19PM -0400, Tejun Heo wrote:
...
> > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> > >   skipped.  If a process performs only speculative allocations, it can
> > >   blow way past the high limit.  This is actually easily reproducible
> > >   by simply doing "find /".  VFS tries speculative !__GFP_WAIT
> > >   allocations first, so as long as there's memory which can be
> > >   consumed without blocking, it can keep allocating memory regardless
> > >   of the high limit.
> > 
> > I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> > in a row - they should still alternate with normal __GFP_WAIT
> > allocations. Yes, that means we can breach memory.high threshold for a
> > short period of time, but it isn't a hard limit, so it looks perfectly
> > fine to me.
> > 
> > I tried to run `find /` over ext4 in a cgroup with memory.high set to
> > 32M and kmem accounting enabled. With such a setup memory.current never
> > got higher than 33152K, which is only 384K greater than the memory.high.
> > Which FS did you use?
> 
> ext4.  Here, it goes onto happily consuming hundreds of megabytes with
> limit set at 32M.  We have quite a few places where !__GFP_WAIT
> allocations are performed speculatively in hot paths with fallback
> slow paths, so this is bound to happen somewhere.

What kind of workload should it be then? `find` will constantly invoke
d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
to perform reclaim...

OK, I tried to reproduce the issue on the latest mainline kernel and ...
succeeded - memory.current did occasionally jump up to ~55M although
memory.high was set to 32M. Hmm, strange... Started to investigate.
Printed stack traces and found that we don't invoke memcg reclaim on
normal GFP_KERNEL allocations! How is that? The thing is there was a
commit that made SLUB (not VFS or any other kmem user, but core SLUB)
try to allocate high order slab pages w/o __GFP_WAIT for performance
reasons. That broke kmemcg case. Here it goes:

commit 6af3142bed1f520b90f4cdb6cd10bbd16906ce9a
Author: Joonsoo Kim <js1304@xxxxxxxxx>
Date:   Tue Aug 25 00:03:52 2015 +0000

    mm/slub: don't wait for high-order page allocation

I suspect your kernel has this commit included, because w/o it I haven't
managed to catch anything nearly as bad as you describe: the memory.high
excess reached 1-2 Mb at max, but never "hundreds of megabytes". If so,
we'd better fix that instead. Actually, it's worth fixing anyway. What
about the patch below?
---
From: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
Date: Fri, 28 Aug 2015 23:17:19 +0300
Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page
 allocation

Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high order slab pages
w/o __GFP_WAIT in order to avoid invoking reclaim/compaction when we can
fall back on low order pages. However, it broke kmemcg/memory.high
logic. The latter works as a soft limit: an allocation won't fail if it
is breached, but we call direct reclaim to compensate the excess. W/o
__GFP_WAIT we can't invoke reclaimer and therefore we will just go on,
exceeding memory.high more and more until a normal __GFP_WAIT allocation
is issued.

Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high order page allocations w/o any
performance impact. So let's fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.

Fixes: 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation")
Signed-off-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>

diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..1b9dbad40272 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,9 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	if (memcg_charge_slab(s, flags, order))
 		return NULL;
 
+	if ((flags & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
+		flags = (flags | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
+
 	if (node == NUMA_NO_NODE)
 		page = alloc_pages(flags, order);
 	else
@@ -1364,8 +1367,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * so we fall-back to the minimum order allocation.
 	 */
 	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
-	if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
-		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
 
 	page = alloc_slab_page(s, alloc_gfp, node, oo);
 	if (unlikely(!page)) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>