On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
> On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@xxxxxxxxx> wrote:
> > Here's an update:
> >
> > Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> > Ceph brings no perceived improvement over the chassis running 3.10 kernels.
> >
> > Here is a graph of *system* CPU time (not user); note that 3.18 was a005.block:
> >
> > http://ponies.io/raw/cluster.png
> >
> > It is perhaps faring a little better than the chassis running 3.10, in
> > that it did not have min_free_kbytes raised to 2GB as the others did;
> > instead it was sitting around 90MB.
> >
> > The perf recording did look a little different. Not sure if this was just
> > the luck of the draw in how the fractal rendering works:
> >
> > http://ponies.io/raw/perf-3.10.png
> >
> > Any pointers on how we can track this down? There are at least three of us
> > following this now, so we should have plenty of area to test.
>
> Checked against 3.16 (3.17 hung due to an unrelated problem); the issue
> is present on single- and two-headed systems as well. Ceph users have
> reported the problem on 3.17 too, so we are probably facing a generic
> compaction issue.

Hello,

I didn't follow this discussion closely, but, at a glance, this excessive
CPU usage by compaction looks related to the following fixes. Could you
test the two patches below? If they fix your problem, I will resubmit
them with proper commit descriptions.

Thanks.

-------->8-------------
>From 079f3f119f1e3cbe9d981e7d0cada94e0c532162 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Date: Fri, 28 Nov 2014 16:36:00 +0900
Subject: [PATCH 1/2] mm/compaction: fix wrong order check in compact_finished()

What we want to check here is whether there is a high-order free page in
the buddy list of another migratetype, so that we can steal it without
causing fragmentation. However, the current code checks cc->order, which
is the allocation request order, so the check is wrong.
Without this fix, non-movable synchronous compaction below pageblock
order would not stop until compaction completes, because the migratetype
of most pageblocks is movable and cc->order is always below pageblock
order in this case.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b544d61..052194f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1082,7 +1082,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 			return COMPACT_PARTIAL;

 		/* Job done if allocation would set block type */
-		if (cc->order >= pageblock_order && area->nr_free)
+		if (order >= pageblock_order && area->nr_free)
 			return COMPACT_PARTIAL;
 	}
--
1.7.9.5

-------->8-------------
>From e3a5280747c4d0d12c67ad83f0f3dc5dce0ff11e Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Date: Fri, 28 Nov 2014 16:44:30 +0900
Subject: [PATCH 2/2] mm/page_alloc: don't do heavy compaction if we have a fallback method

SLUB sometimes uses a high-order allocation for the slab in order to
reduce fragmentation, but it has a fallback method because a high-order
allocation can be hard to satisfy and has a big impact on system
performance. The current allocation logic in the page allocator cannot
filter out such requests properly, so a high-order request from SLUB
invokes synchronous compaction, which is a really heavy hammer. SLUB
works well without high-order allocations, so this patch filters those
requests out.

At a quick grep, other allocation requests with these gfp flags also
have a fallback method. I don't know whether all of them do, but
checking for __GFP_NOWARN + __GFP_NORETRY looks like a reasonable way
to avoid the heavy hammer.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10310ad..e719f79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2829,6 +2829,14 @@ rebalance:
 		goto rebalance;
 	} else {
 		/*
+		 * Certain gfp_mask flags indicate that the allocation
+		 * requestor has a proper fallback method, so we can
+		 * stop the hard work here. See mm/slub.c for an example.
+		 */
+		if (gfp_mask & __GFP_NORETRY && gfp_mask & __GFP_NOWARN)
+			goto nopage;
+
+		/*
 		 * High-order allocations do not necessarily loop after
 		 * direct reclaim and reclaim/compaction depends on compaction
 		 * being called after reclaim so call directly if necessary
--
1.7.9.5