On Thu, Jun 12, 2014 at 04:02:04PM +0200, Vlastimil Babka wrote:
> On 06/12/2014 01:49 AM, Minchan Kim wrote:
> >On Wed, Jun 11, 2014 at 02:22:30PM +0200, Vlastimil Babka wrote:
> >>On 06/11/2014 03:10 AM, Minchan Kim wrote:
> >>>On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
> >>>>Async compaction aborts when it detects zone lock contention or when
> >>>>need_resched() is true. David Rientjes has reported that in practice, most
> >>>>direct async compactions for THP allocation abort due to need_resched().
> >>>>This means that a second direct compaction is never attempted, which might
> >>>>be OK for a page fault, but khugepaged is intended to attempt a sync
> >>>>compaction in such a case, and in these cases it won't.
> >>>>
> >>>>This patch replaces "bool contended" in compact_control with an enum that
> >>>>distinguishes between aborting due to need_resched() and aborting due to
> >>>>lock contention. This allows propagating the abort through all compaction
> >>>>functions as before, but declaring the direct compaction as contended only
> >>>>when lock contention has been detected.
> >>>>
> >>>>As a result, khugepaged will proceed with the second, sync compaction as
> >>>>intended when the preceding async compaction aborted due to need_resched().
> >>>
> >>>You said "second direct compaction is never attempted, which might be OK
> >>>for a page fault" and "khugepaged is intended to attempt a sync compaction",
> >>>so I feel you want to treat khugepaged specially, unlike other direct
> >>>compaction users (e.g., page fault).
> >>
> >>Well, khugepaged is my primary concern, but I imagine there are other
> >>direct compaction users besides THP page fault and khugepaged.
> >>
> >>>With this patch, direct compaction takes care only of lock contention, not
> >>>rescheduling, which raises a question:
> >>>
> >>>Is it really okay not to consider need_resched in direct compaction?
> >>
> >>It still considers need_resched() to back off from async compaction.
> >>It's only about signaling contended_compaction back to
> >>__alloc_pages_slowpath(). This code is executed after the
> >>first, async compaction fails:
> >>
> >>/*
> >> * It can become very expensive to allocate transparent hugepages at
> >> * fault, so use asynchronous memory compaction for THP unless it is
> >> * khugepaged trying to collapse.
> >> */
> >>if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
> >>	migration_mode = MIGRATE_SYNC_LIGHT;
> >>
> >>/*
> >> * If compaction is deferred for high-order allocations, it is because
> >> * sync compaction recently failed. In this is the case and the caller
> >> * requested a movable allocation that does not heavily disrupt the
> >> * system then fail the allocation instead of entering direct reclaim.
> >> */
> >>if ((deferred_compaction || contended_compaction) &&
> >>					(gfp_mask & __GFP_NO_KSWAPD))
> >>	goto nopage;
> >>
> >>Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first
> >>if() decides whether the second attempt will be sync (for
> >>khugepaged) or async (page fault). The second if() decides that if
> >>compaction was contended, then there won't be any second attempt
> >>(and no reclaim) at all. Counting need_resched() as contended in this
> >>case is bad for khugepaged. Even for page fault it means no direct
> >
> >I agree khugepaged shouldn't take need_resched, or even lock contention,
> >into account, because running it was the admin's decision.
> >If it hurts system performance, the admin should adjust the khugepaged
> >knobs.
> >
> >>reclaim and no second async compaction. David says need_resched()
> >>occurs so often that it is a poor heuristic for deciding this.
> >
> >But page fault is a bit different. High-order allocation (i.e., above
> >PAGE_ALLOC_COSTLY_ORDER) is inherently fragile, so every caller should
> >keep that in mind and prepare a fallback (e.g., a 4K allocation). Direct
> >reclaim/compaction should therefore care about latency rather than
> >success ratio.
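To make the two if() checks quoted above concrete, here is a minimal userspace model of the decision taken after the first async compaction fails. It is a sketch under stated assumptions, not kernel code: GFP_NO_KSWAPD and TASK_IS_KTHREAD are hypothetical stand-ins for __GFP_NO_KSWAPD and PF_KTHREAD (with arbitrary values), and after_async_compaction() is an illustrative name, not a kernel function.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for __GFP_NO_KSWAPD and PF_KTHREAD; values are arbitrary. */
#define GFP_NO_KSWAPD   0x1u
#define TASK_IS_KTHREAD 0x2u

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT };
enum next_step { NEXT_NOPAGE, NEXT_RECLAIM_AND_RETRY };

/*
 * Model of the two checks in __alloc_pages_slowpath() that run after the
 * first (async) compaction attempt has failed. *mode receives the mode
 * the second compaction attempt would use.
 */
static enum next_step after_async_compaction(unsigned int gfp_mask,
					     unsigned int task_flags,
					     bool deferred, bool contended,
					     enum migrate_mode *mode)
{
	/* The first attempt is always async. */
	*mode = MIGRATE_ASYNC;

	/* Retry in sync-light mode unless this is a THP page fault. */
	if (!(gfp_mask & GFP_NO_KSWAPD) || (task_flags & TASK_IS_KTHREAD))
		*mode = MIGRATE_SYNC_LIGHT;

	/* THP callers give up entirely: no reclaim and no second attempt. */
	if ((deferred || contended) && (gfp_mask & GFP_NO_KSWAPD))
		return NEXT_NOPAGE;

	return NEXT_RECLAIM_AND_RETRY;
}
```

Called with GFP_NO_KSWAPD, TASK_IS_KTHREAD and contended == true (khugepaged after an abort), the model returns NEXT_NOPAGE even though the first check had already selected MIGRATE_SYNC_LIGHT, which is the lost sync attempt described above. Reporting only lock contention as contended lets that case reach reclaim and the sync retry instead.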
>
> Yes, it's a rather delicate balance. But the plan now is to try to
> balance this differently than by using need_resched.
>
> >If need_resched is almost always true in the second attempt (i.e.,
> >synchronous compaction), it means the process has consumed its
> >timeslice, so it shouldn't be greedy and should give the CPU to others.
>
> Synchronous compaction uses cond_resched() so that's fine I think?

Sorry for not being clear. I'm posting this clarification before leaving
for the holiday. :)

When a THP page fault finds that rescheduling is needed while doing async
direct compaction, it goes to "nopage" and falls back to a 4K page.
That is fine with me.

Another topic: I couldn't find any cond_resched(). Anyway, that could be
a separate patch.

From a4b7c288d8de670adbc45c85991ed3bef31e4e16 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@xxxxxxxxxx>
Date: Fri, 13 Jun 2014 10:59:26 +0900
Subject: [PATCH] mm: call cond_resched right before failing compaction

David reported that in many cases, direct compaction for a THP page
fault fails because the async compaction was aborted by need_resched.
That is okay because THP can fall back to a 4K page, but the problem is
that if need_resched is true, we should give the next process a chance
to be scheduled in, for latency's sake, so that we are not greedy.

Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..1ac5133 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,8 +2617,16 @@ rebalance:
 	 * system then fail the allocation instead of entering direct reclaim.
 	 */
 	if ((deferred_compaction || contended_compaction) &&
-						(gfp_mask & __GFP_NO_KSWAPD))
+						(gfp_mask & __GFP_NO_KSWAPD)) {
+		/*
+		 * When THP page fault occurs in large memory system,
+		 * contended_compaction is likely to be true by need_resched
+		 * checking so let's schedule right before returning NULL page.
+		 * That makes I'm not greedy!
+		 */
+		cond_resched();
 		goto nopage;
+	}
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
-- 
2.0.0

With your change (i.e., direct compaction is only aware of lock
contention, not need_resched), when a THP page fault finds that
rescheduling is needed while doing async direct compaction, it goes to
the *direct reclaim path* instead of "nopage", then does async direct
compaction again, and only then finally goes to nopage. I think you are
changing the behavior heavily in a way that increases latency, which is
not what the direct reclaim path wants, although I have no data.

So what I want is the following. It is based on the previous inline patch.

---
 mm/page_alloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1ac5133..8a4480e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2624,8 +2624,17 @@ rebalance:
 		 * checking so let's schedule right before returning NULL page.
 		 * That makes I'm not greedy!
 		 */
-		cond_resched();
-		goto nopage;
+		int ret = cond_resched();
+
+		/* When THP page fault, we want to bail out for the latency */
+		if (!(current->flags & PF_KTHREAD) || !ret)
+			goto nopage;
+
+		/*
+		 * I'm khugepaged and took a rest so want to try compaction
+		 * with synchronous rather than giving up easily.
+		 */
+		WARN_ON(migration_mode == MIGRATE_ASYNC);
 	}
 
 	/* Try direct reclaim and then allocating */
-- 
2.0.0

I'm off from now on. :)

>
> >I don't mean we should abort, but the process could sleep and retry.
> >The point is that we should give the latency pain to the process
> >requesting the high-order allocation, not to another random process.
>
> So basically you are saying that there should be cond_resched() also
> for async compaction when need_resched() is true? Now need_resched()
> is a trigger to back off rather quickly all the way back to
> __alloc_pages_direct_compact() which does contain a cond_resched().
> So there should be a yield before retry.
> Or are you worried that the
> back off is not quick enough and it should cond_resched()
> immediately?
>
> >IMHO, if we want to increase the high-order allocation success ratio for
> >page faults, kswapd should be more aggressive than it is now, via a
> >feedback loop driven by the direct compaction failure rate.
>
> Recently, I think, we have rather been decreasing the high-order
> allocation success ratio for page faults :) But (at least for THP) page
> fault allocation attempts contain __GFP_NO_KSWAPD, so there's no feedback
> loop. I guess changing that would be rather disruptive.
>
> >>
> >>>We take care of it in the direct reclaim path, so why is direct
> >>>compaction so special?
> >>
> >>I admit I'm not that familiar with reclaim, but I didn't quickly find
> >>any need_resched() there. There's plenty of cond_resched(), but that
> >>doesn't mean it will abort, does it? Could you explain it to me?
> >
> >I meant cond_resched.
> >
> >>
> >>>Why does khugepaged give up so easily if lock contention/need_resched
> >>>happens? As I read your description, success ratio is what matters for
> >>>khugepaged, so IMO khugepaged should compact synchronously without
> >>>bailing out early on lock contention or rescheduling.
> >>
> >>Well, a stupid answer is that that's how __alloc_pages_slowpath() works :)
> >>I don't think it's bad to try a more lightweight approach first, before
> >>trying the heavyweight one, as long as the heavyweight one is not
> >>skipped for khugepaged.
> >
> >I'm not saying the current two-stage approach is bad. My position is that
> >we should take care of need_resched and not be greedy, but for
> >khugepaged it would be okay.
> >
> >>
> >>>If it causes problems, the user should increase
> >>>scan_sleep_millisecs/alloc_sleep_millisecs, which are exactly the knobs
> >>>for such cases.
> >>>
> >>>So, my point is: how about making khugepaged always do dumb synchronous
> >>>compaction, through PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
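The bail-out in Minchan's second inline patch (the one built on top of the cond_resched() patch) can be modelled the same way. This is a minimal sketch, assuming that cond_resched() returns nonzero only when the task actually gave up the CPU, and that PF_KTHREAD is an adequate proxy for khugepaged, as in the patch itself. thp_retry_policy() and its parameters are hypothetical names; the side effect of actually yielding is not modelled.

```c
#include <stdbool.h>

enum next_step { NEXT_NOPAGE, NEXT_RECLAIM_AND_RETRY };

/*
 * Model of the bail-out in the second inline patch.
 * is_thp:      the allocation carries __GFP_NO_KSWAPD
 * is_kthread:  current->flags has PF_KTHREAD (proxy for khugepaged)
 * resched_ret: return value of cond_resched(), assumed nonzero only
 *              if the task actually rescheduled
 */
static enum next_step thp_retry_policy(bool deferred, bool contended,
				       bool is_thp, bool is_kthread,
				       int resched_ret)
{
	if ((deferred || contended) && is_thp) {
		/* Page faults bail out for latency; so does a kthread that did not yield. */
		if (!is_kthread || !resched_ret)
			return NEXT_NOPAGE;
		/* khugepaged yielded once: go on to reclaim and sync compaction. */
	}
	return NEXT_RECLAIM_AND_RETRY;
}
```

Under these assumptions, a THP page fault always ends in NEXT_NOPAGE after a contended or deferred failure, while khugepaged continues to reclaim and sync compaction only if it actually yielded the CPU first.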
> >>>
> >>>>
> >>>>Reported-by: David Rientjes <rientjes@xxxxxxxxxx>
> >>>>Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> >>>>Cc: Minchan Kim <minchan@xxxxxxxxxx>
> >>>>Cc: Mel Gorman <mgorman@xxxxxxx>
> >>>>Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >>>>Cc: Michal Nazarewicz <mina86@xxxxxxxxxx>
> >>>>Cc: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> >>>>Cc: Christoph Lameter <cl@xxxxxxxxx>
> >>>>Cc: Rik van Riel <riel@xxxxxxxxxx>
> >>>>---
> >>>> mm/compaction.c | 20 ++++++++++++++------
> >>>> mm/internal.h   | 15 +++++++++++----
> >>>> 2 files changed, 25 insertions(+), 10 deletions(-)
> >>>>
> >>>>diff --git a/mm/compaction.c b/mm/compaction.c
> >>>>index b73b182..d37f4a8 100644
> >>>>--- a/mm/compaction.c
> >>>>+++ b/mm/compaction.c
> >>>>@@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
> >>>> }
> >>>> #endif /* CONFIG_COMPACTION */
> >>>>
> >>>>-static inline bool should_release_lock(spinlock_t *lock)
> >>>>+enum compact_contended should_release_lock(spinlock_t *lock)
> >>>> {
> >>>>-	return need_resched() || spin_is_contended(lock);
> >>>>+	if (need_resched())
> >>>>+		return COMPACT_CONTENDED_SCHED;
> >>>>+	else if (spin_is_contended(lock))
> >>>>+		return COMPACT_CONTENDED_LOCK;
> >>>>+	else
> >>>>+		return COMPACT_CONTENDED_NONE;
> >>>> }
> >>>>
> >>>> /*
> >>>>@@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> >>>> static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>>> 				bool locked, struct compact_control *cc)
> >>>> {
> >>>>-	if (should_release_lock(lock)) {
> >>>>+	enum compact_contended contended = should_release_lock(lock);
> >>>>+
> >>>>+	if (contended) {
> >>>> 		if (locked) {
> >>>> 			spin_unlock_irqrestore(lock, *flags);
> >>>> 			locked = false;
> >>>>@@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>>>
> >>>> 		/* async aborts if taking too long or contended */
> >>>> 		if (cc->mode == MIGRATE_ASYNC) {
> >>>>-			cc->contended = true;
> >>>>+			cc->contended = contended;
> >>>> 			return false;
> >>>> 		}
> >>>>
> >>>>@@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> >>>> 	/* async compaction aborts if contended */
> >>>> 	if (need_resched()) {
> >>>> 		if (cc->mode == MIGRATE_ASYNC) {
> >>>>-			cc->contended = true;
> >>>>+			cc->contended = COMPACT_CONTENDED_SCHED;
> >>>> 			return true;
> >>>> 		}
> >>>>
> >>>>@@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> >>>> 	VM_BUG_ON(!list_empty(&cc.freepages));
> >>>> 	VM_BUG_ON(!list_empty(&cc.migratepages));
> >>>>
> >>>>-	*contended = cc.contended;
> >>>>+	/* We only signal lock contention back to the allocator */
> >>>>+	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
> >>>> 	return ret;
> >>>> }
> >>>>
> >>>>diff --git a/mm/internal.h b/mm/internal.h
> >>>>index 7f22a11f..4659e8e 100644
> >>>>--- a/mm/internal.h
> >>>>+++ b/mm/internal.h
> >>>>@@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
> >>>>
> >>>> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> >>>>
> >>>>+/* Used to signal whether compaction detected need_resched() or lock contention */
> >>>>+enum compact_contended {
> >>>>+	COMPACT_CONTENDED_NONE = 0,	/* no contention detected */
> >>>>+	COMPACT_CONTENDED_SCHED,	/* need_resched() was true */
> >>>>+	COMPACT_CONTENDED_LOCK,		/* zone lock or lru_lock was contended */
> >>>>+};
> >>>>+
> >>>> /*
> >>>>  * in mm/compaction.c
> >>>>  */
> >>>>@@ -144,10 +151,10 @@ struct compact_control {
> >>>> 	int order;			/* order a direct compactor needs */
> >>>> 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> >>>> 	struct zone *zone;
> >>>>-	bool contended;			/* True if a lock was contended, or
> >>>>-					 * need_resched() true during async
> >>>>-					 * compaction
> >>>>-					 */
> >>>>+	enum compact_contended contended; /* Signal need_resched() or lock
> >>>>+					   * contention detected during
> >>>>+					   * compaction
> >>>>+					   */
> >>>> };
> >>>>
> >>>> unsigned long
> >>>>-- 
> >>>>1.8.4.5
> >>>>
> >>>>--
> >>>>To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >>>>the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> >>>>see: http://www.linux-mm.org/ .
> >>>>Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
> >>>
> >>
> >

-- 
Kind regards,
Minchan Kim
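For reference, the classification introduced by Vlastimil's patch quoted above can be modelled in plain C. This is a minimal sketch under stated assumptions: the kernel predicates need_resched() and spin_is_contended() are replaced by boolean parameters, and classify_contention(), reported_to_allocator() and reported_before_patch() are hypothetical names, not kernel functions. Only the enumerators mirror the patch.

```c
#include <stdbool.h>

/* Same enumerators as the patch's enum compact_contended. */
enum compact_contended {
	COMPACT_CONTENDED_NONE = 0,	/* no contention detected */
	COMPACT_CONTENDED_SCHED,	/* need_resched() was true */
	COMPACT_CONTENDED_LOCK,		/* zone lock or lru_lock was contended */
};

/*
 * Model of the patched should_release_lock(): the kernel predicates
 * need_resched() and spin_is_contended() become plain parameters.
 */
static enum compact_contended classify_contention(bool resched_pending,
						  bool lock_contended)
{
	if (resched_pending)
		return COMPACT_CONTENDED_SCHED;
	if (lock_contended)
		return COMPACT_CONTENDED_LOCK;
	return COMPACT_CONTENDED_NONE;
}

/* Model of what compact_zone_order() reports back to the allocator after the patch. */
static bool reported_to_allocator(enum compact_contended c)
{
	return c == COMPACT_CONTENDED_LOCK;
}

/* Pre-patch behaviour: a single bool covering both causes. */
static bool reported_before_patch(bool resched_pending, bool lock_contended)
{
	return resched_pending || lock_contended;
}
```

Note the ordering: because need_resched() is tested first, a lock that is contended while a reschedule is also pending is classified as COMPACT_CONTENDED_SCHED and therefore not reported to the allocator. The model also leaves out the MIGRATE_ASYNC condition; in the patch, cc->contended is only set when compaction runs in async mode.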