Re: [RFC PATCH 0/6] proactive kcompactd

David Rientjes <rientjes@xxxxxxxxxx> · Tue, 22 Aug 2017 13:57:14 -0700 (PDT)

On Mon, 21 Aug 2017, Johannes Weiner wrote:

> > I think I would have liked to have seen "less proactive" :)
> > 
> > Kcompactd currently has the problem that it is MIGRATE_SYNC_LIGHT so it 
> > continues until it can defragment memory.  On a host with 128GB of memory 
> > and 100GB of it sitting in a hugetlb pool, we constantly get kcompactd 
> > wakeups for order-2 memory allocation.  The stats are pretty bad:
> > 
> > compact_migrate_scanned 2931254031294 
> > compact_free_scanned    102707804816705 
> > compact_isolated        1309145254 
> > 
> > 0.0012% of memory scanned is ever actually isolated.  We constantly see 
> > very high cpu for compaction_alloc() because kcompactd is almost always 
> > running in the background and iterating most memory completely needlessly 
> > (define needless as 0.0012% of memory scanned being isolated).
> 
> The free page scanner will inevitably wade through mostly used memory,
> but 0.0012% is lower than what systems usually have free. I'm guessing
> this is because of concurrent allocation & free cycles racing with the
> scanner? There could also be an issue with how we do partial scans.
> 

More than 90% of this system's memory is in the hugetlbfs pool so the 
freeing scanner needlessly scans over it.  Because kcompactd does 
MIGRATE_SYNC_LIGHT compaction, it doesn't stop iterating until the 
allocation is successful at pgdat->kcompactd_max_order or the migration 
and freeing scanners meet.  This is normally all memory.
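
For reference, the 0.0012% figure quoted above can be reproduced from the 
three vmstat counters.  This is only a back-of-the-envelope sketch: the 
helper name is made up for illustration, and it assumes "scanned" means the 
sum of both scanners' counters.

```python
# Counter values copied from the vmstat output quoted earlier in this mail.
COMPACT_MIGRATE_SCANNED = 2931254031294
COMPACT_FREE_SCANNED = 102707804816705
COMPACT_ISOLATED = 1309145254


def isolation_efficiency_pct(migrate_scanned, free_scanned, isolated):
    """Percentage of scanned pages that were actually isolated.

    Hypothetical helper: "scanned" is taken to be the sum of the migration
    and free scanner counters, which is an assumption about how the quoted
    percentage was derived.
    """
    scanned = migrate_scanned + free_scanned
    if scanned == 0:
        return 0.0
    return 100.0 * isolated / scanned


if __name__ == "__main__":
    pct = isolation_efficiency_pct(
        COMPACT_MIGRATE_SCANNED, COMPACT_FREE_SCANNED, COMPACT_ISOLATED
    )
    print(f"{pct:.4f}% of scanned pages were isolated")
```

Note that the free scanner accounts for roughly 97% of the pages scanned, 
which is what you would expect when it has to walk over the hugetlb pool.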

Because of MIGRATE_SYNC_LIGHT, kcompactd does respect deferred compaction 
and will avoid doing compaction at all for the next 
1 << COMPACT_MAX_DEFER_SHIFT wakeups, but while the rest of userspace that 
is not mapping hugetlbfs memory tries to fault THP, kcompactd still runs 
almost nonstop at 100% of cpu.
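
To make the deferral window concrete, here is a simplified userspace model 
of the backoff.  It is a sketch, not the kernel code: the class and function 
names are invented, the per-order check that the kernel also does is left 
out, and COMPACT_MAX_DEFER_SHIFT = 6 is my recollection of the value in 
include/linux/compaction.h.

```python
COMPACT_MAX_DEFER_SHIFT = 6  # assumed value; check include/linux/compaction.h


class ZoneDeferState:
    """Toy model of a zone's compaction deferral counters."""

    def __init__(self):
        self.considered = 0   # attempts seen since the last failure
        self.defer_shift = 0  # log2 of the current deferral window

    def defer(self):
        """Record a failed compaction: double the window, up to the cap."""
        self.considered = 0
        self.defer_shift = min(self.defer_shift + 1, COMPACT_MAX_DEFER_SHIFT)

    def deferred(self):
        """Return True if this attempt should be skipped."""
        limit = 1 << self.defer_shift
        self.considered = min(self.considered + 1, limit)
        return self.considered < limit


def skipped_after_failures(failures):
    """Count attempts skipped after `failures` consecutive failures."""
    zone = ZoneDeferState()
    for _ in range(failures):
        zone.defer()
    skipped = 0
    while zone.deferred():
        skipped += 1
    return skipped


if __name__ == "__main__":
    for n in (1, 3, 6, 10):
        print(n, "failures ->", skipped_after_failures(n), "skipped attempts")
```

In this model the window stops growing once the shift hits the cap, so no 
matter how often compaction fails, a full scan is retried again after at 
most 1 << COMPACT_MAX_DEFER_SHIFT attempts.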

Although this might not be a typical configuration, it can easily be used 
to demonstrate how inefficiently kcompactd behaves under load when only a 
small amount of memory is free or memory cannot be isolated because it's 
pinned.  vm.extfrag_threshold isn't an adequate solution.

> Anyway, we've also noticed scalability issues with the current scanner
> on 128G and 256G machines. Even with a better efficiency - finding the
> 1% of free memory, that's still a ton of linear search space.
> 

Agreed.
