Re: [RFC] mm: Proactive compaction

David Rientjes <rientjes@xxxxxxxxxx> · Tue, 17 Sep 2019 13:26:52 -0700 (PDT)

On Tue, 17 Sep 2019, John Hubbard wrote:

> > We've had good success with periodically compacting memory on a regular 
> > cadence on systems with hugepages enabled.  The cadence itself is defined 
> > by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> > compaction in an attempt to keep zones as defragmented as possible 
> 
> That's an important data point, thanks for reporting it. 
> 
> And given that we have at least one data point validating it, I think we
> should feel fairly comfortable with this approach. Because the sys admin 
> probably knows  when are the best times to steal cpu cycles and recover 
> some huge pages. Unlike the kernel, the sys admin can actually see the 
> future sometimes, because he/she may know what is going to be run.
> 
> It's still sounding like we can expect excellent results from simply 
> defragmenting from user space, via a cron job and/or before running
> important tests, rather than trying to have the kernel guess whether 
> it's a performance win to defragment at some particular time.
> 
> Are you using existing interfaces, or did you need to add something? How
> exactly are you triggering compaction?
> 

It's possible to do this through a cron job but there are a few reasons 
that we preferred to do it through khugepaged:

 - we use a lighter variation of compaction, MIGRATE_SYNC_LIGHT, than what 
   the per-node trigger provides since compact_node() forces MIGRATE_SYNC
   and can stall for minutes and become disruptive under some
   circumstances,

 - we do not ignore the pageblock skip hint which compact_node() hardcodes 
   to ignore, and 

 - we didn't want to do this in process context so that the cpu time is
   not taxed to any user cgroup since it's on behalf of the system as a
   whole.
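
For reference, the userspace alternative discussed above would look roughly 
like the sketch below.  It uses the existing per-node trigger, 
/sys/devices/system/node/node<N>/compact, which is present when the kernel 
is built with compaction and NUMA support.  The helper names are made up for 
illustration, and this is the approach we chose not to use, for the reasons 
listed above: it runs in process context and goes through compact_node().

```python
import os

# Existing kernel interfaces; writing to either requires root.
NODE_SYSFS_ROOT = "/sys/devices/system/node"
COMPACT_MEMORY_SYSCTL = "/proc/sys/vm/compact_memory"  # whole-system trigger


def online_nodes(root=NODE_SYSFS_ROOT):
    """Return the sorted NUMA node ids found under root (node0, node1, ...)."""
    ids = []
    for name in os.listdir(root):
        if name.startswith("node") and name[4:].isdigit():
            ids.append(int(name[4:]))
    return sorted(ids)


def compact_node(node_id, root=NODE_SYSFS_ROOT):
    """Write "1" to one node's compact file and return the path written.

    Hypothetical helper: the write blocks until the kernel has finished
    compacting that node, and the cpu time is charged to the caller.
    """
    path = os.path.join(root, "node%d" % node_id, "compact")
    with open(path, "w") as f:
        f.write("1")
    return path
```

A cron job would call compact_node() for each id returned by online_nodes(), 
one node at a time, instead of writing to the whole-system sysctl.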

It seems much better to partition the work by doing this on a per-node 
basis rather than through the sysctl, which compacts the whole system.  
Extending the per-node interface to use MIGRATE_SYNC_LIGHT and to respect 
the pageblock skip hint is possible, but the work would still be done in 
process context, so if triggered from userspace the task would need to be 
attached to a cgroup where the usage is not taxed to any user, since it is 
done on behalf of the entire system.
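
As a rough sketch of that last point, a userspace task can move itself into 
a dedicated cgroup before triggering compaction by writing its pid to that 
cgroup's cgroup.procs file (cgroup v2).  The cgroup directory in the usage 
note is a hypothetical name, not something we ship, and the helper below is 
only an illustration.

```python
import os


def attach_to_cgroup(cgroup_dir, pid=None):
    """Move pid (default: the calling process) into the cgroup at cgroup_dir.

    cgroup_dir is assumed to be an existing cgroup v2 directory set aside
    by the admin for system-wide work; nothing here creates it.
    """
    if pid is None:
        pid = os.getpid()
    procs = os.path.join(cgroup_dir, "cgroup.procs")
    with open(procs, "w") as f:
        f.write(str(pid))
    return procs
```

For example, attach_to_cgroup("/sys/fs/cgroup/system-maintenance") before 
writing to the per-node compact file would charge the compaction cpu time to 
that (hypothetical) cgroup rather than to a user's.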

Again, we're using khugepaged and allowing the period to be defined 
through /sys/kernel/mm/transparent_hugepage/khugepaged but that is because 
we only want to do this on systems where we want to dynamically allocate 
hugepages on a regular basis.
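
To see what is tunable in that directory on a given kernel, something like 
the sketch below can dump the khugepaged files and their current values.  
The knob we use for the compaction period is not named here, so the code 
only lists whatever the running kernel exposes (upstream kernels have, for 
example, scan_sleep_millisecs and alloc_sleep_millisecs); the function name 
is made up for illustration.

```python
import os

KHUGEPAGED_DIR = "/sys/kernel/mm/transparent_hugepage/khugepaged"


def read_khugepaged_tunables(directory=KHUGEPAGED_DIR):
    """Return {file name: stripped contents} for readable files in directory."""
    values = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        try:
            with open(path) as f:
                values[name] = f.read().strip()
        except OSError:
            continue  # skip files we are not allowed to read
    return values


if __name__ == "__main__":
    if os.path.isdir(KHUGEPAGED_DIR):
        for name, value in read_khugepaged_tunables().items():
            print("%s = %s" % (name, value))
```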