On Tue, 17 Sep 2019, John Hubbard wrote:

> > We've had good success with periodically compacting memory on a regular
> > cadence on systems with hugepages enabled. The cadence itself is defined
> > by the admin but it causes khugepaged[*] to periodically wakeup and invoke
> > compaction in an attempt to keep zones as defragmented as possible
>
> That's an important data point, thanks for reporting it.
>
> And given that we have at least one data point validating it, I think we
> should feel fairly comfortable with this approach. Because the sys admin
> probably knows when are the best times to steal cpu cycles and recover
> some huge pages. Unlike the kernel, the sys admin can actually see the
> future sometimes, because he/she may know what is going to be run.
>
> It's still sounding like we can expect excellent results from simply
> defragmenting from user space, via a chron job and/or before running
> important tests, rather than trying to have the kernel guess whether
> it's a performance win to defragment at some particular time.
>
> Are you using existing interfaces, or did you need to add something? How
> exactly are you triggering compaction?
>

It's possible to do this through a cron job but there are a few reasons
that we preferred to do it through khugepaged:

 - we use a lighter variation of compaction, MIGRATE_SYNC_LIGHT, than what
   the per-node trigger provides since compact_node() forces MIGRATE_SYNC
   and can stall for minutes and become disruptive under some
   circumstances,

 - we do not ignore the pageblock skip hint which compact_node() hardcodes
   to ignore, and

 - we didn't want to do this in process context so that the cpu time is
   not taxed to any user cgroup since it's on behalf of the system as a
   whole.

It seems much better to do this on a per-node basis, which partitions the
work, rather than through the sysctl that compacts the whole system.

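For comparison, the stock per-node trigger can be driven from userspace
roughly as below. This is only a sketch of the existing interface, not the
khugepaged-based approach described above: it assumes the kernel exposes
/sys/devices/system/node/nodeN/compact (NUMA with compaction enabled), it
needs root, and it goes through compact_node(), so it carries the
MIGRATE_SYNC and skip-hint behavior listed above. The function names and
the sysfs_root parameter are made up for illustration.

```python
import glob
import os
import re

# Default location of the per-node sysfs directories (assumed layout).
NODE_SYSFS_ROOT = "/sys/devices/system/node"


def node_compact_paths(sysfs_root=NODE_SYSFS_ROOT):
    """Return {node_id: path} for every node directory that has a 'compact' file."""
    paths = {}
    for node_dir in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(node_dir))
        compact = os.path.join(node_dir, "compact")
        if match and os.path.exists(compact):
            paths[int(match.group(1))] = compact
    return paths


def trigger_node_compaction(node_id, sysfs_root=NODE_SYSFS_ROOT):
    """Write to one node's 'compact' file; the write itself is the trigger."""
    path = os.path.join(sysfs_root, f"node{node_id}", "compact")
    with open(path, "w") as f:
        f.write("1")
```

A cron job could call trigger_node_compaction() for one node per run to
spread the work out, rather than compacting every node at once through
/proc/sys/vm/compact_memory.
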
Extending the per-node interface to use MIGRATE_SYNC_LIGHT and to respect
the pageblock skip hint is possible, but the work would still be done in
process context. If triggered from userspace, the task would need to be
attached to a cgroup chosen so that usage on behalf of the entire system
is not taxed to a user cgroup.

Again, we're using khugepaged and allowing the period to be defined through
/sys/kernel/mm/transparent_hugepage/khugepaged, but that is because we only
want to do this on systems where we want to dynamically allocate hugepages
on a regular basis.
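
The cgroup point can be sketched as follows for the userspace case. This
is an illustration under assumptions, not something taken from our setup:
it assumes cgroup v2, a dedicated cgroup directory that the admin has
already created for system-wide housekeeping (the path is hypothetical),
and the stock per-node compact file. The function names are made up. The
task moves itself into that cgroup before triggering compaction, so the
cpu time lands there and not in a user cgroup.

```python
import os


def attach_self_to_cgroup(cgroup_dir):
    """Move the calling process into cgroup_dir by writing its PID to cgroup.procs."""
    pid = os.getpid()
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return pid


def compact_node_from_cgroup(node_id, cgroup_dir,
                             sysfs_root="/sys/devices/system/node"):
    """Attach to the housekeeping cgroup first, then trigger per-node compaction."""
    pid = attach_self_to_cgroup(cgroup_dir)
    # Same stock trigger as the per-node interface: any write starts compaction.
    with open(os.path.join(sysfs_root, f"node{node_id}", "compact"), "w") as f:
        f.write("1")
    return pid
```

For example, compact_node_from_cgroup(0, "/sys/fs/cgroup/housekeeping")
would compact node 0 from a hypothetical "housekeeping" cgroup. The work
still runs in process context, which is the limitation noted above.
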