On Sun, 9 May 2021, Andrew Morton wrote:

> > Currently the proactive compaction order is fixed to
> > COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of
> > normal 4KB memory, but it's too high for the machines with small
> > normal memory, for example the machines with most memory configured
> > as 1GB hugetlbfs huge pages. In these machines the max order of
> > free pages is often below 9, and it's always below 9 even with hard
> > compaction. This will lead to proactive compaction be triggered very
> > frequently. In these machines we only care about order of 3 or 4.
> > This patch export the oder to proc and let it configurable
> > by user, and the default value is still COMPACTION_HPAGE_ORDER.
>
> It would be great to do this automatically? It's quite simple to see
> when memory is being handed out to hugetlbfs - so can we tune
> proactive_compaction_order in response to this? That would be far
> better than adding a manual tunable.
>
> But from having read Khalid's comments, that does sound quite involved.
> Is there some partial solution that we can come up with that will get
> most people out of trouble?
>
> That being said, this patch is super-super-simple so perhaps we should
> just merge it just to get one person (and hopefully a few more) out of
> trouble. But on the other hand, once we add a /proc tunable we must
> maintain that tunable for ever (or at least a very long time) even if
> the internal implementations change a lot.
>

As mentioned in v3 of the patch, I'm not sure why this belongs in the kernel at all. I understand that the system is largely consumed by 1GB gigantic pages and that a small percentage of memory is left for native pages. Thus, fragmentation readily occurs and can affect large order allocations even at the levels of order-3 or order-4.

So it seems like the ideal solution would be to monitor the fragmentation index at the order you care about (the same order you would use for this new tunable) and have root userspace trigger compaction manually when necessary. When this was brought up, it was commented that explicitly triggered compaction is too expensive to do all in one iteration. That's fair enough, but shouldn't the fix then be to improve explicitly triggered compaction through sysfs so that it offers a shorter-term (or weaker) form of compaction, rather than to build additional policy decisions into the kernel? Done this way, there would be a clear separation between mechanism and policy, and the kernel would not need to carry these sysctls to tune very niche areas.
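
To make the userspace side concrete, here is a rough sketch of such a monitor, not a tested tool. It assumes debugfs is mounted at /sys/kernel/debug so that extfrag/extfrag_index is readable, and it uses the existing full-compaction trigger /proc/sys/vm/compact_memory, which is exactly the expensive all-at-once form discussed above rather than the weaker one I am suggesting. The function names, the order of 4, and the 0.5 threshold are illustrative assumptions only.

```python
import re
from pathlib import Path

# Assumed locations: debugfs mounted at /sys/kernel/debug, CONFIG_COMPACTION=y.
EXTFRAG_INDEX = Path("/sys/kernel/debug/extfrag/extfrag_index")
COMPACT_MEMORY = Path("/proc/sys/vm/compact_memory")

# Each line looks like: "Node 0, zone   Normal -1.000 -1.000 0.612 ..."
# with one value per order, starting at order 0.
LINE_RE = re.compile(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)")


def parse_extfrag_index(text):
    """Return {(node, zone): [index for order 0, 1, 2, ...]}."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        node, zone, values = int(m.group(1)), m.group(2), m.group(3)
        result[(node, zone)] = [float(v) for v in values.split()]
    return result


def needs_compaction(indexes, order, threshold):
    """True if any zone's index at `order` is above `threshold`.

    An index of -1.000 means an allocation of that order would succeed;
    values approaching 1 mean a failure would be due to fragmentation.
    """
    return any(
        len(vals) > order and vals[order] > threshold
        for vals in indexes.values()
    )


def poll_once(order=4, threshold=0.5):
    """Read the index once and trigger full compaction if it is too high.

    Needs root. The policy (order, threshold, polling interval) lives
    entirely in userspace; the kernel only supplies the mechanism.
    """
    indexes = parse_extfrag_index(EXTFRAG_INDEX.read_text())
    if needs_compaction(indexes, order, threshold):
        COMPACT_MEMORY.write_text("1")
        return True
    return False
```

A daemon would call poll_once() on a timer. With a cheaper per-call compaction interface, only the write in poll_once() would need to change, and the order and threshold would remain a userspace decision.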