On Thu, May 28, 2020 at 4:32 PM Khalid Aziz <khalid.aziz@xxxxxxxxxx> wrote:
>
> This looks good to me. I like the idea overall of controlling
> aggressiveness of compaction with a single tunable for the whole
> system. I wonder how an end user could arrive at what a reasonable
> value would be for this based upon their workload. More comments below.
>

Tunables like the one this patch introduces, and similar ones like
'swappiness', will always require some experimentation from the user.

> On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. Linux kernel currently does on-demand
> > compaction as we request more hugepages, but this style of compaction
> > incurs very high latency. Experiments with one-time full memory
> > compaction (followed by hugepage allocations) show that kernel is able
> > to restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory as
> > hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness' which dictates bounds for external
> > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
> > maintain.
> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes value in range [0, 100], with a default of 20.
>
> Looking at the code, setting this to 100 would mean system would
> continuously strive to drive level of fragmentation down to 0 which can
> not be reasonable and would bog the system down. A cap lower than 100
> might be a good idea to keep kcompactd from dragging system down.
>

Yes, I understand that a value of 100 would be a continuous compaction
storm, but I still don't want to artificially cap the tunable. The
interpretation of this tunable can change in the future, and a range of
[0, 100] seems more intuitive than, say, [0, 90]. Still, I think a word
of caution should be added to its documentation
(admin-guide/sysctl/vm.rst).

> >
> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> >   echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"
>

Oops... I forgot to update the patch description. This is from the v4
patch, which used sysfs; v5 switched to sysctl.

> >
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..e5d88cabe980 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is available in contiguous
> >  blocks where possible. This can be important for example in the allocation of
> >  huge pages although processes will also directly compact memory as required.
> >
> > +compaction_proactiveness
> > +========================
> > +
> > +This tunable takes a value in the range [0, 100] with a default value of
> > +20. This tunable determines how aggressively compaction is done in the
> > +background. Setting it to 0 disables proactive compaction.
> > +
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> > +
>
> Value of 100 would cause kcompactd to try to bring fragmentation down
> to 0. If hugepages are being consumed and released continuously by the
> workload, it is possible that kcompactd keeps making progress (and
> hence passes the test "proactive_defer = score < prev_score ?")
> continuously but can not reach a fragmentation score of 0 and hence
> gets stuck in compact_zone() for a long time. Page migration for
> compaction is not inexpensive. Maybe either cap the value to something
> less than 100 or set a floor for wmark_low above 0.
>
> Some more guidance regarding the value for this tunable might be
> helpful here, something along the lines of what does a value of 100
> mean in terms of how kcompactd will behave. It can then give end user a
> better idea of what they are getting at what cost. You touch upon the
> cost above. Just add some more details so an end user can get a better
> idea of size of the cost for higher values of this tunable.
>

I like the idea of clamping wmark_low to a floor of, say, 5 to prevent
admins from overloading the system. Similarly, wmark_high should be
capped at, say, 95 so that tunable values below 10 still have an effect:
currently, such low tunable values give wmark_high=100, which causes
proactive compaction to never get triggered.
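
To make that concrete, here is a rough sketch of the clamping I have in
mind (not the final patch: the simplified fragmentation_score_wmark()
below assumes the sysctl value lands in a sysctl_compaction_proactiveness
variable as in v5, and the exact clamp values of 5 and 95 are still open
for discussion):

/*
 * Sketch only: derive the low/high fragmentation score watermarks from
 * the proactiveness tunable and clamp them so that extreme tunable
 * values neither demand a score of 0 nor push the high watermark beyond
 * what a zone's score can ever exceed. Assumes the caller skips
 * proactive compaction entirely when the tunable is 0, as documented.
 */
static unsigned int fragmentation_score_wmark(bool low)
{
	unsigned int wmark_low;

	/* Floor of 5: proactiveness=100 no longer asks for a score of 0. */
	wmark_low = max(100U - sysctl_compaction_proactiveness, 5U);

	/* Cap of 95: proactiveness < 10 can still trigger compaction. */
	return low ? wmark_low : min(wmark_low + 10U, 95U);
}

With these clamps, proactiveness=100 maps to wmark_low=5 / wmark_high=15
instead of 0/10, and proactiveness=5 maps to 95/95 instead of 95/100, so
both extremes keep a reachable watermark.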

Finally, I see your concern about the lack of guidance on extreme values
of the tunable. I will address this in the next (v6) iteration.

Thanks,
Nitin