On Thu, 2019-09-12 at 17:11 +0530, Bharath Vedartham wrote:
> Hi Nitin,
> On Wed, Sep 11, 2019 at 10:33:39PM +0000, Nitin Gupta wrote:
> > On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> > > On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> > > [...]
> > > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > > > For some applications we need to allocate almost all memory as
> > > > > > hugepages. However, on a running system, higher-order
> > > > > > allocations can fail if the memory is fragmented. The Linux
> > > > > > kernel currently does on-demand compaction as we request more
> > > > > > hugepages, but this style of compaction incurs very high
> > > > > > latency. Experiments with one-time full memory compaction
> > > > > > (followed by hugepage allocations) show that the kernel is able
> > > > > > to restore a highly fragmented memory state to a fairly
> > > > > > compacted state within <1 sec for a 32G system. Such data
> > > > > > suggests that a more proactive compaction can help us allocate
> > > > > > a large fraction of memory as hugepages while keeping
> > > > > > allocation latencies low.
> > > > > >
> > > > > > In general, compaction can introduce unexpected latencies for
> > > > > > applications that don't even have strong requirements for
> > > > > > contiguous allocations.
> > >
> > > Could you expand on this a bit please? Gfp flags allow expressing
> > > how hard the allocator should try to compact for a high-order
> > > allocation. Hugetlb allocations tend to require retrying and heavy
> > > compaction to succeed, and the success rate tends to be pretty high
> > > from my experience. Why is that not the case for you?
>
> The link to the driver you sent on gitlab is not working :(

Sorry about that, here's the correct link:
https://gitlab.com/nigupta/linux/snippets/1894161

> > Yes, I have the same observation: with `GFP_TRANSHUGE |
> > __GFP_RETRY_MAYFAIL` I get a very good success rate (~90% of free RAM
> > allocated as hugepages). However, what I'm trying to point out is
> > that this high success rate comes with high allocation latencies
> > (90th percentile latency of 2206us). On the same system, the same
> > high-order allocations which hit the fast path have latency <5us.
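To be concrete about how those latencies were measured: the numbers come
from timing allocations roughly like the sketch below. This is a minimal
illustration (the function name is mine), not the actual test driver:

#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Time a single order-9 (HPAGE_PMD_ORDER) allocation with the flags
 * discussed above. Sketch only; the real test loops over free RAM. */
static void time_one_hugepage_alloc(void)
{
	ktime_t start = ktime_get();
	struct page *page = alloc_pages(GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL,
					HPAGE_PMD_ORDER);
	s64 lat_us = ktime_us_delta(ktime_get(), start);

	pr_info("order-%d alloc %s, latency %lld us\n", HPAGE_PMD_ORDER,
		page ? "succeeded" : "failed", lat_us);
	if (page)
		__free_pages(page, HPAGE_PMD_ORDER);
}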
> > > > > > It is also hard to efficiently determine if the current
> > > > > > system state can be easily compacted due to mixing of
> > > > > > unmovable memory. Due to these reasons, automatic background
> > > > > > compaction by the kernel itself is hard to get right in a way
> > > > > > which does not hurt unsuspecting applications or waste CPU
> > > > > > cycles.
> > > > >
> > > > > We do trigger background compaction on high-order pressure from
> > > > > the page allocator by waking up kcompactd. Why is that not
> > > > > sufficient?
> > > >
> > > > Whenever kcompactd is woken up, it does just enough work to
> > > > create one free page of the given order
> > > > (compaction_control.order) or higher.
> > >
> > > This is an implementation detail IMHO. I am pretty sure we can do
> > > better auto-tuning when there is an indication of a constant flow
> > > of high-order requests. This is no different from memory reclaim in
> > > principle. Just because the kswapd auto-tuning does not fit your
> > > particular workload, you wouldn't want to export direct reclaim
> > > functionality and call it from a random module. That is just doomed
> > > to fail because having different subsystems in control just leads
> > > to decisions going against each other.
> >
> > I don't want to go the route of adding any auto-tuning/prediction
> > code to control compaction in the kernel. I'm more inclined towards
> > extending existing interfaces to allow compaction behavior to be
> > controlled either from userspace or a kernel driver. Letting a random
> > module control compaction or a root process pumping new tunables from
> > sysfs is the same in principle.
>
> Do you think a kernel module and a root user process have the same
> privileges? You can only export so much info to sysfs. Also, wouldn't
> this introduce more tunables, per-driver tunables to be more specific?

- sysfs is a narrow interface to kernel functions, not much different
  from the narrow interface I'm exporting, which is meant to be used
  directly by drivers that can themselves export sysfs/debugfs nodes.
- There are no per-driver tunables here.

> > This patch is in the spirit of a simple extension to the existing
> > compact_zone_order() which allows either a kernel driver or userspace
> > (through sysfs) to control compaction.
> >
> > Also, we should avoid drawing hard parallels between reclaim and
> > compaction: the former is often necessary for forward progress while
> > the latter is often an optimization. Since contiguous allocations are
> > mostly optimizations, it's good to expose hooks from the kernel that
> > let the user (through a driver or userspace) control compaction using
> > their own heuristics.
>
> How is compaction an optimization? If I have a memory zone which has
> more memory pages than zone_highwmark and higher-order allocations are
> failing because the memory is awfully fragmented, we need compaction to
> make further progress. I have seen workloads where kswapd won't help
> because memory is so awfully fragmented.
> The workload I am quoting is the thpscale workload from Mel Gorman's
> mmtests.

- You can usually (but not always) fall back to base pages when a
  higher-order alloc fails. Higher-order allocs are for reducing TLB
  pressure and for devices that cannot handle non-contiguous physical
  regions.
- kswapd is for memory reclaim only and cannot help with fragmentation.
- THP itself is an optimization and can be turned off.

> > I thought hard about what's lacking in the current userspace
> > interface (sysfs):
> > - /proc/sys/vm/compact_memory: full system compaction is not a viable
> >   pro-active compaction strategy.
>
> Don't we have a sysfs interface to compact memory per node through
> /sys/devices/system/node/node<node_number>/compact? CONFIG_SYSFS and
> CONFIG_NUMA are enabled on a lot of systems. Why are we not talking
> about this?
> I don't think kcompactd can go finer grained than per node. Per-zone is
> an option, but I feel that would be overkill.

I want pro-active compaction to somewhat hide higher-order allocation
latencies. Even full node compaction is too coarse for this purpose. The
goal is to keep fragmentation in check, i.e., within certain thresholds.
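To make "within certain thresholds" concrete: the metric I have in mind
is per-zone external fragmentation for a given order, i.e., the
percentage of free memory sitting in blocks too small to satisfy an
order-N allocation. A minimal sketch (zone_extfrag() is an illustrative
helper of mine, not an existing kernel symbol):

#include <linux/math64.h>
#include <linux/mmzone.h>

/* Percentage of free memory in blocks of order < order. 0 means all
 * free memory is already in large-enough blocks; 100 means none of it
 * is. Reads free_area counters racily; fine for a heuristic. */
static unsigned int zone_extfrag(struct zone *zone, unsigned int order)
{
	unsigned long free = 0, suitable = 0;
	unsigned int o;

	for (o = 0; o < MAX_ORDER; o++) {
		unsigned long blocks = zone->free_area[o].nr_free;

		free += blocks << o;
		if (o >= order)
			suitable += blocks << o;
	}
	if (!free)
		return 0;
	return div64_ul((u64)(free - suitable) * 100, free);
}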
> > - Possibly expose [low, high] threshold values for each node and let
> >   kcompactd act on them. This was the approach in my original patch
> >   linked earlier. The problem there is that it introduces too many
> >   tunables.
> >
> > Considering the above, I came up with this callback approach, which
> > makes it trivial to introduce user-specific policies for compaction.
> > It puts the onus of system stability and responsiveness in the hands
> > of the user, without burdening admins with more tunables or adding
> > crystal balls to the kernel.
>
> I have the same question as Michal: won't this cause conflicts among
> different subsystems? If you did answer it in your previous mails,
> could you point to it, as I may have missed it :)

There is no big harm if multiple drivers call compact_zone_order(). A
reasonable driver would call this interface to compact memory only to a
certain extent and only under specific conditions. If another driver
calls it in parallel, the other driver would simply see a well compacted
state and back off. It's also not hard for a driver to see that
compaction is not helping much, in which case it can again back off.
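As a sketch of that back-off behavior, reusing the illustrative
zone_extfrag() helper from above (the compact_zone_order() call is
simplified here and is not the exact signature from the patch):

/* Drive compaction for one zone/order until fragmentation drops below
 * a low threshold, and bail out as soon as a pass stops making
 * progress (e.g. another driver already compacted this zone). */
static void drive_zone_compaction(struct zone *zone, unsigned int order,
				  unsigned int extfrag_low)
{
	for (;;) {
		unsigned int before = zone_extfrag(zone, order);

		if (before <= extfrag_low)
			return;		/* already compacted enough */

		/* simplified; see the patch for the real arguments */
		compact_zone_order(zone, order);

		if (zone_extfrag(zone, order) >= before)
			return;		/* no progress: back off */
	}
}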
> > > > Such a design causes very high latency for workloads where we
> > > > want to allocate lots of hugepages in a short period of time.
> > > > With pro-active compaction we can hide much of this latency. For
> > > > some more background discussion and data, please see this thread:
> > > >
> > > > https://patchwork.kernel.org/patch/11098289/
> > >
> > > I am aware of that thread. And there are two things. You claim the
> > > allocation success rate is unnecessarily low and that the direct
> > > latency is high. You simply cannot assume both low latency and high
> > > success rate. Compaction is not free. Somebody has to do the work.
> > > Hiding it in the background means that you are eating a lot of
> > > cycles from everybody else (think of a workload running in a
> > > restricted cpu controller just doing a lot of work in an
> > > unaccounted context).
> > >
> > > That being said, you really have to be prepared to pay a price for
> > > a precious resource like high-order pages.
> > >
> > > On the other hand, I do understand that high latency is not really
> > > desired for more optimistic allocation requests with a reasonable
> > > fallback strategy. Those would benefit from kcompactd not giving up
> > > too early.
> >
> > Doing pro-active compaction in the background has merit in reducing
> > high-order alloc latency. It's true that it would end up burning
> > cycles with little benefit in some cases. It's up to the user of this
> > new interface to back off if it detects such a case.
>
> Are these cycles worth considering in the big picture of reducing high
> order allocation latency?

Yes, I think it's worth it.

> > > > > > Even with these caveats, pro-active compaction can still be
> > > > > > very useful in certain scenarios to reduce hugepage
> > > > > > allocation latencies. This callback interface allows drivers
> > > > > > to drive compaction based on their own policies, like the
> > > > > > current level of external fragmentation for a particular
> > > > > > order, system load, etc.
> > > > >
> > > > > So we do not trust the core MM to make a reasonable decision
> > > > > while we give a free ticket to modules. How does this make any
> > > > > sense at all? How is a random module going to make a more
> > > > > informed decision when it has less visibility on the overall MM
> > > > > situation?
> > > >
> > > > Embedding any specific policy (like: keep external fragmentation
> > > > for order-9 between 30-40%) within the MM core looks like a bad
> > > > idea.
> > >
> > > Agreed
> > >
> > > > As a driver, we can easily measure parameters like system load
> > > > and the current fragmentation level for any order in any zone to
> > > > make an informed decision.
> > > > See the thread I referred to above for more background
> > > > discussion.
> > >
> > > Do that from userspace then. If there is an insufficient interface
> > > to do that then let's talk about what is missing.
> >
> > Currently we only have a proc interface to do full system compaction.
> > Here's what's missing from this interface: the ability to set
> > per-node, per-zone, per-order [low, high] extfrag thresholds. This is
> > what I exposed in my earlier patch titled 'proactive compaction'. The
> > discussion there made me realize these are too many tunables and any
> > admin would always get them wrong. Even if the intended user of these
> > sysfs nodes is some monitoring daemon, it's tempting to mess with
> > them.
> >
> > With a callback extension to compact_zone_order(), implementing any
> > of the per-node, per-zone, per-order limits is straightforward, and
> > the driver can expose debugfs/sysfs nodes if needed at all. (The
> > nvcompact.c driver[1] exposes these tunables as debugfs nodes, for
> > example.)
> >
> > [1] https://gitlab.com/nigupta/linux/snippets/1894161
>
> Now you're proposing a system where we have interfaces from each
> driver. I feel that could be more confusing for a sysadmin to
> configure.
>
> But what you're proposing really made me think about what kind of
> tunables we want. Rather than just talking about tunables from the mm
> subsystem, can we introduce tunables that indicate the behaviour of
> workloads? Using this information from the user, we can look to
> optimize reclaim and compaction for that workload.
> If we have a tunable which can indicate that the kernel is running in
> an environment where the workload will be performing a lot of
> higher-order allocations, we can improve memory reclaim and compaction
> considering these parameters. One optimization I can think of is
> extending kcompactd to compact more memory when a higher-order
> allocation fails.
>
> One of the biggest issues with having a discussion on proactive
> reclaim/compaction is that the workloads are really unpredictable.
> Rather than working on tunables from the mm subsystem which help us
> take action on memory pressure, can we talk about interfaces to hint
> about workloads so that we can make better informed decisions in the
> mm subsystem rather than involving other drivers?

I'm not adding any tunables, just exposing an interface.

-Nitin
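P.S. To make "exposing an interface" concrete: with a
compaction-completion callback, a per-node/per-zone/per-order [low,
high] policy lives entirely in the driver. A rough sketch (the callback
shape is illustrative, not the exact signature from the patch;
zone_extfrag() is the illustrative helper from earlier in this mail):

/* Per-order thresholds kept by the driver, not by the MM core. */
struct frag_policy {
	unsigned int order;
	unsigned int extfrag_low;	/* stop compacting below this */
	unsigned int extfrag_high;	/* start compacting above this */
};

/* Callback passed to the extended compact_zone_order(): tell
 * compaction to stop once extfrag for the order of interest has
 * dropped to the low threshold. */
static bool frag_compaction_done(struct zone *zone, void *arg)
{
	struct frag_policy *p = arg;

	return zone_extfrag(zone, p->order) <= p->extfrag_low;
}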