On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote:
> On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> > > [...]
> > > > My take is that a proactive reclaim feature, whose goal is never to
> > > > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > > > would ideally have:
> > > >
> > > > - a pressure or size target specified by userspace but with
> > > >   enforcement driven inside the kernel from the allocation path
> > > >
> > > > - the enforcement work NOT be done synchronously by the workload
> > > >   (something I'd argue we want for *all* memory limits)
> > > >
> > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> > > >   cgroup's memory allocations causing the work (again something I'd
> > > >   argue we want in general)
> > > >
> > > > - a delegatable knob that is independent of setting the maximum size
> > > >   of a container, as that expresses a different type of policy
> > > >
> > > > - if size target, self-limiting (ha) enforcement on a pressure
> > > >   threshold or stop enforcement when the userspace component dies
> > > >
> > > > Thoughts?
> > >
> > > Agreed with above points. What do you think about
> > > http://lkml.kernel.org/r/20200922190859.GH12990@xxxxxxxxxxxxxx.
> >
> > I definitely agree with what you wrote in this email for background
> > reclaim. Indeed, your description sounds like what I proposed in
> > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@xxxxxxxxxxx/
> > - what's missing from that patch is proper work attribution.
> >
> > > I assume that you do not want to override memory.high to implement
> > > this because that tends to be tricky from the configuration POV as
> > > you mentioned above. But a new limit (memory.middle for lack of a
> > > better name) to define the background reclaim sounds like a good fit
> > > with the above points.
> >
> > I can see that with a new memory.middle you could kind of sort of do
> > both - background reclaim and proactive reclaim.
> >
> > That said, I do see advantages in keeping them separate:
> >
> > 1. Background reclaim is essentially an allocation optimization that
> >    we may want to provide per default, just like kswapd.
> >
> >    Kswapd is tweakable of course, but I think few users actually do,
> >    and it works pretty well out of the box. It would be nice to
> >    provide the same thing on a per-cgroup basis per default and not
> >    ask users to make decisions that we are generally better at making.
> >
> > 2. Proactive reclaim may actually be better configured through a
> >    pressure threshold rather than a size target.
> >
> >    As per above, the goal is not to be punitive or containing. The
> >    goal is to keep the LRUs warm and move the colder pages to disk.
> >
> >    But how aggressively do you run reclaim for this purpose? What
> >    target value should a user write to such a memory.middle file?
> >
> >    For one, it depends on the job. A batch job, or a less important
> >    background job, may tolerate higher paging overhead than an
> >    interactive job. That means more of its pages could be trimmed from
> >    RAM and reloaded on-demand from disk.
> >
> >    But also, it depends on the storage device. If you move a workload
> >    from a machine with a slow disk to a machine with a fast disk, you
> >    can page more data in the same amount of time. That means while
> >    your workload's tolerance stays the same, the faster the disk, the
> >    more aggressively you can do reclaim and offload memory.
> >
> >    So again, what should a user write to such a control file?
> >
> >    Of course, you can approximate an optimal target size for the
> >    workload. You can run a manual workingset analysis with page_idle,
> >    damon, or similar, determine a hot/cold cutoff based on what you
> >    know about the storage characteristics, then echo a number of pages
> >    or a size target into a cgroup file and let the kernel do the
> >    reclaim accordingly. The drawbacks are that the kernel LRU may do a
> >    different hot/cold classification than you did and evict the wrong
> >    pages, the storage device latencies may vary based on the overall IO
> >    pattern, and two equally warm pages may have very different paging
> >    overhead depending on whether readahead can avert a major fault or
> >    not. So it's easy to overshoot the tolerance target and disrupt the
> >    workload, or undershoot and have stale LRU data, waste memory, etc.
> >
> >    You can also do a feedback loop, where you guess an optimal size,
> >    then adjust based on how much paging overhead the workload is
> >    experiencing, i.e. memory pressure. The drawbacks are that you have
> >    to monitor pressure closely and react quickly when the workload is
> >    expanding, as it can be potentially sensitive to latencies in the
> >    usec range. This can be tricky to do from userspace.
>
> This is actually what we do in our production, i.e. a feedback loop to
> adjust the next iteration of proactive reclaim.

That's what we do also right now. It works reasonably well. The only two
pain points are/have been the reaction time under quick workload
expansion, and inadvertently forcing the workload into direct reclaim.
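For illustration, a minimal sketch of such a userspace feedback loop,
assuming cgroup2 is mounted at /sys/fs/cgroup and (ab)using memory.high
as the reclaim knob; the cgroup path, tolerance and step size below are
made up, and a real tool would be considerably more careful:

/*
 * Illustrative proactive reclaim loop: watch the cgroup's memory PSI
 * and, as long as the observed stall time stays below a tolerance,
 * shrink the group via memory.high; back off as soon as pressure
 * exceeds it. A real tool would also reset memory.high (e.g. to "max")
 * between passes so a quickly expanding workload isn't trapped in
 * direct reclaim - exactly the pain point mentioned above.
 */
#include <stdio.h>
#include <unistd.h>

#define CG        "/sys/fs/cgroup/workload"   /* made-up cgroup path */
#define TOLERANCE 0.5                         /* "some" avg10 we accept, in % */
#define STEP      (16UL << 20)                /* adjust in 16M increments */

static double some_avg10(void)
{
        char buf[256];
        double avg10 = 0;
        FILE *f = fopen(CG "/memory.pressure", "r");

        /* first line: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
        if (f && fgets(buf, sizeof(buf), f))
                sscanf(buf, "some avg10=%lf", &avg10);
        if (f)
                fclose(f);
        return avg10;
}

static unsigned long current_usage(void)
{
        unsigned long bytes = 0;
        FILE *f = fopen(CG "/memory.current", "r");

        if (f) {
                if (fscanf(f, "%lu", &bytes) != 1)
                        bytes = 0;
                fclose(f);
        }
        return bytes;
}

static void set_high(unsigned long bytes)
{
        FILE *f = fopen(CG "/memory.high", "w");

        if (f) {
                fprintf(f, "%lu\n", bytes);
                fclose(f);
        }
}

int main(void)
{
        for (;;) {
                unsigned long usage = current_usage();

                if (some_avg10() < TOLERANCE)
                        set_high(usage > STEP ? usage - STEP : usage);
                else
                        set_high(usage + STEP);
                sleep(10);
        }
}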
> We eliminated the IO or slow disk issues you mentioned by only
> focusing on anon memory and doing zswap.

Interesting, may I ask how the file cache is managed in this setup?

> > So instead of asking users for a target size whose suitability
> > heavily depends on the kernel's LRU implementation, the readahead
> > code, the IO device's capability and general load, why not directly
> > ask the user for a pressure level that the workload is comfortable
> > with and which captures all of the above factors implicitly? Then
> > let the kernel do this feedback loop from a per-cgroup worker.
>
> I am assuming here that by pressure level you are referring to a
> PSI-like interface, e.g. allowing users to specify that X amount of
> stall in a fixed time window is tolerable for their jobs.

Right, essentially the same parameters that psi poll() would take.
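To make that concrete, a sketch of the existing psi trigger interface
from userspace (the cgroup path and the numbers are made up): it
requests a wakeup whenever the group accumulates more than 100ms of
memory stall time within any 1s window, which is the kind of parameter
a per-cgroup reclaim worker could take as its target.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* "some" stall of 100ms per 1s window */
        const char trig[] = "some 100000 1000000";
        struct pollfd fds;
        int fd;

        fd = open("/sys/fs/cgroup/workload/memory.pressure",
                  O_RDWR | O_NONBLOCK);
        if (fd < 0) {
                perror("open memory.pressure");
                return 1;
        }
        /* register the trigger; it stays armed as long as fd is open */
        if (write(fd, trig, strlen(trig) + 1) < 0) {
                perror("write trigger");
                return 1;
        }

        fds.fd = fd;
        fds.events = POLLPRI;

        for (;;) {
                if (poll(&fds, 1, -1) < 0) {
                        perror("poll");
                        return 1;
                }
                if (fds.revents & POLLERR) {
                        fprintf(stderr, "trigger went away\n");
                        return 1;
                }
                if (fds.revents & POLLPRI)
                        printf("memory pressure threshold crossed\n");
        }
}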