On Tue, Mar 08, 2022 at 01:53:19PM +0100, Michal Hocko wrote:
> On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> > On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > > [...]
> > > > > Some questions to get discussion going:
> > > > >
> > > > > - Overall feedback or suggestions for the proposal in general?
> > >
> > > > Do we really need this interface? What would be usecases which cannot
> > > > use an existing interfaces we have for that? Most notably memcg and
> > > > their high limit?
> > >
> > >
> > > Let me take a stab at this. The specific reasons why high limit is not a
> > > good interface to implement proactive reclaim:
> > >
> > > 1) It can cause allocations from the target application to get
> > > throttled.
> > >
> > > 2) It leaves a state (high limit) in the kernel which needs to be reset
> > > by the userspace part of proactive reclaimer.
> > >
> > > If I remember correctly, Facebook actually tried to use high limit to
> > > implement the proactive reclaim but due to exactly these limitations [1]
> > > they went the route [2] aligned with this proposal.
> > >
> > > To further explain why the above limitations are pretty bad: The
> > > proactive reclaimers usually use feedback loop to decide how much to
> > > squeeze from the target applications without impacting their performance
> > > or impacting within a tolerable range. The metrics used for the feedback
> > > loop are either refaults or PSI and these metrics becomes messy due to
> > > application getting throttled due to high limit.
> > >
> > > For (2), the high limit interface is a very awkward interface to use to
> > > do proactive reclaim.
> > > If the userspace proactive reclaimer fails/crashed
> > > due to whatever reason during triggering the reclaim in an application,
> > > it can leave the application in a bad state (memory pressure state and
> > > throttled) for a long time.
> >
> > Yes.
> >
> > In addition to the proactive reclaimer crashing, we also had problems
> > of it simply not responding quickly enough.
> >
> > Because there is a delay between reclaim (action) and refaults
> > (feedback), there is a very real upper limit of pages you can
> > reasonably reclaim per second, without risking pressure spikes that
> > far exceed tolerances. A fixed memory.high limit can easily exceed
> > that safe reclaim rate when the workload expands abruptly. Even if the
> > proactive reclaimer process is alive, it's almost impossible to step
> > between a rapidly allocating process and its cgroup limit in time.
> >
> > The semantics of writing to memory.high also require that the new
> > limit is met before returning to userspace. This can take a long time,
> > during which the reclaimer cannot re-evaluate the optimal target size
> > based on observed pressure. We routinely saw the reclaimer get stuck
> > in the kernel hammering a suffering workload down to a stale target.
> >
> > We tried for quite a while to make this work, but the limit semantics
> > turned out to not be a good fit for proactive reclaim.
>
> Thanks for sharing your experience, Johannes. This is a useful insight.

Just to add another issue with memory.high - there's a race window
between reading memory.current and setting memory.high if you want to
reclaim just a little bit of memory. On a fast expanding workload this
could result in reclaiming much more than intended.

>
> > A mechanism to request a fixed number of pages to reclaim turned out
> > to work much, much better in practice. We've been using a simple
> > per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).
>
> Could you share more details here please?
> How have you managed to find
> the reclaim target and how have you overcome challenges to react in time
> to have some head room for the actual reclaim?

We have a userspace agent that just repeatedly triggers proactive
reclaim and monitors PSI metrics to maintain some constant but low
pressure. In the complete absence of pressure we will reclaim some
configurable percentage of the workload's memory. This reclaim amount
tapers down to zero as PSI approaches the target threshold.

I don't follow your question regarding head-room. Could you elaborate?
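
To make the taper described above a bit more concrete, here is a rough
sketch of how one step's reclaim amount could be computed. This is not
the actual agent: the function names are hypothetical, the linear shape
of the taper is an assumption (the text above only says the amount
reaches zero at the target), and reading the "some avg10" figure from
PSI-formatted text is just one plausible choice of metric.

```python
def parse_psi_some_avg10(pressure_text: str) -> float:
    """Return the 'some avg10' percentage from PSI-formatted text.

    Expects lines of the form:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    for line in pressure_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' field found")


def reclaim_target(current_bytes: int, psi: float, psi_target: float,
                   max_fraction: float) -> int:
    """Bytes to request in one step of the loop.

    With zero pressure this is max_fraction of the workload's memory.
    It shrinks linearly (an assumed shape) and is zero once psi
    reaches or exceeds psi_target.
    """
    if psi_target <= 0:
        return 0
    headroom = max(0.0, 1.0 - psi / psi_target)
    return int(current_bytes * max_fraction * headroom)
```

An agent built this way would read the cgroup's usage and pressure,
call reclaim_target(), hand the result to whatever per-cgroup reclaim
knob is available, and then sleep and repeat, so that each step is
sized from fresh pressure data rather than a stale target.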