Re: [RFC] Mechanism to induce memory reclaim

On Mon 07-03-22 15:26:18, Johannes Weiner wrote:
> On Mon, Mar 07, 2022 at 06:31:41PM +0000, Shakeel Butt wrote:
> > On Mon, Mar 07, 2022 at 03:41:45PM +0100, Michal Hocko wrote:
> > > On Sun 06-03-22 15:11:23, David Rientjes wrote:
> > > [...]
> > > > Some questions to get discussion going:
> > > >
> > > >  - Overall feedback or suggestions for the proposal in general?
> > 
> > > Do we really need this interface? What are the use cases that cannot
> > > be served by the existing interfaces? Most notably memcg and
> > > its high limit?
> > 
> > 
> > Let me take a stab at this. The specific reasons why high limit is not a
> > good interface to implement proactive reclaim:
> > 
> > 1) It can cause allocations from the target application to get
> > throttled.
> > 
> > 2) It leaves state (the high limit) in the kernel which needs to be
> > reset by the userspace part of the proactive reclaimer.
> > 
> > If I remember correctly, Facebook actually tried to use high limit to
> > implement the proactive reclaim but due to exactly these limitations [1]
> > they went the route [2] aligned with this proposal.
> > 
> > To further explain why the above limitations are pretty bad: The
> > proactive reclaimers usually use a feedback loop to decide how much to
> > squeeze from the target applications without impacting their performance,
> > or while keeping the impact within a tolerable range. The metrics used for
> > the feedback loop are either refaults or PSI, and these metrics become
> > messy when the application gets throttled due to the high limit.
> > 
> > For (2), the high limit is a very awkward interface to use for
> > proactive reclaim. If the userspace proactive reclaimer fails or crashes
> > for whatever reason while triggering reclaim in an application,
> > it can leave the application in a bad state (under memory pressure and
> > throttled) for a long time.
> 
> Yes.
> 
> In addition to the proactive reclaimer crashing, we also had problems
> of it simply not responding quickly enough.
> 
> Because there is a delay between reclaim (action) and refaults
> (feedback), there is a very real upper limit of pages you can
> reasonably reclaim per second, without risking pressure spikes that
> far exceed tolerances. A fixed memory.high limit can easily exceed
> that safe reclaim rate when the workload expands abruptly. Even if the
> proactive reclaimer process is alive, it's almost impossible to step
> between a rapidly allocating process and its cgroup limit in time.
> 
> The semantics of writing to memory.high also require that the new
> limit is met before returning to userspace. This can take a long time,
> during which the reclaimer cannot re-evaluate the optimal target size
> based on observed pressure. We routinely saw the reclaimer get stuck
> in the kernel hammering a suffering workload down to a stale target.
> 
> We tried for quite a while to make this work, but the limit semantics
> turned out to not be a good fit for proactive reclaim.

Thanks for sharing your experience, Johannes. This is a useful insight.
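
For readers following along, the PSI feedback discussed above is exposed
per cgroup in the cgroup v2 memory.pressure file. Below is a minimal
sketch of parsing that text in Python. The function name parse_psi and
the sample values are made up for illustration; only the
"some"/"full" line format is taken from the kernel's PSI interface.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}}.

    Each input line looks like:
        some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
    avg* fields are percentages (floats); total is microseconds (int).
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Illustrative sample, not a real measurement.
sample = (
    "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=7890\n"
)
psi = parse_psi(sample)
```

A reclaimer would typically read this from the target cgroup's directory
and watch one of the "some" averages as its feedback signal.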

> A mechanism to request a fixed number of pages to reclaim turned out
> to work much, much better in practice. We've been using a simple
> per-cgroup knob (like here: https://lkml.org/lkml/2020/9/9/1094).

Could you share more details here please? How have you managed to find
the reclaim target and how have you overcome challenges to react in time
to have some head room for the actual reclaim?
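
To make the question concrete, here is one possible shape of such a loop
step, sketched in Python. This is only an assumption about how a
fixed-amount reclaim request could be sized, not a description of what
Facebook actually runs. The names next_reclaim_bytes, target_pressure and
max_ratio are hypothetical, and the 1% cap is an arbitrary placeholder.

```python
def next_reclaim_bytes(usage_bytes, pressure, target_pressure,
                       max_ratio=0.01):
    """Return how many bytes to ask the kernel to reclaim this interval.

    usage_bytes:     current memory usage of the cgroup
    pressure:        observed pressure (e.g. a PSI "some" average)
    target_pressure: tolerated pressure, must be > 0
    max_ratio:       hypothetical cap on the fraction of usage
                     requested per interval
    """
    if pressure >= target_pressure:
        # Already at or above tolerance: back off completely.
        return 0
    # Scale the request down linearly as pressure approaches the target.
    headroom = 1.0 - pressure / target_pressure
    return int(usage_bytes * max_ratio * headroom)


# Example: 1 GiB cgroup, half of the tolerated pressure already observed.
request = next_reclaim_bytes(1 << 30, pressure=0.05, target_pressure=0.1)
```

Because each step requests a bounded amount and leaves no limit behind, a
crashed or slow reclaimer simply stops reclaiming instead of leaving the
workload throttled.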
 
> With tiered memory systems coming up, I can see the need for
> restricting to specific numa nodes. Demoting from DRAM to CXL has a
> different cost function than evicting RAM/CXL to storage, and those
> two things probably need to happen at different rates.

Yes, in the absence of per-node watermarks I can see how a per-node
reclaim trigger could be useful. The question is whether a per-node
wmark interface wouldn't be a better fit.

-- 
Michal Hocko
SUSE Labs



