Re: [LSF/MM TOPIC] Proactive Memory Reclaim

Shakeel Butt <shakeelb@xxxxxxxxxx> · Tue, 23 Apr 2019 10:04:19 -0700

On Tue, Apr 23, 2019 at 9:08 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Tue, 2019-04-23 at 08:30 -0700, Shakeel Butt wrote:
>
> > Topic: Proactive Memory Reclaim
> >
> > Motivation/Problem: Memory overcommit is the most commonly used
> > technique to reduce the cost of memory by large infrastructure
> > owners. However, memory overcommit can adversely impact the
> > performance of latency-sensitive applications by triggering direct
> > memory reclaim. Direct reclaim is unpredictable and disastrous for
> > latency-sensitive applications.
>
> This sounds similar to a project Johannes has
> been working on, except he is not tracking which
> memory is idle at all, but only the pressure on
> each cgroup, through the PSI interface:
>
> https://facebookmicrosites.github.io/psi/docs/overview
>

I think the two techniques are orthogonal and can be used concurrently.
Proactive reclaim frees memory ahead of time in the hope that we never
enter direct reclaim. In the worst case, if we do trigger direct
reclaim, we can use PSI to detect early when to give up on reclaim and
trigger an oom-kill.
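
To make the PSI half of that concrete, here is a minimal sketch of the
"give up on reclaim" check. It parses the text format that PSI exposes
(for example in /proc/pressure/memory, or memory.pressure in a cgroup2
directory) and applies a threshold to the "full" avg10 value. The
function names, the 40% limit, and the sample numbers are assumptions
made up for illustration; they are not taken from any existing tool,
and a real policy would need tuning per workload.

```python
def parse_psi(text):
    """Parse PSI pressure-file text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=1.23" or "total=4567".
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


def should_give_up(psi, full_avg10_limit=40.0):
    """Hypothetical policy: stop reclaiming once all tasks were stalled
    on memory for at least full_avg10_limit percent of the last 10s."""
    return psi["full"]["avg10"] >= full_avg10_limit


# Sample contents in the PSI file format (the values are invented).
sample = (
    "some avg10=55.20 avg60=20.11 avg300=5.00 total=123456\n"
    "full avg10=45.00 avg60=12.00 avg300=3.10 total=65432\n"
)

psi = parse_psi(sample)
give_up = should_give_up(psi)
```

In practice the text would be read from the pressure file of the cgroup
being reclaimed from, and a True result would be the signal to stop
reclaiming and fall back to oom-kill.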

Another thing I want to point out is our usage model: this proactive
memory reclaim is transparent to the jobs. The admin (infrastructure
owner) uses proactive reclaim to create more schedulable memory
transparently to the job owners.

> Discussing the pros and cons, and experiences with
> both approaches seems like a useful topic. I'll add
> it to the agenda.
>

Thanks a lot.
Shakeel