Hi Shakeel,

On Tue, Apr 23, 2019 at 08:30:46AM -0700, Shakeel Butt wrote:
> Though this is quite late, I still want to propose a topic for
> discussion during LSFMM'19 which I think will be beneficial for
> Linux users in general, but particularly for data center users who
> run a range of different workloads and want to reduce the memory
> cost.
>
> Topic: Proactive Memory Reclaim
>
> Motivation/Problem: Memory overcommit is the technique most commonly
> used by large infrastructure owners to reduce the cost of memory.
> However, memory overcommit can adversely impact the performance of
> latency-sensitive applications by triggering direct memory reclaim.
> Direct reclaim is unpredictable and disastrous for latency-sensitive
> applications.
>
> Solution: Proactively reclaim memory from the system to drastically
> reduce the occurrences of direct reclaim. Target cold memory to keep
> the refault rate of the applications acceptable (i.e. no impact on
> their performance).
>
> Challenges:
> 1. Tracking cold memory efficiently.
> 2. Lack of infrastructure to reclaim specific memory.
>
> Details: The existing "Idle Page Tracking" interface allows tracking
> cold memory on a system, but it becomes prohibitively expensive as
> the machine size grows. Also, there is no way from user space to
> reclaim a specific 'cold' page. I want to present our implementation
> of cold memory tracking and reclaim. The aim is to make it generally
> beneficial to a lot more users and to upstream it.
>
> More details:
> "Software-Defined Far Memory in Warehouse-Scale Computers", ASPLOS'19.
> https://youtu.be/aKddds6jn1s

I would be very interested to hear about this as well.

As Rik mentions, I've been working on a way to determine the "true"
memory working sets of our workloads. I'm using a pressure feedback
loop of psi and dynamically adjusted cgroup limits to harness the
kernel's LRU/clock algorithm to sort out what's cold and what isn't.

This does use direct reclaim, but since psi quantifies the exact time
cost of that, it backs off before our SLAs are violated. Of course,
if necessary, this work could easily be punted to a kthread or
something.

The additional refault IO also has not been a problem in practice for
us so far, since our pressure parameters are fairly conservative. But
that is a bit harder to manage - by the time you experience those
refaults, you might have already oversteered. This is where
compression could help reduce the cost of being aggressive.

That said, even with conservative settings I've managed to shave off
25-30% of the memory footprint of common interactive jobs without
affecting their performance. I suspect that in many workloads
(depending on the exact slope of their access-locality bell curve)
shaving off more would require a disproportionately larger amount of
pressure/CPU/IO, and so might not be worthwhile.

Anyway, I'd love to hear your insights on this.
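
PS: In case it helps make the above concrete, here is a bare-bones
sketch of the kind of feedback loop I mean, assuming cgroup2 with the
memory controller and adjusting memory.high based on the group's
memory.pressure. The cgroup path, pressure target, step size and
interval below are made up for illustration; the real thing needs
more careful backoff and error handling:

  # Hypothetical psi-driven memory.high feedback loop for one cgroup2
  # group. All paths, targets and step sizes are illustrative only.
  import time

  CGROUP = "/sys/fs/cgroup/workload.slice"  # example group (made up)
  PRESSURE_TARGET = 0.1   # tolerated "some" avg10 stall %, assumed budget
  STEP = 16 << 20         # adjust memory.high in 16M increments (arbitrary)
  INTERVAL = 6            # seconds between adjustments (arbitrary)

  def read_avg10(path):
      # memory.pressure lines look like:
      # "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
      with open(path) as f:
          for line in f:
              if line.startswith("some"):
                  return float(line.split()[1].split("=")[1])
      return 0.0

  def read_bytes(path):
      with open(path) as f:
          val = f.read().strip()
      return None if val == "max" else int(val)

  def write_bytes(path, val):
      with open(path, "w") as f:
          f.write(str(val))

  while True:
      pressure = read_avg10(CGROUP + "/memory.pressure")
      current = read_bytes(CGROUP + "/memory.current")
      high = read_bytes(CGROUP + "/memory.high") or current

      if pressure < PRESSURE_TARGET:
          # No significant stalls: tighten the limit below current
          # usage to make the LRU sort out and reclaim cold pages.
          high = max(current - STEP, STEP)
      else:
          # Paying too much reclaim/refault time: back off before
          # the workload's SLAs are affected.
          high = high + STEP

      write_bytes(CGROUP + "/memory.high", high)
      time.sleep(INTERVAL)

The reason for using memory.high rather than memory.max in a sketch
like this is that breaching memory.high only throttles and reclaims
the group instead of invoking the OOM killer, so overshooting the
probe is recoverable.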