Hi Shakeel,

On Tue, Apr 23, 2019 at 08:30:46AM -0700, Shakeel Butt wrote:
> Though this is quite late, I still want to propose a topic for
> discussion during LSFMM'19 which I think will be beneficial for
> Linux users in general, but particularly for data center users who
> run a range of different workloads and want to reduce the memory
> cost.
>
> Topic: Proactive Memory Reclaim
>
> Motivation/Problem: Memory overcommit is the technique most commonly
> used by large infrastructure owners to reduce the cost of memory.
> However, memory overcommit can adversely impact the performance of
> latency-sensitive applications by triggering direct memory reclaim.
> Direct reclaim is unpredictable and disastrous for latency-sensitive
> applications.
>
> Solution: Proactively reclaim memory from the system to drastically
> reduce the occurrences of direct reclaim. Target cold memory to keep
> the refault rate of the applications acceptable (i.e. no impact on
> their performance).
>
> Challenges:
> 1. Tracking cold memory efficiently.
> 2. Lack of infrastructure to reclaim specific memory.
>
> Details: The existing "Idle Page Tracking" interface allows tracking
> cold memory on a system, but it becomes prohibitively expensive as
> the machine size grows. Also, there is no way from user space to
> reclaim a specific 'cold' page. I want to present our implementation
> of cold memory tracking and reclaim. The aim is to make it generally
> beneficial to a lot more users and to upstream it.
>
> More details:
> "Software-Defined Far Memory in Warehouse-Scale Computers", ASPLOS'19.
> https://youtu.be/aKddds6jn1s

I would be very interested to hear about this as well.

As Rik mentions, I've been working on a way to determine the "true"
memory working sets of our workloads. I'm using a pressure feedback
loop of psi and dynamically adjusted cgroup limits to harness the
kernel's LRU/clock algorithm to sort out what's cold and what isn't.

This does use direct reclaim, but since psi quantifies the exact time
cost of that, it backs off before our SLAs are violated. Of course,
if necessary, this work could easily be punted to a kthread or
something.

The additional refault IO also has not been a problem in practice for
us so far, since our pressure parameters are fairly conservative. But
that is a bit harder to manage - by the time you experience those
refaults, you might have already oversteered. This is where
compression could help reduce the cost of being aggressive.

That said, even with conservative settings I've managed to shave off
25-30% of the memory footprint of common interactive jobs without
affecting their performance. I suspect that in many workloads
(depending on the exact slope of their access-locality bell curve)
shaving off more would require a disproportionately larger amount of
pressure/CPU/IO, and so might not be worthwhile.

Anyway, I'd love to hear your insights on this.
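
PS: In case it helps make the above concrete, here is a bare-bones
sketch of the kind of feedback loop I mean, assuming cgroup2 with the
memory controller and adjusting memory.high based on the group's
memory.pressure. The cgroup path, pressure target, step size and
interval below are made up for illustration; the real thing needs
more careful backoff and error handling:

  # Hypothetical psi-driven memory.high feedback loop for one cgroup2
  # group. All paths, targets and step sizes are illustrative only.
  import time

  CGROUP = "/sys/fs/cgroup/workload.slice"  # example group (made up)
  PRESSURE_TARGET = 0.1   # tolerated "some" avg10 stall %, assumed budget
  STEP = 16 << 20         # adjust memory.high in 16M increments (arbitrary)
  INTERVAL = 6            # seconds between adjustments (arbitrary)

  def read_avg10(path):
      # memory.pressure lines look like:
      # "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
      with open(path) as f:
          for line in f:
              if line.startswith("some"):
                  return float(line.split()[1].split("=")[1])
      return 0.0

  def read_bytes(path):
      with open(path) as f:
          val = f.read().strip()
      return None if val == "max" else int(val)

  def write_bytes(path, val):
      with open(path, "w") as f:
          f.write(str(val))

  while True:
      pressure = read_avg10(CGROUP + "/memory.pressure")
      current = read_bytes(CGROUP + "/memory.current")
      high = read_bytes(CGROUP + "/memory.high") or current

      if pressure < PRESSURE_TARGET:
          # No significant stalls: tighten the limit below current
          # usage to make the LRU sort out and reclaim cold pages.
          high = max(current - STEP, STEP)
      else:
          # Paying too much reclaim/refault time: back off before
          # the workload's SLAs are affected.
          high = high + STEP

      write_bytes(CGROUP + "/memory.high", high)
      time.sleep(INTERVAL)

The reason for using memory.high rather than memory.max in a sketch
like this is that breaching memory.high only throttles and reclaims
the group instead of invoking the OOM killer, so overshooting the
probe is recoverable.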