Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

ndrw.xf@xxxxxxxxxxxxxx · Thu, 08 Aug 2019 18:57:02 +0100

On 8 August 2019 17:32:28 BST, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
>> Would it be possible to reserve a fixed (configurable) amount of RAM
>for caches,
>
>I am afraid there is nothing like that available and I would even argue
>it doesn't make much sense either. What would you consider to be a
>cache? A kernel/userspace reclaimable memory? What about any other in
>kernel memory users? How would you setup such a limit and make it
>reasonably maintainable over different kernel releases when the memory
>footprint changes over time?

Frankly, I don't know. The earlyoom userspace tool works well enough for me so I assumed this functionality could be implemented in kernel. Default thresholds would have to be tested but it is unlikely zero is the optimum value. 

>Besides that how does that differ from the existing reclaim mechanism?
>Once your cache hits the limit, there would have to be some sort of the
>reclaim to happen and then we are back to square one when the reclaim
>is
>making progress but you are effectively treshing over the hot working
>set (e.g. code pages)

By forcing OOM killer. Reclaiming memory when system becomes unresponsive is precisely what I want to avoid.

>> and trigger OOM killer earlier, before most UI code is evicted from
>memory?
>
>How does the kernel knows that important memory is evicted?

I assume current memory management policy (LRU?) is sufficient to keep most frequently used pages in memory.

>If you know which task is that then you can put it into a memory cgroup
>with a stricter memory limit and have it killed before the overal
>system
>starts suffering.

This is what I intended to use. But I don't know how to bypass SystemD or configure such policies via SystemD. 

>PSI is giving you a matric that tells you how much time you
>spend on the memory reclaim. So you can start watching the system from
>lower utilization already.

This is a fantastic news. Really. I didn't know this is how it works. Two potential issues, though:
1. PSI (if possible) should be normalised wrt the memory reclaiming cost (SSDs have lower cost than HDDs). If not automatically then perhaps via a user configurable option. That's somewhat similar to having configurable PSI thresholds. 
2. It seems PSI measures the _rate_ pages are evicted from memory. While this may correlate with the _absolute_ amount of of memory left, it is not the same. Perhaps weighting PSI with absolute amount of memory used for caches would improve this metric.

Best regards,
ndrw