Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the working set

Michal Hocko <mhocko@xxxxxxxx> · Mon, 6 Dec 2021 10:59:55 +0100

On Fri 03-12-21 22:27:10, Alexey Avramov wrote:
> >I'd also like to know where that malfunction happens in this case.
> 
> User-space processes need to always access shared libraries to work.
> It can be tens or hundreds of megabytes, depending on the type of workload. 
> This is a hot cache; when it is pushed out and then read back in, it leads to thrashing. 
> There is no way in the kernel to forbid evicting the minimum file cache. 
> This is the problem that the patch solves. And the malfunction is exactly
> that - the inability of the kernel to hold the minimum amount of the
> hottest cache in memory.

Executable pages are already a protected resource, see
page_check_references. Shared libraries have more page tables pointing
to them, so they are more likely to be referenced and thus kept around.
What is the other memory demand that pushes those away and causes
thrashing?

I do agree with Vlastimil that we should be addressing these problems
rather than papering over them with limits nobody will know how to set
up properly, so that we will have to deal with all sorts of
misconfigured systems. I have first-hand experience with that in the
form of the page cache limit that we used to have in older SLES kernels.

[...]
> > The problem with PSI sensing is that it works after the fact (after 
> > the freeze has already occurred). It is not very different from issuing 
> > SysRq-f manually on a frozen system, although it would still be a 
> > handy feature for batched tasks and remote access. 
> 
> but Michal Hocko immediately criticized [7] the proposal unfairly. 
> This patch just implements ndrw's suggestion.

It would be more productive if you were more specific about what you
consider an unfair criticism. Thrashing is a real problem and we all
recognize that. We have much better tools in our tool box these days
(refault data for both page cache and swapped back memory). The kernel
itself is rather conservative when using that data for OOM situations
because historically users were more concerned about premature OOM
killer invocations, which are a disruptive action.
For those who prefer a very agile OOM policy, there are userspace tools
which can implement more advanced policies.
I am open to any idea to improve the kernel side of things as well.
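
To illustrate the userspace route mentioned above, here is a minimal
sketch of how such a tool might read memory pressure (PSI) from
/proc/pressure/memory and decide whether to act. The helper names
parse_psi and should_act, and the 10% "full avg10" threshold, are
assumptions made up for this example and are not taken from any existing
tool; a real policy would need tuning for the workload.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/memory.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns e.g. {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def should_act(psi, full_avg10_limit=10.0):
    """Return True when the 'full' 10s stall average reaches the limit.

    full_avg10_limit is a hypothetical threshold chosen for illustration.
    """
    return psi.get("full", {}).get("avg10", 0.0) >= full_avg10_limit


if __name__ == "__main__":
    try:
        with open("/proc/pressure/memory") as f:
            psi = parse_psi(f.read())
        print(psi, "act" if should_act(psi) else "ok")
    except OSError as err:
        # PSI may be unavailable (older kernel or CONFIG_PSI disabled).
        print("PSI not available:", err)
```

A daemon built along these lines would poll (or use PSI triggers) and
pick a victim by its own policy, which keeps the policy decision out of
the kernel.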

As mentioned above, I am against global knobs that special-case global
memory reclaim, because that leads to inconsistencies with memcg
reclaim, adds a future maintenance burden and, most importantly,
outsources responsibility to admins, who will have a hard time knowing
the proper values for those knobs, effectively pushing them towards all
sorts of cargo cult.

> [0] https://serverfault.com/a/319818
> [1] https://github.com/hakavlad/prelockd
> 
> [2] https://www.youtube.com/watch?v=vykUrP1UvcI
>     On this video: running fast memory hog in a loop on Debian 10 GNOME, 
>     4 GiB MemTotal without swap space. FS is ext4 on *HDD*.
>     - 1. prelockd enabled: about 500 MiB mlocked. Starting 
>         `while true; do tail /dev/zero; done`: no freezes. 
>         The OOM killer comes quickly, the system recovers quickly.
>     - 2. prelockd disabled: system hangs.
> 
> [3] https://www.youtube.com/watch?v=g9GCmp-7WXw
> [4] https://www.youtube.com/watch?v=iU3ikgNgp3M
> [5] Let's talk about the elephant in the room - the Linux kernel's 
>     inability to gracefully handle low memory pressure
>     https://lore.kernel.org/all/d9802b6a-949b-b327-c4a6-3dbca485ec20@xxxxxxx/
> [6] https://lore.kernel.org/all/806F5696-A8D6-481D-A82F-49DEC1F2B035@xxxxxxxxxxxxxx/
> [7] https://lore.kernel.org/all/20190808163228.GE18351@xxxxxxxxxxxxxx/

-- 
Michal Hocko
SUSE Labs