Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

Johannes Weiner <hannes@xxxxxxxxxxx> · Mon, 5 Aug 2019 14:55:42 -0400

On Mon, Aug 05, 2019 at 03:31:19PM +0200, Michal Hocko wrote:
> On Mon 05-08-19 14:13:16, Vlastimil Babka wrote:
> > On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > > Hello,
> > > 
> > > There's this bug which has been bugging many people for many years
> > > already and which is reproducible in less than a few minutes under the
> > > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > > defaults.
> > > 
> > > Steps to reproduce:
> > > 
> > > 1) Boot with mem=4G
> > > 2) Disable swap to make everything faster (sudo swapoff -a)
> > > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > > 4) Start opening tabs in either of them and watch your free RAM decrease
> > > 
> > > Once you hit a situation when opening a new tab requires more RAM than
> > > is currently available, the system will stall hard. You will barely  be
> > > able to move the mouse pointer. Your disk LED will be flashing
> > > incessantly (I'm not entirely sure why). You will not be able to run new
> > > applications or close currently running ones.
> > 
> > > This little crisis may continue for minutes or even longer. I think
> > > that's not how the system should behave in this situation. I believe
> > > something must be done about that to avoid this stall.
> > 
> > Yeah that's a known problem, made worse SSD's in fact, as they are able
> > to keep refaulting the last remaining file pages fast enough, so there
> > is still apparent progress in reclaim and OOM doesn't kick in.
> > 
> > At this point, the likely solution will be probably based on pressure
> > stall monitoring (PSI). I don't know how far we are from a built-in
> > monitor with reasonable defaults for a desktop workload, so CCing
> > relevant folks.
> 
> Another potential approach would be to consider the refault information
> we have already for file backed pages. Once we start reclaiming only
> workingset pages then we should be trashing, right? It cannot be as
> precise as the cost model which can be defined around PSI but it might
> give us at least a fallback measure.

NAK, this does *not* work. Not even as fallback.

There is no amount of refaults for which you can say whether they are
a problem or not. It depends on the disk speed (obvious) but also on
the workload's memory access patterns (somewhat less obvious).

For example, we have workloads whose cache set doesn't quite fit into
memory, but everything else is pretty much statically allocated and it
rarely touches any new or one-off filesystem data. So there is always
a steady rate of mostly uninterrupted refaults, however, most data
accesses are hitting the cache! And we have fast SSDs that compensate
for the refaults that do occur. The workload runs *completely fine*.

If the cache hit rate was lower and refaults would make up a bigger
share of overall page accesses, or if there was a spinning disk in
that machine, the machine would be completely livelocked - with the
same exact number of refaults and the same amount of RAM!

That's not just an approximation error that we could compensate
for. The same rate of refaults in a system could mean anything from 0%
(all refaults readahead, and IO is done before workload notices) to
100% memory pressure (all refaults are cache misses and workload fully
serialized on pages in question) - and anything in between (a subset
of threads of the workload wait for a subset of the refaults).

The refault rate by itself carries no signal on workload progress.

This is the whole reason why psi was developed - to compare the time
you spend on refaults (encodes IO speed and readhahead efficiency)
compared to the time you spend on being productive (encodes refaults
as share of overall memory accesses of a the workload).