On 8/9/19 7:31 PM, Johannes Weiner wrote:
>> It made a difference, but not enough, it seems. Before the patch I could
>> observe "io:full avg10" around 75% and "memory:full avg10" around 20%,
>> after the patch, "memory:full avg10" went to around 45%, while io stayed
>> the same (BTW should the refaults be discounted from the io counters, so
>> that the sum is still <= 100%?)
>>
>> As a result I could change the knobs to recover successfully with
>> thrashing detected for 10s of 40% memory pressure.
>>
>> Perhaps being low on memory we can't detect refaults so well due to the
>> limited number of shadow entries, or there was genuine non-refault I/O
>> in the mix. The detection would then probably have to look at both I/O
>> and memory?
>
> Thanks for testing it. It's possible that there is legitimate
> non-refault IO, and there can of course be interaction between that
> and the refault IO. But to be sure that all genuine refaults are
> captured, can you record the workingset_* values from /proc/vmstat
> before/after the thrash storm? In particular, workingset_nodereclaim
> would indicate whether we are losing refault information.

Let's see...
After a ~45 second stall that I ended with alt-sysrq-f, I see the
following pressure info:

cpu:some avg10=1.04 avg60=2.22 avg300=2.01 total=147402828
io:some avg10=97.13 avg60=65.48 avg300=28.86 total=240442256
io:full avg10=83.93 avg60=57.05 avg300=24.56 total=212125506
memory:some avg10=54.62 avg60=33.69 avg300=15.89 total=67989547
memory:full avg10=44.48 avg60=28.17 avg300=13.17 total=55963961

Captured vmstat workingset values before:

workingset_nodes 15756
workingset_refault 6111959
workingset_activate 1805063
workingset_restore 919138
workingset_nodereclaim 40796
pgpgin 33889644

and after:

workingset_nodes 14842
workingset_refault 9248248
workingset_activate 1966317
workingset_restore 961179
workingset_nodereclaim 41060
pgpgin 46488352

It doesn't seem like we're losing too much refault info, and it's indeed
a mix of refaults and other I/O? (The difference is ~3.1M for refaults
and ~12.6M for pgpgin.)

> [ The different resource pressures are not meant to be summed
>   up. Refaults truly are both IO events and memory events: they
>   indicate memory contention, but they also contribute to the IO
>   load. So both metrics need to include them, or it would skew the
>   picture when you only look at one of them. ]

Understood, makes sense.
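For reference, the before/after deltas quoted above can be checked with a
small script. A minimal sketch (the counter names are real /proc/vmstat
fields; the embedded snapshot values are just the ones from this report,
and the parse_vmstat helper is illustrative, not an existing tool):

```python
# Diff two /proc/vmstat-style snapshots to see how the refault counters
# moved relative to overall paging I/O during the thrash storm.

def parse_vmstat(text):
    """Parse 'name value' lines (as in /proc/vmstat) into a dict of ints."""
    return {name: int(value)
            for name, value in (line.split() for line in text.strip().splitlines())}

# Snapshot values taken from the report above.
before = parse_vmstat("""
workingset_refault 6111959
workingset_nodereclaim 40796
pgpgin 33889644
""")

after = parse_vmstat("""
workingset_refault 9248248
workingset_nodereclaim 41060
pgpgin 46488352
""")

delta = {name: after[name] - before[name] for name in before}

for name, value in delta.items():
    print(f"{name}: +{value}")
```

Running it confirms the numbers discussed: workingset_refault grew by
about 3.1M and pgpgin by about 12.6M, while workingset_nodereclaim barely
moved, so very little refault information was lost to shadow node
reclaim.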