On Thu 09-01-20 23:12:49, Chris Murphy wrote:
> On Thu, Jan 9, 2020 at 4:53 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > > > I've updated the bug, attaching kernel messages and /proc/vmstat in
> > > > 1s increments, although quite often during the build multiple seconds
> > > > of sampling were just skipped as the system was under too much
> > > > pressure.
> > >
> > > I have a tool to reduce that problem (see attached).
> >
> > Now for real
>
> https://bugzilla.kernel.org/show_bug.cgi?id=206117#c6

Let me paste the comment here:

: Using Michal Hocko's read_vmstat.c to record ~12 minutes while
: running 'ninja' to compile webkitgtk. I killed ninja ~20s before
: killing read_vmstat. The system was essentially unusable for 7 minutes
: prior to killing ninja, which could only be done remotely (entering
: and executing the kill command took maybe half a minute; ssh was very
: sluggish). During those 7 minutes, swap was not even half full, but
: the GUI was frozen most of the time, including the mouse pointer.

file:stdout timeout:1.0s buffer_size:10485760
=S 1578635397

So you have redirected the output (stdout) to a file. This is less
effective than letting the tool write to a file directly, because the
program makes sure to preallocate and mlock the output file's data as
well. Anyway, let's have a look at what you managed to gather:

$ grep "=S" bug206117_read_vmstat.out | awk '{diff=$2-prev; if (prev && diff > 1) printf "%d %d\n", prev, diff; prev=$2}'
1578635742 2
1578635755 2
1578635776 2
1578635878 2
1578635943 2
1578635967 2
1578636027 2
1578636053 2
1578636111 3
1578636131 2
1578636138 2

So there are a few instances where the next snapshot was not gathered
the very next second. Most of the time the gap was 2s, which does not
necessarily indicate a problem: with the 1s timestamp resolution this
could simply be bad timing, with the real diff close to 1s. There is
one timestamp where the gap was 3s, though, and that looks like a real
stall (it shouldn't happen with the mlocked output file). Let's focus
on that timestamp (11 min after the start):

1578636111                     diff
pgalloc_dma             4          0
pgalloc_movable         0          0
pgalloc_dma32     6468509      14278
pgalloc_normal   40905599      37286

That is ~200M worth of allocations, which is not much.

pgsteal_kswapd    3384259      10763
pgscan_kswapd     5282874      21772
pgsteal_direct    3189026      30400
pgscan_direct     8372523      85974

That is ~160M reclaimed during that time (10763 + 30400 pages), and the
reclaim was not particularly effective: kswapd 49% and direct reclaim
35%.

nr_active_file      26652      -3564
nr_inactive_file    25731      -4772
nr_active_anon    1502798       6433
nr_inactive_anon   283561       4385

The file LRUs are quite small (~200M) and shrank somewhat, while the
anonymous LRUs eat almost 7G, with the actively referenced list 5 times
bigger than the inactive one (5.8G). That suggests that a large part of
the memory is heavily in use.

pswpin            1222809       4882
pswpout           3475546       7814

Around 30M has been swapped out, but around 60% of that amount has been
swapped back in, which suggests that we are actively refaulting from
swap. Please note that a large number of anonymous refaults on swap
entries might be satisfied from the swap cache; that information is
unfortunately not present in /proc/vmstat.

On the page cache front:

workingset_activate    661011    7566
workingset_refault    1894851   14868
workingset_restore     398375    4622

50% of the refaulted pages get activated right away and 30% are
recognized as part of the workingset (restored).
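As a side note, the per-interval reclaim efficiency could be tracked
over the whole run with a one-liner in the same spirit as the ones
below. This is only a sketch: it assumes that pgsteal_kswapd precedes
pgscan_kswapd within each snapshot, as in the excerpt above.

$ grep -E "=S|pgsteal_kswapd|pgscan_kswapd" bug206117_read_vmstat.out | \
  awk '/=S/{ts=$2}
       /pgsteal_kswapd/{steal=$2}
       /pgscan_kswapd/{scan=$2;
            # efficiency of the interval ending at this snapshot
            if (pscan && scan > pscan)
                printf "%d %.1f\n", ts, (steal-psteal)*100/(scan-pscan);
            psteal=steal; pscan=scan}'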
Now let's see how those numbers evolve over time:

$ grep pswp bug206117_read_vmstat.out | awk '/pswpin/{pswpin=$2} /pswpout/{pswpout=$2; if (pswpout) {print pswpin*100/pswpout}}' | calc_min_max.awk
min: 0.06 max: 35.28 avg: 22.49 std: 11.30 nr: 592

The swapin/swapout ratio builds up over time, which matches the
expectation that the memory demand builds up as well: the workload does
not fit into RAM, yet it still needs the memory that has been swapped
out.

The first swapout happens shortly before we start direct reclaim:

$ awk '/=S/{ts=$2} /allocstall/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635582
$ awk '/=S/{ts=$2} /pswpout/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635581

Page cache refaults show somewhat richer dynamics:

$ grep workingset bug206117_read_vmstat.out | awk '/workingset_refault/{workingset_refault=$2} /workingset_activate/{workingset_activate=$2; if (workingset_refault) {print workingset_activate*100/workingset_refault}}'
100
91.453
91.453
91.6201
91.6201
96.587
96.7213
96.7638
96.7777
97.1401
70.2157
[...]
22.4038
22.4447
22.3699
22.3929
22.3724
22.3897
22.3873
22.3846
22.3847
[...]
34.9863
34.9863
34.9847
34.9847
34.9847
34.9846
34.9239

which means that as soon as we start refaulting, almost everything that
comes back gets activated; the activation ratio then decreases for a
while, most likely because we start swapping out more, and it settles
at around 35% in the end.

All that being said, the numbers confirm my initial suspicion. The
workload you are trying to run simply does not fit into memory, and
therefore it has to refault both from disk for the page cache (likely
including executables) and from the swap space.

It is not really obvious whether changing the file vs. anonymous
reclaim balance would help much in this case. As we can see, swapout
starts before tasks get stalled in direct reclaim, and it is not really
clear whether swapping out more would relieve the pressure on the page
cache.

So to me it looks like the reclaim is not doing a terrible job here
considering the memory demand. It would be interesting to see whether
tuning vm.swappiness to 100 helps, but considering how large the
anonymous active list is, I would be very skeptical. So in the end it
is really hard to see what the kernel should have done better in this
overcommitted case. Killing memory hogs would likely kill an active
workload, which would lead to a better desktop experience, but I can
imagine setups which simply want to get the work done, albeit
sloooowly.
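In case you want to give vm.swappiness a try: it can be flipped at
runtime via the standard sysctl interface (as root; this is generic
and not persistent across reboots):

$ sysctl -w vm.swappiness=100

or equivalently

$ echo 100 > /proc/sys/vm/swappiness

--
Michal Hocko
SUSE Labs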