On Fri 21-02-20 13:08:24, Sultan Alsawaf wrote:
[...]
> Both of these logs are attached in a tarball.

Thanks! First of all

$ grep pswp vmstat.1582318979
pswpin 0
pswpout 0

suggests that you do not have any swap storage, right? I will get back
to this later.

Now, let's have a look at the snapshots. We have regular 1s snapshots
initially, but then we have

vmstat.1582318734
vmstat.1582318736
vmstat.1582318758
vmstat.1582318763
vmstat.1582318768
[...]
vmstat.1582318965
vmstat.1582318975
vmstat.1582318976

That is a 242s period during which even a simple bash script was
struggling to write a snapshot of /proc/vmstat, something which by
itself shouldn't really depend on the system activity much.

Let's have a look at two randomly chosen consecutive snapshots from
this period, vmstat.1582318736 and vmstat.1582318758:

                            base      diff
allocstall_dma                 0         0
allocstall_dma32               0         0
allocstall_movable          5773         0
allocstall_normal            906         0

To my surprise there was no invocation of direct reclaim in this
period, even though I would have expected it considering the long
stall. So the source of the stall might be something other than direct
reclaim.

compact_stall                 13         1

Direct compaction has been invoked, but that shouldn't cause a major
stall for all processes.

nr_active_anon            133932       236
nr_inactive_anon            9350     -1179
nr_active_file               318       190
nr_inactive_file             673        56
nr_unevictable             51984         0

The amount of anonymous memory is not really high (133932 + 9350
pages, i.e. ~560MB with 4kB pages), but the file LRU is _really_ low
(991 pages, ~4MB) and the unevictable list is at ~200MB. That gets us
to ~760MB, which is 74% of the memory. Please note that your mem=2G
setup in fact gives you only 1G of memory (based on the zone_info you
have posted).

That by itself is not unusual, but such a tiny page cache is worrying
because I would expect heavy thrashing: most of the executables are
going to require major faults, and since anonymous memory obviously
cannot be swapped out, there is no option other than to refault
constantly.

pgscan_kswapd           64788716  14157035
pgsteal_kswapd          29378868   4393216
pswpin                         0         0
pswpout                        0         0
workingset_activate      3840226    169674
workingset_refault      29396942   4393013
workingset_restore       2883042    106358

And here we can see it clearly happening. Note how the number of
workingset refaults (4393013) matches the number of pages reclaimed by
kswapd (4393216) almost exactly: whatever kswapd frees is immediately
faulted back in. I would be really curious whether adding swap space
would help some.

Now to your patch and why it helps here. It seems quite obvious that
the only effectively reclaimable memory (the page cache) is not going
to satisfy the high watermark target:

Node 0, zone    DMA32
  pages free     87925
        min      11090
        low      13862
        high     16634

kswapd has a feedback mechanism to back off when the zone is hopeless
from the reclaim point of view AFAIR, but it seems to have failed in
this particular situation. It should have given up and deferred to
direct reclaim, which would eventually trigger the OOM killer. Your
patch works around this by bailing out of the kswapd reclaim early, so
the part of the page cache needed for the executables to make progress
stays resident and the system can move forward.

The proper fix should, however, check the amount of reclaimable pages
and back off if they cannot meet the target IMO. We cannot rely on
general reclaimability here, because that could really be just
thrashing. I am appending a few (untested) sketches below to
illustrate what I mean.

Thoughts?
--
Michal Hocko
SUSE Labs
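
P.S. A small tool like the following can automate the per-counter
diffing done above. This is an untested sketch, not anything that was
actually used here; a snapshot is just the "name value" lines of
/proc/vmstat:

/*
 * vmstat-diff: print base value and delta for every counter that
 * appears in both snapshots.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COUNTERS 512

struct counter {
	char name[64];
	unsigned long long val;
};

static int read_snapshot(const char *path, struct counter *c, int max)
{
	FILE *f = fopen(path, "r");
	int n = 0;

	if (!f) {
		perror(path);
		exit(1);
	}
	while (n < max && fscanf(f, "%63s %llu", c[n].name, &c[n].val) == 2)
		n++;
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	static struct counter base[MAX_COUNTERS], new[MAX_COUNTERS];
	int nb, nn, i, j;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <base> <new>\n", argv[0]);
		return 1;
	}
	nb = read_snapshot(argv[1], base, MAX_COUNTERS);
	nn = read_snapshot(argv[2], new, MAX_COUNTERS);

	printf("%-24s %12s %12s\n", "", "base", "diff");
	for (i = 0; i < nb; i++) {
		for (j = 0; j < nn; j++) {
			if (strcmp(base[i].name, new[j].name))
				continue;
			printf("%-24s %12llu %+12lld\n", base[i].name,
			       base[i].val,
			       (long long)(new[j].val - base[i].val));
			break;
		}
	}
	return 0;
}

$ ./vmstat-diff vmstat.1582318736 vmstat.1582318758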
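
The back-off mechanism I was referring to is the kswapd_failures logic
in mm/vmscan.c (balance_pgdat() bumps the counter on a no-progress
round, prepare_kswapd_sleep() gives up after MAX_RECLAIM_RETRIES of
them, and shrink_node() resets it on any progress). The following is a
userspace toy model of that logic as I remember it, not the kernel
code, but it shows why the detection never fires in your workload:

/*
 * Toy model of the kswapd back-off as I remember it: no-progress
 * rounds accumulate, MAX_RECLAIM_RETRIES of them make kswapd give up,
 * but _any_ progress resets the counter.
 */
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16

static int kswapd_failures;

/* one reclaim round; returns the number of pages "reclaimed" */
static unsigned long reclaim_round(unsigned long reclaimable)
{
	/* thrashing cache: pages are freed and immediately refaulted,
	 * so every round there is *something* to steal again */
	return reclaimable;
}

int main(void)
{
	unsigned long round, reclaimed;

	for (round = 0; round < 100; round++) {
		reclaimed = reclaim_round(991);	/* ~4MB of file LRU */
		if (reclaimed)
			kswapd_failures = 0;	/* shrink_node() */
		else
			kswapd_failures++;	/* balance_pgdat() */

		if (kswapd_failures >= MAX_RECLAIM_RETRIES) {
			puts("kswapd gives up -> direct reclaim/OOM");
			return 0;
		}
	}
	puts("kswapd never backed off: every round reclaimed something");
	return 0;
}

That would explain the failure here: kswapd _is_ making nominal
progress every round (pgsteal_kswapd grew by ~4.4M pages) because the
thrashing page cache is always reclaimable on paper, so the failure
counter keeps getting reset and the zone never looks hopeless.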
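
And to illustrate the direction I mean for the proper fix, again as a
completely untested userspace toy: back off when even reclaiming every
reclaimable page cannot close the gap to the high watermark. In the
kernel this would map onto zone_reclaimable_pages(), zone_page_state()
and high_wmark_pages(); where exactly to hook the check
(pgdat_balanced()?) is hand-waved, and the free-page value below is an
assumption (the check only matters once free has dipped under high):

#include <stdio.h>
#include <stdbool.h>

struct zone_model {
	unsigned long free;
	unsigned long reclaimable;	/* file LRU; no swap, so no anon */
	unsigned long high_wmark;
};

/* the target is unreachable if reclaiming everything is not enough */
static bool target_unreachable(const struct zone_model *z)
{
	return z->free + z->reclaimable < z->high_wmark;
}

int main(void)
{
	/* roughly your report: ~1000 file pages, high watermark 16634;
	 * free is an assumed value below the watermark */
	struct zone_model z = {
		.free = 5000,
		.reclaimable = 991,
		.high_wmark = 16634,
	};

	if (target_unreachable(&z))
		puts("back off: leave it to direct reclaim / OOM killer");
	else
		puts("keep reclaiming towards the high watermark");
	return 0;
}

Note the caveat from above applies: a plain zone_reclaimable_pages()
count would happily include the thrashing page cache, so a real fix
would probably have to factor in the refault (workingset) information
one way or another.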