On Fri, Jan 24, 2020 at 11:04:23AM +0100, Michal Hocko wrote:
> [Cc Johannes. The collected vmstat data is in http://lkml.kernel.org/r/1579844599463.32567@xxxxxxxxxxx]
>
> On Fri 24-01-20 05:43:19, Chris Edwards wrote:
> > > Could you collect /proc/vmstat every second or so while you observe this
> > > behavior? This should give us more information than vmstat(8) output.
> >
> > Hi Michal,
> >
> > Thanks for the suggestion - I've re-run the test on a 5.5.0-rc6 kernel
> > built from source using the default config, which exhibits the same
> > behaviour. Please see attachment; I hope the format is OK.
>
> I personally would have liked one snapshot per file slightly easier to
> parse but no problem (I have simply broken out counters per file). In
> future the following would be easier to process ;)
> while true
> do
>     TS="$(date +%s)"
>     cp /proc/vmstat vmstat.$TS
>     sleep 1s
> done
>
> > Here's the timeline of events:
> > 18:25:00 start
> > 18:25:10 run `stress` to limit available memory (grabs 0.9 x MemAvailable)
>
> I assume this will allocate anonymous memory.
> time 18:25:10
> nr_free_pages 2934822
> nr_inactive_anon 57550
> nr_active_anon 5733
> nr_inactive_file 1428
> nr_active_file 21857
> nr_unevictable 6102
> pswpin 8
> pswpout 390136
>
> So there is 11GB of free memory. And 1.5GB of memory swapped out in the
> past (probably a result of previous tests), we are going to use this
> number as a base for future comparing because pswpout counter is
> incremental.
>
> Anonymous LRUs have 240MB of memory and there is 90MB of file backed.
>
> > 18:25:20 run `dd` to exercise the buffer cache
>
> time 18:25:20
> nr_free_pages 367818
> nr_inactive_anon 57693
> nr_active_anon 2560480
> nr_inactive_file 7110
> nr_active_file 23332
> nr_unevictable 6195
> pswpin 8
> pswpout 390136
>
> The free memory dropped to 1.4GB as a result of your `stress` load. All
> that memory landed in the anonymous LRU lists (9GB of memory comparing
> to 240MB before the test).
> File backed memory's grown to 118MB. No
> swapout/in during that time period.
>
> Nothing really unexpected so far. There is still quite some room to fit
> the IO workload in. Let's see how the pswpout evolves over time.
>
> $ awk '{diff=$1-prev; if (prev&&diff) printf "%d %d %d\n", NR, $1, diff; prev=$1}' pswpout
> 30 392136 2000
> 31 395513 3377
> 32 399132 3619
> 33 403101 3969
> 34 407211 4110
> 35 410812 3601
> 36 414120 3308
> 37 418119 3999
> 38 422116 3997
> 39 424154 2038
> 40 428110 3956
>
> So the swapout started around 18:25:30
> $ sed '1,28d;' nr_free_pages | head
> 118413
> 100516
> 98751
> 95914
> 97059
> 101303
> 101588
> 97801
> 99415
> 99842
>
> The free memory dropped down to ~400MB which is likely the
> min_free_kbytes defined watermark
>
> $ sed '1,28d;' nr_inactive_anon | head -n3
> 57633
> 57828
> 58932
> $ sed '1,28d;' nr_active_anon | head -n3
> 2560522
> 2560148
> 2559087
>
> Anonymous list around 10GB
>
> $ sed '1,28d;' nr_inactive_file | head -n3
> 255957
> 276400
> 278865
> $ sed '1,28d;' nr_active_file | head -n3
> 23334
> 23439
> 22743
>
> File lists 1.1GB. Inactive file LRU is quite large and
> $ sed '1,28d;' nr_dirty | head -n3
> 0
> 0
> 0
> $ sed '1,28d;' nr_writeback | head -n3
> 0
> 0
> 141
>
> The data shouldn't be dirty so we should preferably reclaim those pages
> rather than swap out. That is a little bit surprising to me. Johannes what
> do you think about this?

There are a couple of workingset_activate - not many, but it could be
enough. I wonder if Kuo-Hsin's patch is a bit too aggressive:
2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa

We may want to change the logic such that we scan active file when
there are inactive refaults, but only go for anon if there are active
refaults. We'd need to take snapshots of WORKINGSET_RESTORE as well.

(There are no restore events in the logs, meaning the active file list
is turning over a bit, but the cache isn't thrashing per se).

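The split proposed above can be sketched roughly as follows. This is a
user-space illustration only, not kernel code. It assumes that "inactive
refaults" correspond to growth in the workingset_activate counter and
"active refaults" to growth in workingset_restore between two snapshots;
`RefaultDelta` and `reclaim_targets` are hypothetical names made up for
the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RefaultDelta:
    """Growth of two /proc/vmstat counters between two snapshots."""
    activate: int  # delta of workingset_activate (assumed: inactive refaults)
    restore: int   # delta of workingset_restore (assumed: active refaults)


def reclaim_targets(delta: RefaultDelta) -> List[str]:
    """Return the LRU lists to scan, in order of preference (illustrative)."""
    targets = ["inactive_file"]       # clean cache is always the first choice
    if delta.activate > 0:
        targets.append("active_file")  # inactive refaults: also age the active file list
    if delta.restore > 0:
        targets.append("anon")         # active refaults: cache is thrashing, consider anon
    return targets
```

With activate events but no restore events, which is what the logs show,
this model would scan the active file list and leave anon alone.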
Just to confirm, Chris, would you be able to test whether the following
patch fixes the problem you are seeing?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74e8edce83ca..1f1403681960 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2744,7 +2744,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * anonymous pages.
 	 */
 	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
-	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+	if (file >> sc->priority && !inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
 		sc->cache_trim_mode = 1;
 	else
 		sc->cache_trim_mode = 0;
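
For a feel of what the patched condition does with the numbers from this
thread, here is a small user-space model. It is a sketch under stated
assumptions, not the kernel implementation: it assumes 4 KiB pages, an
initial reclaim priority of 12, and a simplified inactive_is_low() that
only applies the size-based inactive:active ratio (the integer square
root of 10 x the list size in GB).

```python
import math

PAGE_SHIFT = 12    # assumption: 4 KiB pages
DEF_PRIORITY = 12  # assumption: initial reclaim priority


def inactive_is_low(inactive: int, active: int) -> bool:
    """Simplified model: is the inactive list small relative to the active one?"""
    gb = (inactive + active) >> (30 - PAGE_SHIFT)   # combined list size in GiB
    ratio = math.isqrt(10 * gb) if gb else 1        # target active:inactive ratio
    return inactive * ratio < active


def cache_trim_mode(inactive_file: int, active_file: int,
                    priority: int = DEF_PRIORITY) -> bool:
    """Model of the patched check: plenty of inactive file pages and the list is not low."""
    return bool(inactive_file >> priority) and not inactive_is_low(inactive_file, active_file)


# First sample after `sed '1,28d;'`: large inactive file list.
print(cache_trim_mode(255957, 23334))
# 18:25:20 sample: inactive file list still tiny compared to active.
print(cache_trim_mode(7110, 23332))
```

Under these assumptions the model selects cache trim mode for the first
sample after swapout begins (255957 inactive vs 23334 active file pages)
and not for the 18:25:20 sample (7110 vs 23332), so reclaim would prefer
the clean cache over anon at the point where swapping was observed.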