On Tue 30-10-18 14:44:27, Vlastimil Babka wrote:
> On 10/22/18 3:19 AM, Marinko Catovic wrote:
> > On Wed, 29 Aug 2018 at 18:44, Marinko Catovic wrote:
[...]
> >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> >>
> >
> > There we go again.
> >
> > First of all, I had set up this monitoring on one host only; as it
> > turned out, the issue did not occur on that single host for days and
> > weeks, so I set it up again on all the hosts and it just happened
> > again on another one.
> >
> > This issue is far from over, even after upgrading to the latest
> > 4.18.12.
> >
> > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
>
> I have plotted the vmstat using the attached script and got the
> attached plots. The X axis is the vmstat snapshots, almost 14k of
> them, one every 5 seconds, so almost 19 hours. I can see the following
> phases:

Thanks a lot. I like the script a lot!

[...]

> 12000 - end:
> - free pages growing sharply
> - page cache declining sharply
> - slab still slowly declining

$ cat filter
pgfree
pgsteal_
pgscan_
compact
nr_free_pages
$ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}'
nr_free_pages 4216371
pgfree 267884025
pgsteal_kswapd 0
pgsteal_direct 11890416
pgscan_kswapd 0
pgscan_direct 11937805
compact_migrate_scanned 2197060121
compact_free_scanned 4747491606
compact_isolated 54281848
compact_stall 1797
compact_fail 1721
compact_success 76

So we ended up with ~16GB more free pages over that last time period.
Kswapd was sleeping throughout, but direct reclaim was quite active:
~46GB of pages recycled. Note that many more pages were freed in total
(pgfree), which suggests quite heavy memory allocation/free activity.
One notable thing here is that there shouldn't be any reason to perform
direct reclaim when kswapd itself isn't doing anything. Kswapd could be
blocked on something, but I would find it quite surprising to see it in
that state for the whole 1500s time period, so more likely we are simply
not low on free memory at all. That would point towards
compaction-triggered memory reclaim, which is accounted as direct
reclaim as well: direct compaction stalled more than once per second on
average. We shouldn't really reclaim unless we are low on memory, but
repeatedly failing compaction could just add up and reclaim a lot in
the end.
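Something like the loop below should show whether those compaction
stalls keep piling up while kswapd stays asleep (just an illustrative
sketch on my side; it reads the same /proc/vmstat counters as the
filter file above, and the 5s interval simply matches your snapshots):

$ while :; do date; grep -E 'compact_(stall|fail)|pg(scan|steal)_(kswapd|direct)' /proc/vmstat; sleep 5; done

If compact_stall/compact_fail grow while pgscan_kswapd stays flat, that
would be consistent with the theory above.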
There seem to be quite a lot of low order requests as per your trace
buffer:

$ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
   1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
  93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

We can safely rule out the NOWAIT and ATOMIC allocations because those
do not reclaim. That leaves us with:

   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

By far the largest group is the kernel stack allocations (order=1
GFP_KERNEL_ACCOUNT|__GFP_ZERO). You could get some relief there by
enabling CONFIG_VMAP_STACK. There is also a notable number of THP
allocations (order=9 GFP_TRANSHUGE|__GFP_THISNODE). Just curious, are
you running on a NUMA machine? If yes, [1] might be relevant.

Other than that, nothing really jumped out at me.

[1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@xxxxxxxxxx
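Btw. both are easy to check, e.g. (illustrative commands only; the
config file path depends on your distro, and /proc/config.gz works as
well if CONFIG_IKCONFIG_PROC is enabled):

$ grep CONFIG_VMAP_STACK /boot/config-$(uname -r)   # =y means vmapped stacks
$ lscpu | grep -i 'numa node(s)'                    # a count >1 means NUMA

--
Michal Hocko
SUSE Labs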