On Tue 30-10-18 14:44:27, Vlastimil Babka wrote:
> On 10/22/18 3:19 AM, Marinko Catovic wrote:
> > On Wed, 29 Aug 2018 at 18:44, Marinko Catovic wrote:
[...]
> >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> >>
> >
> > There we go again.
> >
> > First of all, I had set up this monitoring on one host only; as it
> > turned out, the issue did not occur on that single host for days and
> > weeks, so I set it up again on all the hosts and it just happened
> > again on another one.
> >
> > This issue is far from over, even after upgrading to the latest
> > 4.18.12.
> >
> > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
>
> I have plotted the vmstat using the attached script and got the
> attached plots. The X axis is the vmstat snapshots, almost 14k of
> them, one every 5 seconds, so almost 19 hours. I can see the following
> phases:

Thanks a lot. I like the script a lot!

[...]

> 12000 - end:
> - free pages growing sharply
> - page cache declining sharply
> - slab still slowly declining

$ cat filter
pgfree
pgsteal_
pgscan_
compact
nr_free_pages
$ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}'
nr_free_pages 4216371
pgfree 267884025
pgsteal_kswapd 0
pgsteal_direct 11890416
pgscan_kswapd 0
pgscan_direct 11937805
compact_migrate_scanned 2197060121
compact_free_scanned 4747491606
compact_isolated 54281848
compact_stall 1797
compact_fail 1721
compact_success 76

So we ended up with ~16GB more free pages over that last time period.
Kswapd was sleeping throughout, but direct reclaim was quite active:
~46GB of pages recycled. Note that many more pages were freed in total
(pgfree), which suggests quite heavy memory allocation/free activity.
One notable thing here is that there shouldn't be any reason to perform
direct reclaim when kswapd itself isn't doing anything. Kswapd could be
blocked on something, but I would find it quite surprising to see it in
that state for the whole 1500s time period, so more likely we are simply
not low on free memory at all. That would point towards
compaction-triggered memory reclaim, which is accounted as direct
reclaim as well: direct compaction stalled more than once per second on
average. We shouldn't really reclaim unless we are low on memory, but
repeatedly failing compaction could just add up and reclaim a lot in
the end.
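Something like the loop below should show whether those compaction
stalls keep piling up while kswapd stays asleep (just an illustrative
sketch on my side; it reads the same /proc/vmstat counters as the
filter file above, and the 5s interval simply matches your snapshots):

$ while :; do date; grep -E 'compact_(stall|fail)|pg(scan|steal)_(kswapd|direct)' /proc/vmstat; sleep 5; done

If compact_stall/compact_fail grow while pgscan_kswapd stays flat, that
would be consistent with the theory above.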
There seem to be quite a lot of low order requests as per your trace
buffer:

$ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
   1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
  93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

We can safely rule out the NOWAIT and ATOMIC allocations because those
do not reclaim. That leaves us with:

   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

By far the largest group is the kernel stack allocations (order=1
GFP_KERNEL_ACCOUNT|__GFP_ZERO). You could get some relief there by
enabling CONFIG_VMAP_STACK. There is also a notable number of THP
allocations (order=9 GFP_TRANSHUGE|__GFP_THISNODE). Just curious, are
you running on a NUMA machine? If yes, [1] might be relevant.

Other than that, nothing really jumped out at me.

[1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@xxxxxxxxxx
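Btw. both are easy to check, e.g. (illustrative commands only; the
config file path depends on your distro, and /proc/config.gz works as
well if CONFIG_IKCONFIG_PROC is enabled):

$ grep CONFIG_VMAP_STACK /boot/config-$(uname -r)   # =y means vmapped stacks
$ lscpu | grep -i 'numa node(s)'                    # a count >1 means NUMA

--
Michal Hocko
SUSE Labs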