On Tue, 30 Oct 2018 at 16:30, Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Tue 30-10-18 14:44:27, Vlastimil Babka wrote:
> > On 10/22/18 3:19 AM, Marinko Catovic wrote:
> > > On Wed, 29 Aug 2018 at 18:44, Marinko Catovic wrote:
> [...]
> > >> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> > >> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz
> > >
> > > There we go again.
> > >
> > > First of all, I have set up this monitoring on 1 host; as a matter of
> > > fact it did not occur on that single one for days and weeks now, so I
> > > set it up again on all the hosts and it just happened again on
> > > another one.
> > >
> > > This issue is far from over, even after upgrading to the latest 4.18.12.
> > >
> > > https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
> > > https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz
> >
> > I have plotted the vmstat data using the attached script and got the
> > attached plots. The X axis shows the vmstat snapshots, almost 14k of
> > them, taken every 5 seconds, so almost 19 hours. I can see the
> > following phases:
>
> Thanks a lot. I like the script very much!
>
> [...]
>
> > 12000 - end:
> > - free pages growing sharply
> > - page cache declining sharply
> > - slab still slowly declining
>
> $ cat filter
> pgfree
> pgsteal_
> pgscan_
> compact
> nr_free_pages
>
> $ grep -f filter -h vmstat.1539866837 vmstat.1539874353 | awk '{if (c[$1]) {printf "%s %d\n", $1, $2-c[$1]}; c[$1]=$2}'
> nr_free_pages 4216371
> pgfree 267884025
> pgsteal_kswapd 0
> pgsteal_direct 11890416
> pgscan_kswapd 0
> pgscan_direct 11937805
> compact_migrate_scanned 2197060121
> compact_free_scanned 4747491606
> compact_isolated 54281848
> compact_stall 1797
> compact_fail 1721
> compact_success 76
>
> So we have ended up with 16G of freed pages in that last time period.
> Kswapd was sleeping throughout that time but direct reclaim was quite
> active: ~46GB of pages recycled. Note that many more pages were freed,
> which suggests there was quite a lot of memory allocation/free activity.
>
> One notable thing here is that there shouldn't be any reason to do
> direct reclaim when kswapd itself doesn't do anything. Either kswapd is
> blocked on something (though I find it quite surprising to see it in
> that state for the whole 1500s time period), or we are simply not low
> on free memory at all. That would point towards compaction-triggered
> memory reclaim, which is accounted as direct reclaim as well. Direct
> compaction triggered more than once a second on average. We shouldn't
> really reclaim unless we are low on memory, but repeatedly failing
> compaction could just add up and reclaim a lot in the end.
> There seem to be quite a lot of low order requests as per your trace buffer:
>
> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
> 1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> 93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> That leaves us with
> 5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> 121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> 22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> 1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> 3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> 10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> 114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> 67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> By and large, the kernel stack allocations are in the lead. You can get
> some relief by enabling CONFIG_VMAP_STACK. There is also a notable
> number of THP page allocations. Just curious, are you running on a NUMA
> machine? If yes, [1] might be relevant. Other than that nothing really
> jumped at me.
>
> [1] http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@xxxxxxxxxx
> --
> Michal Hocko
> SUSE Labs

Thanks a lot, Vlastimil!

I would not really know whether this is a NUMA machine; it is just a
usual server running with an i7-8700 and ECC RAM. How would I find out?

So I should set CONFIG_VMAP_STACK=y and try that..?
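
For the NUMA question, I suppose I could check it myself with something
like the commands below (assuming numactl is installed; lscpu and /sys
work without it) - a single node0 would mean the machine is not NUMA:

$ lscpu | grep -i 'numa node'              # "NUMA node(s): 1" means a single node
$ ls -d /sys/devices/system/node/node*     # only node0 present = not NUMA
$ numactl --hardware                       # lists nodes and per-node memory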
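
And for CONFIG_VMAP_STACK I would first check whether the running kernel
already has it before rebuilding; the config paths below are just the
usual locations and may differ per distro, and /proc/config.gz only
exists when the kernel was built with CONFIG_IKCONFIG_PROC:

$ grep VMAP_STACK /boot/config-$(uname -r)
$ zgrep VMAP_STACK /proc/config.gz         # alternative, if exposed via procfs
# CONFIG_VMAP_STACK=y means virtually mapped stacks are already in use;
# otherwise set CONFIG_VMAP_STACK=y in the kernel config, rebuild and reboot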