On Tue, Oct 30, 2018 at 6:03 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 10/30/18 5:08 PM, Marinko Catovic wrote:
> >> One notable thing here is that there shouldn't be any reason to do
> >> direct reclaim when kswapd itself doesn't do anything. It could either
> >> be blocked on something, though I find it quite surprising to see it in
> >> that state for the whole 1500s time period, or we are simply not low on
> >> free memory at all. That would point towards compaction-triggered memory
> >> reclaim, which accounts as direct reclaim as well. Direct compaction
> >> triggered more than once a second on average. We shouldn't really
> >> reclaim unless we are low on memory, but repeatedly failing compaction
> >> could just add up and reclaim a lot in the end. There seem to be quite
> >> a lot of low order requests as per your trace buffer.
>
> I realized that the fact that the slabs grew so large might be very
> relevant. It means a lot of unmovable pages, and while they are slowly
> being freed, the remaining ones are scattered all over the memory, making
> it impossible to compact successfully until the slabs are almost
> *completely* freed. It's in fact the theoretical worst case scenario for
> compaction and fragmentation avoidance. Next time it would be nice to
> also gather /proc/pagetypeinfo and /proc/slabinfo to see what grew so
> much there (probably dentries and inodes).

How would you like the results? As a job collecting those every 5 seconds,
from the 3 > drop_caches until the worst case (which may take 24 hours),
or only at some specific point in time?

Please note that I already provided them (see my response before) as a
one-time snapshot while being in the worst case:

cat /proc/pagetypeinfo   https://pastebin.com/W1sJscsZ
cat /proc/slabinfo       https://pastebin.com/9ZPU3q7X

> The question is why the problems happened only some time after the
> unmovable pollution. The trace showed me that the structure of the
> allocations wrt order+flags, as Michal breaks them down below, is not
> significantly different in the last phase than in the whole trace.
> Possibly the state of memory gradually changed so that the various
> heuristics (fragindex, pageblock skip bits etc.) resulted in compaction
> being tried more often than initially, eventually hitting a very bad
> corner case.
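If periodic snapshots would help to see that gradual change, something like
the following is what I had in mind with the collection job above. Just a
sketch: the interval and the output directory are my own guesses, and
reading /proc/slabinfo needs root.

#!/bin/sh
# snapshot the fragmentation/slab state periodically, so the gradual
# change towards the worst case can be followed afterwards
INTERVAL=5                    # seconds, as proposed above; adjust freely
OUT=/var/log/mem-snapshots    # arbitrary location
mkdir -p "$OUT"
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    cat /proc/pagetypeinfo > "$OUT/pagetypeinfo.$ts"
    cat /proc/slabinfo     > "$OUT/slabinfo.$ts"
    sleep "$INTERVAL"
done

I would start it right after the 3 > drop_caches and let it run until the
worst case shows up again.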
> >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
> >>   1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> >> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>  93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
> >> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
> >>     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
> >>
> >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> >> That leaves us with
> >>   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction.

Well yes, since this is shared hosting there are lots of users running lots
of scripts; perhaps 5-50 new forks and kills every second, depending on
load, it is hard to tell exactly.

> It also seems to be pgd allocation (2 pages due to PTI), not kernel stack?

Plain English, please? :)

> >>   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> I would again suspect those. IIRC we already confirmed earlier that the
> THP defrag setting is madvise or madvise+defer, and there are
> madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag
> to plain 'defer'?

Yes, I think I mentioned this before. AFAIK it did not make any (immediate)
difference; madvise is the current setting.

> and there are madvise(MADV_HUGEPAGE) using processes?

Can't tell you that..

> >> By far, the kernel stack allocations are in the lead. You can get some
> >> relief by enabling CONFIG_VMAP_STACK. There is also a notable number of
> >> THP page allocations. Just curious, are you running on a NUMA machine?
> >> If yes, [1] might be relevant. Other than that nothing really jumped at
> >> me.
> >
> > thanks a lot Vlastimil!
>
> And Michal :)
>
> > I would not really know whether this is NUMA; it is some usual server
> > running with an i7-8700 and ECC RAM. How would I find out?
>
> Please provide /proc/zoneinfo and we'll see.
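For reference, the node count can apparently also be checked directly,
assuming sysfs is mounted at /sys as usual; with only one node there, the
NUMA question should answer itself:

$ ls -d /sys/devices/system/node/node*                # one directory per NUMA node
$ grep '^Node' /proc/zoneinfo | awk '{print $2}' | tr -d ',' | sort -u    # node numbers seen by the allocator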
There you go:

cat /proc/zoneinfo   https://pastebin.com/RMTwtXGr

> > So I should do CONFIG_VMAP_STACK=y and try that..?
>
> I suspect you already have it.

Yes, true: the currently loaded kernel is built with =y there.
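In case anyone wants to double-check this on their side, the running
kernel's config can be inspected roughly like this; /boot/config-$(uname -r)
is the usual distro location, and /proc/config.gz only exists if the kernel
was built with CONFIG_IKCONFIG_PROC:

$ grep CONFIG_VMAP_STACK /boot/config-$(uname -r)
CONFIG_VMAP_STACK=y

$ zcat /proc/config.gz | grep CONFIG_VMAP_STACK       # if IKCONFIG_PROC is enabled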