On 10/30/18 5:08 PM, Marinko Catovic wrote:
>> One notable thing here is that there shouldn't be any reason to do the
>> direct reclaim when kswapd itself doesn't do anything. It could either be
>> blocked on something (though I find it quite surprising to see it in that
>> state for the whole 1500s time period) or we are simply not low on free
>> memory at all. That would point towards compaction-triggered memory
>> reclaim, which is accounted as direct reclaim as well. The direct
>> compaction triggered more than once a second on average. We shouldn't
>> really reclaim unless we are low on memory, but repeatedly failing
>> compaction could just add up and reclaim a lot in the end. There seem to
>> be quite a lot of low order requests as per your trace buffer

I realized that the fact that slabs grew so large might be very relevant.
It means a lot of unmovable pages, and while they are slowly being freed,
the remaining ones are scattered all over memory, making it impossible to
compact successfully until the slabs are almost *completely* freed. That is
in fact the theoretical worst case for compaction and fragmentation
avoidance. Next time it would be nice to also gather /proc/pagetypeinfo and
/proc/slabinfo to see what grew so much there (probably dentries and
inodes).

The question is why the problems happened only some time after the
unmovable pollution. The trace showed me that the structure of allocations
wrt order+flags, as Michal breaks them down below, is not significantly
different in the last phase than in the whole trace. Possibly the state of
memory gradually changed so that the various heuristics (fragindex,
pageblock skip bits etc.) resulted in compaction being tried more often
than initially, eventually hitting a very bad corner case.

>> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
>>    1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>>  783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>  797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>   93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
>>  498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>  243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>>
>> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
>> That leaves us with
>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO

I suspect there are lots of short-lived processes, so these are probably
rapidly recycled and not causing compaction. It also seems to be pgd
allocation (2 pages due to PTI), not kernel stacks?

>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

I would again suspect those. IIRC we already confirmed earlier that the THP
defrag setting is madvise or madvise+defer, and that there are processes
using madvise(MADV_HUGEPAGE)? Did you ever try changing defrag to plain
'defer'?

>> by and large the kernel stack allocations are in the lead. You can get
>> some relief by enabling CONFIG_VMAP_STACK. There is also a notable
>> number of THP page allocations. Just curious, are you running on a NUMA
>> machine? If yes, [1] might be relevant. Other than that nothing really
>> jumped out at me.

> thanks a lot Vlastimil! And Michal :)
> I would not really know whether this is a NUMA machine, it is some usual
> server running with an i7-8700 and ECC RAM. How would I find out?

Please provide /proc/zoneinfo and we'll see.

> So I should do CONFIG_VMAP_STACK=y and try that..?

I suspect you already have it.
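
In case it helps for the next round, here is a rough sketch of how to check
the above and gather the extra data from a shell. It is untested on your
setup; the /boot/config path depends on the distro, writing the defrag knob
needs root, and frag.log is just an example name:

  # NUMA or not: more than one node number in the zone headers means NUMA
  $ grep '^Node' /proc/zoneinfo | sort -u

  # is CONFIG_VMAP_STACK already enabled in the running kernel's config?
  $ grep VMAP_STACK /boot/config-$(uname -r)

  # current THP defrag mode, and switching to plain 'defer' for a test
  $ cat /sys/kernel/mm/transparent_hugepage/defrag
  $ echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

  # snapshot fragmentation and slab state once a minute while the problem
  # is happening (Ctrl-C to stop)
  $ while true; do date; cat /proc/pagetypeinfo /proc/slabinfo; sleep 60; done >> frag.log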