On 10/30/18 5:08 PM, Marinko Catovic wrote:
>> One notable thing here is that there shouldn't be any reason to do the
>> direct reclaim when kswapd itself doesn't do anything. It could either be
>> blocked on something (though I find it quite surprising to see it in that
>> state for the whole 1500s time period) or we are simply not low on free
>> memory at all. That would point towards compaction-triggered memory
>> reclaim, which is accounted as direct reclaim as well. The direct
>> compaction triggered more than once a second on average. We shouldn't
>> really reclaim unless we are low on memory, but repeatedly failing
>> compaction could just add up and reclaim a lot in the end. There seem to
>> be quite a lot of low order requests as per your trace buffer

I realized that the fact that slabs grew so large might be very relevant.
It means a lot of unmovable pages, and while they are slowly being freed,
the remaining ones are scattered all over memory, making it impossible to
compact successfully until the slabs are almost *completely* freed. That is
in fact the theoretical worst case for compaction and fragmentation
avoidance. Next time it would be nice to also gather /proc/pagetypeinfo and
/proc/slabinfo to see what grew so much there (probably dentries and
inodes).

The question is why the problems happened only some time after the
unmovable pollution. The trace showed me that the structure of allocations
wrt order+flags, as Michal breaks them down below, is not significantly
different in the last phase than in the whole trace. Possibly the state of
memory gradually changed so that the various heuristics (fragindex,
pageblock skip bits etc.) resulted in compaction being tried more often
than initially, eventually hitting a very bad corner case.

>> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
>>    1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>>  783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>  797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>   93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
>>  498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
>>  243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>>
>> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
>> That leaves us with
>>    5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>     121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>      22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>  395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO

I suspect there are lots of short-lived processes, so these are probably
rapidly recycled and not causing compaction. It also seems to be pgd
allocation (2 pages due to PTI), not kernel stacks?

>>    1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
>>    3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
>>      10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>     114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
>>   67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE

I would again suspect those. IIRC we already confirmed earlier that the THP
defrag setting is madvise or madvise+defer, and that there are processes
using madvise(MADV_HUGEPAGE)? Did you ever try changing defrag to plain
'defer'?

>> by and large the kernel stack allocations are in the lead. You can get
>> some relief by enabling CONFIG_VMAP_STACK. There is also a notable
>> number of THP page allocations. Just curious, are you running on a NUMA
>> machine? If yes, [1] might be relevant. Other than that nothing really
>> jumped out at me.

> thanks a lot Vlastimil! And Michal :)
> I would not really know whether this is a NUMA machine, it is some usual
> server running with an i7-8700 and ECC RAM. How would I find out?

Please provide /proc/zoneinfo and we'll see.

> So I should do CONFIG_VMAP_STACK=y and try that..?

I suspect you already have it.
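
In case it helps for the next round, here is a rough sketch of how to check
the above and gather the extra data from a shell. It is untested on your
setup; the /boot/config path depends on the distro, writing the defrag knob
needs root, and frag.log is just an example name:

  # NUMA or not: more than one node number in the zone headers means NUMA
  $ grep '^Node' /proc/zoneinfo | sort -u

  # is CONFIG_VMAP_STACK already enabled in the running kernel's config?
  $ grep VMAP_STACK /boot/config-$(uname -r)

  # current THP defrag mode, and switching to plain 'defer' for a test
  $ cat /sys/kernel/mm/transparent_hugepage/defrag
  $ echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

  # snapshot fragmentation and slab state once a minute while the problem
  # is happening (Ctrl-C to stop)
  $ while true; do date; cat /proc/pagetypeinfo /proc/slabinfo; sleep 60; done >> frag.log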