On Tue, Oct 30, 2018 at 6:03 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 10/30/18 5:08 PM, Marinko Catovic wrote:
> >> One notable thing here is that there shouldn't be any reason to do
> >> direct reclaim when kswapd itself doesn't do anything. It could either
> >> be blocked on something, though I find it quite surprising to see it in
> >> that state for the whole 1500s time period, or we are simply not low on
> >> free memory at all. That would point towards compaction-triggered memory
> >> reclaim, which accounts as direct reclaim as well. Direct compaction
> >> triggered more than once a second on average. We shouldn't really
> >> reclaim unless we are low on memory, but repeatedly failing compaction
> >> could just add up and reclaim a lot in the end. There seem to be quite
> >> a lot of low order requests as per your trace buffer.
>
> I realized that the fact that the slabs grew so large might be very
> relevant. It means a lot of unmovable pages, and while they are slowly
> being freed, the remaining ones are scattered all over the memory, making
> it impossible to compact successfully until the slabs are almost
> *completely* freed. It's in fact the theoretical worst case scenario for
> compaction and fragmentation avoidance. Next time it would be nice to
> also gather /proc/pagetypeinfo and /proc/slabinfo to see what grew so
> much there (probably dentries and inodes).

How would you like the results? As a job collecting those every 5 seconds,
from the 3 > drop_caches until the worst case (which may take 24 hours),
or only at some specific point in time?

Please note that I already provided them (see my response before) as a
one-time snapshot while being in the worst case:

cat /proc/pagetypeinfo   https://pastebin.com/W1sJscsZ
cat /proc/slabinfo       https://pastebin.com/9ZPU3q7X

> The question is why the problems happened only some time after the
> unmovable pollution. The trace showed me that the structure of the
> allocations wrt order+flags, as Michal breaks them down below, is not
> significantly different in the last phase than in the whole trace.
> Possibly the state of memory gradually changed so that the various
> heuristics (fragindex, pageblock skip bits etc.) resulted in compaction
> being tried more often than initially, eventually hitting a very bad
> corner case.
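If periodic snapshots would help to see that gradual change, something like
the following is what I had in mind with the collection job above. Just a
sketch: the interval and the output directory are my own guesses, and
reading /proc/slabinfo needs root.

#!/bin/sh
# snapshot the fragmentation/slab state periodically, so the gradual
# change towards the worst case can be followed afterwards
INTERVAL=5                    # seconds, as proposed above; adjust freely
OUT=/var/log/mem-snapshots    # arbitrary location
mkdir -p "$OUT"
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    cat /proc/pagetypeinfo > "$OUT/pagetypeinfo.$ts"
    cat /proc/slabinfo     > "$OUT/slabinfo.$ts"
    sleep "$INTERVAL"
done

I would start it right after the 3 > drop_caches and let it run until the
worst case shows up again.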
> >> $ grep order trace-last-phase | sed 's@.*\(order=[0-9]*\).*gfp_flags=\(.*\)@\1 \2@' | sort | uniq -c
> >>   1238 order=1 __GFP_HIGH|__GFP_ATOMIC|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
> >> 783055 order=1 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >> 797255 order=2 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >>  93524 order=3 GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC
> >> 498148 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
> >> 243563 order=3 GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP
> >>     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
> >>
> >> We can safely rule out NOWAIT and ATOMIC because those do not reclaim.
> >> That leaves us with
> >>   5812 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>    121 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>     22 order=1 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >> 395910 order=1 GFP_KERNEL_ACCOUNT|__GFP_ZERO
>
> I suspect there are lots of short-lived processes, so these are probably
> rapidly recycled and not causing compaction.

Well yes, since this is shared hosting there are lots of users running lots
of scripts; perhaps 5-50 new forks and kills every second, depending on
load, it is hard to tell exactly.

> It also seems to be pgd allocation (2 pages due to PTI), not kernel stack?

Plain English, please? :)

> >>   1060 order=1 __GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_THISNODE
> >>   3278 order=2 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE
> >>     10 order=4 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>    114 order=7 __GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE
> >>  67621 order=9 GFP_TRANSHUGE|__GFP_THISNODE
>
> I would again suspect those. IIRC we already confirmed earlier that the
> THP defrag setting is madvise or madvise+defer, and there are
> madvise(MADV_HUGEPAGE) using processes? Did you ever try changing defrag
> to plain 'defer'?

Yes, I think I mentioned this before. AFAIK it did not make any (immediate)
difference; madvise is the current setting.

> and there are madvise(MADV_HUGEPAGE) using processes?

Can't tell you that..

> >> By far, the kernel stack allocations are in the lead. You can get some
> >> relief by enabling CONFIG_VMAP_STACK. There is also a notable number of
> >> THP page allocations. Just curious, are you running on a NUMA machine?
> >> If yes, [1] might be relevant. Other than that nothing really jumped at
> >> me.
> >
> > thanks a lot Vlastimil!
>
> And Michal :)
>
> > I would not really know whether this is NUMA; it is some usual server
> > running with an i7-8700 and ECC RAM. How would I find out?
>
> Please provide /proc/zoneinfo and we'll see.
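For reference, the node count can apparently also be checked directly,
assuming sysfs is mounted at /sys as usual; with only one node there, the
NUMA question should answer itself:

$ ls -d /sys/devices/system/node/node*                # one directory per NUMA node
$ grep '^Node' /proc/zoneinfo | awk '{print $2}' | tr -d ',' | sort -u    # node numbers seen by the allocator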
There you go:

cat /proc/zoneinfo   https://pastebin.com/RMTwtXGr

> > So I should do CONFIG_VMAP_STACK=y and try that..?
>
> I suspect you already have it.

Yes, true: the currently loaded kernel is built with =y there.
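In case anyone wants to double-check this on their side, the running
kernel's config can be inspected roughly like this; /boot/config-$(uname -r)
is the usual distro location, and /proc/config.gz only exists if the kernel
was built with CONFIG_IKCONFIG_PROC:

$ grep CONFIG_VMAP_STACK /boot/config-$(uname -r)
CONFIG_VMAP_STACK=y

$ zcat /proc/config.gz | grep CONFIG_VMAP_STACK       # if IKCONFIG_PROC is enabled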