On Wed, 29 Aug 2018 at 18:44, Marinko Catovic <marinko.catovic@xxxxxxxxx> wrote:
>
>> > one host is at a healthy state right now, I'd run that over there immediately.
>>
>> Let's see what we can get from here.
>
> Oh well, that went fast. Even with low values for buffers (around 100MB) and
> caches at around 20G, performance was nevertheless extremely low, and I really
> had to drop the caches right away. This is the first time I have seen it happen
> with caches >10G, but hopefully that also provides a clue for you.
>
> Just after starting the stats I switched defrag back from defer to madvise - I
> suspect this somehow triggered the rapid reaction, since a few minutes later I
> saw the free RAM jump from 5GB to 10GB. After that I went afk, and only returned
> to the PC when my monitoring systems went crazy reporting downtime.
>
> If you think that changing /sys/kernel/mm/transparent_hugepage/defrag back to
> its default, after it had been on defer for days, was a mistake, then please
> tell me.
>
> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz

There we go again. First of all: I had set up this monitoring on one host, and as
a matter of fact the issue did not occur on that single host for days and weeks,
so I set it up again on all the hosts - and it just happened again on another one.
This issue is far from over, even after upgrading to the latest 4.18.12.

https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

Please note: the trace_pipe is quite big, but it covers the full path from fully
used RAM to unused RAM within just ~24 hours. The measurements were started right
after echo 3 > drop_caches and stopped once the RAM was unused, i.e. freed again
by another echo 3 at the end.

This issue has been around for about half a year now; any suggestions, hints or
solutions are greatly appreciated. Again, I cannot possibly be the only one
experiencing this - I may just be among the few who actually notice it and who
really suffer from very poor performance with lots of I/O on cache/buffers.

I would also like to ask for a workaround until this is fixed someday:
echo 3 > drop_caches can take a very long time when the host is busy with I/O in
the background. According to some resources on the net, dropping caches keeps
iterating until some lower threshold is reached, which becomes less and less
likely the busier the host is. Could someone point out which threshold this is?
I was thinking of, e.g., mm/vmscan.c:

void drop_slab_node(int nid)
{
	unsigned long freed;

	do {
		struct mem_cgroup *memcg = NULL;

		freed = 0;
		do {
			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
	} while (freed > 10);
}

Would it make sense to raise that threshold from > 10 to, for example, > 100?
I could easily adjust this, or any other relevant threshold, since I compile the
kernel in use myself. I would just like drop_caches to be able to finish, as a
workaround until this issue is fixed - as mentioned, it can take hours on a busy
host, and during that time the host effectively hangs (performance is very low),
since buffers/caches are not used while drop_caches is being set to 3, until the
freeing has finished.
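
To make the idea concrete, here is a minimal sketch of the change I have in mind,
against my own tree - the value 100 is an arbitrary guess on my part, not a tuned
or tested number:

/* mm/vmscan.c - sketch only, not a submitted patch */
void drop_slab_node(int nid)
{
	unsigned long freed;

	do {
		struct mem_cgroup *memcg = NULL;

		freed = 0;
		/* sum up what the shrinkers free across all memcgs on this node */
		do {
			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
	/*
	 * Was: freed > 10. With 100, the loop bails out as soon as a full
	 * pass frees 100 objects or fewer, so concurrent I/O that keeps
	 * repopulating the slab caches can no longer keep drop_caches
	 * spinning for hours.
	 */
	} while (freed > 100);
}

The downside, as far as I can tell, would be that somewhat more slab objects may
survive the drop - which would be perfectly acceptable for my purposes.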