On Wed, 29 Aug 2018 at 18:44, Marinko Catovic <marinko.catovic@xxxxxxxxx> wrote:
>
>> > one host is at a healthy state right now, I'd run that over there immediately.
>>
>> Let's see what we can get from here.
>
> Oh well, that went fast. Even with low values for buffers (around 100MB) and
> caches at around 20G, performance was nevertheless extremely low, and I really
> had to drop the caches right away. This is the first time I have seen it happen
> with caches >10G, but hopefully that also provides a clue for you.
>
> Just after starting the stats I switched defrag back from defer to madvise - I
> suspect this somehow triggered the rapid reaction, since a few minutes later I
> saw the free RAM jump from 5GB to 10GB. After that I went afk, and only returned
> to the PC when my monitoring systems went crazy reporting downtime.
>
> If you think that changing /sys/kernel/mm/transparent_hugepage/defrag back to
> its default, after it had been on defer for days, was a mistake, then please
> tell me.
>
> here you go: https://nofile.io/f/VqRg644AT01/vmstat.tar.gz
> trace_pipe: https://nofile.io/f/wFShvZScpvn/trace_pipe.gz

There we go again. First of all: I had set up this monitoring on one host, and as
a matter of fact the issue did not occur on that single host for days and weeks,
so I set it up again on all the hosts - and it just happened again on another one.
This issue is far from over, even after upgrading to the latest 4.18.12.

https://nofile.io/f/z2KeNwJSMDj/vmstat-2.zip
https://nofile.io/f/5ezPUkFWtnx/trace_pipe-2.gz

Please note: the trace_pipe is quite big, but it covers the full path from fully
used RAM to unused RAM within just ~24 hours. The measurements were started right
after echo 3 > drop_caches and stopped once the RAM was unused, i.e. freed again
by another echo 3 at the end.

This issue has been around for about half a year now; any suggestions, hints or
solutions are greatly appreciated. Again, I cannot possibly be the only one
experiencing this - I may just be among the few who actually notice it and who
really suffer from very poor performance with lots of I/O on cache/buffers.

I would also like to ask for a workaround until this is fixed someday:
echo 3 > drop_caches can take a very long time when the host is busy with I/O in
the background. According to some resources on the net, dropping caches keeps
iterating until some lower threshold is reached, which becomes less and less
likely the busier the host is. Could someone point out which threshold this is?
I was thinking of, e.g., mm/vmscan.c:

void drop_slab_node(int nid)
{
	unsigned long freed;

	do {
		struct mem_cgroup *memcg = NULL;

		freed = 0;
		do {
			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
	} while (freed > 10);
}

Would it make sense to raise that threshold from > 10 to, for example, > 100?
I could easily adjust this, or any other relevant threshold, since I compile the
kernel in use myself. I would just like drop_caches to be able to finish, as a
workaround until this issue is fixed - as mentioned, it can take hours on a busy
host, and during that time the host effectively hangs (performance is very low),
since buffers/caches are not used while drop_caches is being set to 3, until the
freeing has finished.
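
To make the idea concrete, here is a minimal sketch of the change I have in mind,
against my own tree - the value 100 is an arbitrary guess on my part, not a tuned
or tested number:

/* mm/vmscan.c - sketch only, not a submitted patch */
void drop_slab_node(int nid)
{
	unsigned long freed;

	do {
		struct mem_cgroup *memcg = NULL;

		freed = 0;
		/* sum up what the shrinkers free across all memcgs on this node */
		do {
			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
	/*
	 * Was: freed > 10. With 100, the loop bails out as soon as a full
	 * pass frees 100 objects or fewer, so concurrent I/O that keeps
	 * repopulating the slab caches can no longer keep drop_caches
	 * spinning for hours.
	 */
	} while (freed > 100);
}

The downside, as far as I can tell, would be that somewhat more slab objects may
survive the drop - which would be perfectly acceptable for my purposes.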