>> The only way how kmemcg limit could help I can think of would be to
>> enforce metadata reclaim much more often. But that is rather a bad
>> workaround.
>
>would that have some significant performance impact?
>I would be willing to try if you think the idea is not thaaat bad.
>If so, could you please explain what to do?
>
>> > > Because a lot of FS metadata is fragmenting the memory and a large
>> > > number of high order allocations which want to be served reclaim a lot
>> > > of memory to achieve their goal. Considering a large part of memory is
>> > > fragmented by unmovable objects there is no other way than to use
>> > > reclaim to release that memory.
>> >
>> > Well it looks like the fragmentation issue gets worse. Is that enough to
>> > consider merging the slab defrag patchset and get some work done on inodes
>> > and dentries to make them movable (or use targeted reclaim)?
>
>> Is there anything to test?
>
>Are you referring to some known issue there, possibly directly related to mine?
>If so, I would be willing to test that patchset, either once it makes it into the kernel.org sources
>or by patching it in manually.
>
>
>> Well, there are some drivers (mostly out-of-tree) which are high order
>> hungry. You can try to trace all allocations with order > 0 and
>> see who that might be.
>> # mount -t tracefs none /debug/trace/
>> # echo stacktrace > /debug/trace/trace_options
>> # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
>> # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
>> # cat /debug/trace/trace_pipe
>>
>> And later this to disable tracing.
>> # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
>
>I just had a major cache-useless situation, with like 100M/8G usage only
>and horrible performance. There you go:
>
>https://nofile.io/f/mmwVedaTFsd
>
>I think mysql occurs mostly, regardless of the binary name this is actually
>mariadb in version 10.1.
>
>> You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
>> should be sufficient to drop metadata only.
>
>that is exactly what I am doing. I already mentioned that `echo 1 >` does not
>make any difference at all; `echo 2 >` is the only thing that helps.
>just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily
>going up, as usual.
>
>
>2018-08-09 10:29 GMT+02:00 Marinko Catovic <marinko.catovic@xxxxxxxxx>:
>
>
>
> On Mon 06-08-18 15:37:14, Cristopher Lameter wrote:
> > On Mon, 6 Aug 2018, Michal Hocko wrote:
> >
> > > Because a lot of FS metadata is fragmenting the memory and a large
> > > number of high order allocations which want to be served reclaim a lot
> > > of memory to achieve their goal. Considering a large part of memory is
> > > fragmented by unmovable objects there is no other way than to use
> > > reclaim to release that memory.
> >
> > Well it looks like the fragmentation issue gets worse. Is that enough to
> > consider merging the slab defrag patchset and get some work done on inodes
> > and dentries to make them movable (or use targeted reclaim)?
>
> Is there anything to test?
> --
> Michal Hocko
> SUSE Labs
>
>
> > [Please do not top-post]
>
> like this?
>
> > The only way how kmemcg limit could help I can think of would be to
> > enforce metadata reclaim much more often. But that is rather a bad
> > workaround.
>
> would that have some significant performance impact?
> I would be willing to try if you think the idea is not thaaat bad.
> If so, could you please explain what to do?
>
> > > > Because a lot of FS metadata is fragmenting the memory and a large
> > > > number of high order allocations which want to be served reclaim a lot
> > > > > of memory to achieve their goal. Considering a large part of memory is
> > > > fragmented by unmovable objects there is no other way than to use
> > > > reclaim to release that memory.
> > >
> > > Well it looks like the fragmentation issue gets worse. Is that enough to
> > > consider merging the slab defrag patchset and get some work done on inodes
> > > > and dentries to make them movable (or use targeted reclaim)?
>
> > Is there anything to test?
>
> Are you referring to some known issue there, possibly directly related to mine?
> If so, I would be willing to test that patchset, either once it makes it into the kernel.org sources
> or by patching it in manually.
>
>
> > Well, there are some drivers (mostly out-of-tree) which are high order
> > hungry. You can try to trace all allocations with order > 0 and
> > see who that might be.
> > # mount -t tracefs none /debug/trace/
> > # echo stacktrace > /debug/trace/trace_options
> > # echo "order>0" > /debug/trace/events/kmem/mm_page_alloc/filter
> > # echo 1 > /debug/trace/events/kmem/mm_page_alloc/enable
> > # cat /debug/trace/trace_pipe
> >
> > And later this to disable tracing.
> > # echo 0 > /debug/trace/events/kmem/mm_page_alloc/enable
>
> I just had a major cache-useless situation, with like 100M/8G usage only
> and horrible performance. There you go:
>
> https://nofile.io/f/mmwVedaTFsd
>
> I think mysql occurs mostly, regardless of the binary name this is actually
> mariadb in version 10.1.
>
> > You do not have to drop all caches. echo 2 > /proc/sys/vm/drop_caches
> > should be sufficient to drop metadata only.
>
> that is exactly what I am doing. I already mentioned that `echo 1 >` does not
> make any difference at all; `echo 2 >` is the only thing that helps.
> just 5 minutes after doing that the usage grew to 2GB/10GB and is steadily
> going up, as usual.

Is there anything you can read from these results?
The issue keeps occurring; the latest occurrence was totally unexpected, in the morning hours,
causing downtime for the entire morning until noon, when I could check and drop the caches again.
I also switched mariadb from O_DIRECT to `fsync`, the new default in their latest release, hoping
that this would help, but it did not.
Before giving up totally, I'd like to know whether there is any solution for this. Again, I cannot
believe that I am the only one affected. This *has* to affect anyone with a similar use case;
I do not see what is so special about mine. This is simply many users with many files, so every
larger shared hosting provider should experience exactly the same behaviour with the 4.x kernel branch.
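
In case it helps with reading the trace: below is a rough sketch of how the mm_page_alloc output from trace_pipe could be tallied per task and order, to check whether mysqld really dominates. It assumes the usual ftrace text layout (`comm-pid [cpu] flags timestamp: mm_page_alloc: ... order=N ...`); the function name `tally_allocs` and the file name `trace.txt` are made up for this example.

```shell
# Count mm_page_alloc events per task name and allocation order.
# Reads ftrace text (e.g. a saved copy of trace_pipe) on stdin.
# Assumption: each event line starts with "comm-pid"; stack trace
# lines (" => function") do not contain "mm_page_alloc:" and are skipped.
tally_allocs() {
    awk '
        /mm_page_alloc:/ {
            task = $1
            sub(/-[0-9]+$/, "", task)      # strip the trailing -PID
            order = "?"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^order=/)
                    order = substr($i, 7)  # text after "order="
            count[task " order=" order]++
        }
        END {
            for (k in count)
                printf "%d %s\n", count[k], k
        }
    ' | sort -k1,1nr -k2
}
```

Usage would be something like `tally_allocs < trace.txt`, which prints one line per task/order pair, most frequent first. Task names containing spaces would be miscounted by this sketch.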