Re: [Bug 190841] New: [REGRESSION] Intensive Memory CGroup removal leads to high load average 10+

Vladyslav Frolov <frolvlad@xxxxxxxxx> · Thu, 12 Jan 2017 15:55:59 +0200

Indeed, `cgroup.memory=nokmem` works around the high load average on
all the kernels!

4.10rc2 kernel without `cgroup.memory=nokmem` behaves much better than
4.7-4.9 kernels, yet it still reaches LA ~6 using my reproduction
script, while LA <=1.0 is expected. 4.10rc2 feels like 4.6, which I
described as "seminormal".

Running the reproduction script 3000 times gives the following results:

* 4.4 kernel takes 13 seconds to complete and LA <= 1.0
* 4.6-4.10rc2 kernels with `cgroup.memory=nokmem'` also takes 13
seconds to complete and LA <= 1.0
* 4.6 kernel takes 25 seconds to complete and LA ~= 5
* 4.7-4.9 kernels take 6-9 minutes (yes, 25-40 times slower than with
`nokmem`) to complete and LA > 20
* 4.10rc2 kernel takes 60 seconds (4 times slower than with `nokmem`)
to complete and LA ~= 6

On 6 January 2017 at 18:28, Vladimir Davydov <vdavydov@xxxxxxxxxxxxx> wrote:
> Hello,
>
> The issue does look like kmemcg related - see below.
>
> On Wed, Jan 04, 2017 at 05:30:37PM -0800, Andrew Morton wrote:
>
>> > * Ubuntu 4.4.0-57 kernel works fine
>> > * Mainline 4.4.39 and below seem to work just fine -
>> > https://youtu.be/tGD6sfwa-3c
>
> kmemcg is disabled
>
>> > * Mainline 4.6.7 kernel behaves seminormal, load average is higher than on 4.4,
>> > but not as bad as on 4.7+ - https://youtu.be/-CyhmkkPbKE
>
> 4.6+
>
> b313aeee25098 mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy
>
> kmemcg is enabled by default for all cgroups, which introduces extra
> overhead to memcg destruction path
>
>> > * Mainline 4.7.0-rc1 kernel is the first kernel after 4.6.7 that is available
>> > in binaries, so I chose to test it and it doesn't play nicely -
>> > https://youtu.be/C_J5es74Ars
>
> 4.7+
>
> 81ae6d03952c1 mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()
>
> kick_all_cpus_sync(), which was used for synchronizing slub cache
> destruction before this commit, turns out to be too disruptive on big
> SMP machines as it generates a lot of IPIs, so it is replaced with more
> lightweight synchronize_sched(). The latter, however, blocks cgroup
> rmdir under the slab_mutex for relatively long, resulting in higher load
> average as well as stalling other processes trying to create or destroy
> a kmem cache.
>
>> > * Mainline 4.9.0 kernel still doesn't play nicely -
>> > https://youtu.be/_o17U5x3bmY
>
> The above-mentioned issue is still unfixed.
>
>> >
>> > OTHER NOTES:
>> > 1. Using VirtualBox I have noticed that this bug only reproducible when I have
>> > 2+ CPU cores!
>
> synchronize_sched() is a no-op on UP machines, which explains why on a
> UP machine the problems goes away.
>
> If I'm correct, the issue must have been fixed in 4.10, which is yet to
> be released:
>
> 89e364db71fb5 slub: move synchronize_sched out of slab_mutex on shrink
>
> You can workaround it on older kernels by turning kmem accounting off.
> To do that, append 'cgroup.memory=nokmem' to the kernel command line.
> Alternatively, you can try to recompile the kernel choosing SLAB as the
> slab allocator, because only SLUB is affected IIRC.
>
> FWIW I tried the script you provided in a 4 CPU VM running 4.10-rc2 and
> didn't notice any significant stalls or latency spikes. Could you please
> check if this kernel fixes your problem? If it does it might be worth
> submitting the patch to stable..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>