Hi,

On 16.07.2021 17:14, Shakeel Butt wrote:
> Hi Marek
>
> On Fri, Jul 16, 2021 at 8:03 AM Marek Szyprowski
> <m.szyprowski@xxxxxxxxxxx> wrote:
>> Hi,
>>
>> On 14.07.2021 03:39, Shakeel Butt wrote:
>>> At the moment memcg stats are read in four contexts:
>>>
>>> 1. memcg stat user interfaces
>>> 2. dirty throttling
>>> 3. page fault
>>> 4. memory reclaim
>>>
>>> Currently the kernel flushes the stats for the first two cases. Flushing
>>> the stats for the remaining two cases may have a performance impact.
>>> Always flushing the memcg stats on the page fault code path may
>>> negatively impact the performance of applications. In addition, flushing
>>> in the memory reclaim code path, though treated as a slowpath, can become
>>> a source of contention for the global lock taken for stat flushing,
>>> because when the system or a memcg is under memory pressure, many tasks
>>> may enter the reclaim path.
>>>
>>> This patch uses the following mechanisms to solve these challenges:
>>>
>>> 1. Periodically flush the stats from root memcg every 2 seconds. This
>>> will time limit the out of sync stats.
>>>
>>> 2. Asynchronously flush the stats after a fixed number of stat updates.
>>> In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for
>>> 2 seconds.
>>>
>>> 3. To avoid a thundering herd of flushers, particularly from the memory
>>> reclaim context, introduce a memcg local spinlock and let only one
>>> flusher be active at a time. This could have been done through
>>> cgroup_rstat_lock, but that lock is used by other subsystems and by
>>> userspace reading memcg stats. So it is better to keep the flushers
>>> introduced by this patch decoupled from cgroup_rstat_lock.
>>>
>>> Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
>>
>> This patch landed in today's linux-next (next-20210716) as commit
>> 42265e014ac7 ("memcg: infrastructure to flush memcg stats"). On my test
>> systems I found that it triggers a kernel BUG on all ARM64 boards:
>>
>> BUG: sleeping function called from invalid context at kernel/cgroup/rstat.c:200
>> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 7, name: kworker/u8:0
>> 3 locks held by kworker/u8:0/7:
>>  #0: ffff00004000c938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x200/0x718
>>  #1: ffff80001334bdd0 ((stats_flush_dwork).work){+.+.}-{0:0}, at: process_one_work+0x200/0x718
>>  #2: ffff8000124f6d40 (stats_flush_lock){+.+.}-{2:2}, at: mem_cgroup_flush_stats+0x20/0x48
>> CPU: 2 PID: 7 Comm: kworker/u8:0 Tainted: G W 5.14.0-rc1+ #3713
>> Hardware name: Raspberry Pi 4 Model B (DT)
>> Workqueue: events_unbound flush_memcg_stats_dwork
>> Call trace:
>>  dump_backtrace+0x0/0x1d0
>>  show_stack+0x14/0x20
>>  dump_stack_lvl+0x88/0xb0
>>  dump_stack+0x14/0x2c
>>  ___might_sleep+0x1dc/0x200
>>  __might_sleep+0x4c/0x88
>>  cgroup_rstat_flush+0x2c/0x58
>>  mem_cgroup_flush_stats+0x34/0x48
>>  flush_memcg_stats_dwork+0xc/0x38
>>  process_one_work+0x2a8/0x718
>>  worker_thread+0x48/0x460
>>  kthread+0x12c/0x160
>>  ret_from_fork+0x10/0x18
>>
>> This can also be reproduced with QEMU. Please let me know if I can help
>> with fixing this issue.
>>
> Thanks for the report. The issue can be fixed by changing
> cgroup_rstat_flush() to cgroup_rstat_flush_irqsafe() in
> mem_cgroup_flush_stats(). I will send out the updated patch in a
> couple of hours after a bit more testing.

Right, this fixes the issue on my test systems. Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>

to the fixup patch if the target kernel tree won't be rebased and the
original patch (42265e014ac7) stays.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
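
For reference, mechanisms 2 and 3 from the quoted patch description (flush after a fixed number of updates, and allow only one flusher at a time) can be sketched roughly as below. This is a userspace analogy in C11, not the kernel code: every name in it (FLUSH_BATCH, stat_update(), try_flush_stats(), the counters) is hypothetical, the batch size of 64 is an arbitrary assumption, and modelling the memcg-local spinlock as a non-blocking try-lock is also an assumption. Because the flusher never blocks while holding the lock, the sketch also hints at why calling a function that may sleep under such a lock triggers the BUG reported above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical batch size; the real value used by the patch is not assumed here. */
#define FLUSH_BATCH 64
static_assert(FLUSH_BATCH > 0, "batch size must be positive");

/* Stands in for the memcg-local spinlock: set means a flusher is active. */
static atomic_flag flush_lock = ATOMIC_FLAG_INIT;

static atomic_long pending_delta;       /* updates not yet folded into the total */
static atomic_long updates_since_flush; /* number of updates since the last flush */
static atomic_long flushed_total;       /* the aggregated, "flushed" value */
static atomic_long flush_count;         /* how many flushes actually ran */

/*
 * Fold pending updates into the total. Returns false without doing any
 * work if another flusher already holds the lock, so callers never pile
 * up waiting for it (no thundering herd).
 */
bool try_flush_stats(void)
{
    if (atomic_flag_test_and_set_explicit(&flush_lock, memory_order_acquire))
        return false; /* someone else is flushing; skip */

    atomic_store(&updates_since_flush, 0);
    long delta = atomic_exchange(&pending_delta, 0);
    atomic_fetch_add(&flushed_total, delta);
    atomic_fetch_add(&flush_count, 1);

    atomic_flag_clear_explicit(&flush_lock, memory_order_release);
    return true;
}

/* Record one stat update and trigger a flush once the batch is full. */
void stat_update(long delta)
{
    atomic_fetch_add(&pending_delta, delta);
    long updates = atomic_fetch_add(&updates_since_flush, 1) + 1;
    if (updates >= FLUSH_BATCH)
        try_flush_stats();
}
```

With these assumed names, the first FLUSH_BATCH - 1 calls to stat_update() leave flushed_total untouched, and the next call folds everything in. A caller that finds the lock taken simply returns, which is what keeps many tasks under memory pressure from contending on the flush path.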