On Thu, Nov 3, 2022 at 4:02 PM Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> wrote: > > Hi, > > On 03.11.2022 18:14, Shakeel Butt wrote: > > On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote: > >> On 24.10.2022 07:28, Shakeel Butt wrote: > >>> Currently mm_struct maintains rss_stats which are updated on page fault > >>> and the unmapping codepaths. For page fault codepath the updates are > >>> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > >>> The reason for caching is performance for multithreaded applications > >>> otherwise the rss_stats updates may become hotspot for such > >>> applications. > >>> > >>> However this optimization comes with the cost of error margin in the rss > >>> stats. The rss_stats for applications with large number of threads can > >>> be very skewed. At worst the error margin is (nr_threads * 64) and we > >>> have a lot of applications with 100s of threads, so the error margin can > >>> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > >>> > >>> Recently we started seeing the unbounded errors for rss_stats for > >>> specific applications which use TCP rx0cp. It seems like > >>> vm_insert_pages() codepath does not sync rss_stats at all. > >>> > >>> This patch converts the rss_stats into percpu_counter to convert the > >>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > >>> However this conversion enable us to get the accurate stats for > >>> situations where accuracy is more important than the cpu cost. Though > >>> this patch does not make such tradeoffs. > >>> > >>> Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx> > >> This patch landed recently in linux-next as commit d59f19a7a068 ("mm: > >> convert mm's rss stats into percpu_counter"). Unfortunately it causes a > >> regression on my test systems. I've noticed that it triggers a 'BUG: Bad > >> rss-counter state' warning from time to time for random processes. This > >> is somehow related to CPU hot-plug and/or system suspend/resume. The > >> easiest way to reproduce this issue (although not always) on my test > >> systems (ARM or ARM64 based) is to run the following commands: > >> > >> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 > >> >$i/online; > >> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1 > >> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2 > >> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15 > >> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2 > >> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15 > >> > >> Let me know if I can help debugging this somehow or testing a fix. > >> > > Hi Marek, > > > > Thanks for the report. It seems like there is a race between > > for_each_online_cpu() in __percpu_counter_sum() and > > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for > > percpu_counter users but for check_mm() is not happy with this race. Can > > you please try the following patch: > > > > > > From: Shakeel Butt <shakeelb@xxxxxxxxxx> > > Date: Thu, 3 Nov 2022 06:05:13 +0000 > > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum > > interface > > > > percpu_counter_sum can race with cpu offlining. Add a new interface > > which does not race with it and use that for check_mm(). > > --- > > include/linux/percpu_counter.h | 11 +++++++++++ > > kernel/fork.c | 2 +- > > lib/percpu_counter.c | 24 ++++++++++++++++++------ > > 3 files changed, 30 insertions(+), 7 deletions(-) > > > Yes, this seems to fix the issue I've reported. Feel free to add: > > Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> > > Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> > > Thanks a lot Marek. I will send out a formal patch later with your reported-by and tested-by tags.