Hi,

On 03.11.2022 18:14, Shakeel Butt wrote:
> On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote:
>> On 24.10.2022 07:28, Shakeel Butt wrote:
>>> Currently mm_struct maintains rss_stats which are updated on the page
>>> fault and unmapping codepaths. For the page fault codepath the updates
>>> are cached per thread, in batches of TASK_RSS_EVENTS_THRESH, which is 64.
>>> The reason for caching is performance for multithreaded applications;
>>> otherwise the rss_stats updates may become a hotspot for such
>>> applications.
>>>
>>> However, this optimization comes at the cost of an error margin in the
>>> rss stats. The rss_stats for applications with a large number of threads
>>> can be very skewed. At worst the error margin is (nr_threads * 64), and
>>> we have a lot of applications with 100s of threads, so the error margin
>>> can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
>>>
>>> Recently we started seeing unbounded errors in rss_stats for specific
>>> applications which use TCP rx0cp. It seems like the vm_insert_pages()
>>> codepath does not sync rss_stats at all.
>>>
>>> This patch converts the rss_stats into percpu_counter, which changes the
>>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
>>> Moreover, this conversion enables us to get accurate stats in situations
>>> where accuracy is more important than the cpu cost, though this patch
>>> does not make such tradeoffs.
>>>
>>> Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
>>
>> This patch landed recently in linux-next as commit d59f19a7a068 ("mm:
>> convert mm's rss stats into percpu_counter"). Unfortunately it causes a
>> regression on my test systems. I've noticed that it triggers a 'BUG: Bad
>> rss-counter state' warning from time to time for random processes. This
>> is somehow related to CPU hot-plug and/or system suspend/resume. The
>> easiest way to reproduce this issue (although not always) on my test
>> systems (ARM or ARM64 based) is to run the following commands:
>>
>> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online;
>> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1
>> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2
>> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15
>> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2
>> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15
>>
>> Let me know if I can help debugging this somehow or testing a fix.
>>
> Hi Marek,
>
> Thanks for the report. It seems like there is a race between
> for_each_online_cpu() in __percpu_counter_sum() and
> percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
> percpu_counter users, but check_mm() is not happy with it. Can you
> please try the following patch:
>
>
> From: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Date: Thu, 3 Nov 2022 06:05:13 +0000
> Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum interface
>
> percpu_counter_sum can race with cpu offlining. Add a new interface
> which does not race with it and use that for check_mm().
> ---
>  include/linux/percpu_counter.h | 11 +++++++++++
>  kernel/fork.c                  |  2 +-
>  lib/percpu_counter.c           | 24 ++++++++++++++++++------
>  3 files changed, 30 insertions(+), 7 deletions(-)

Yes, this seems to fix the issue I've reported.
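For what it's worth, my reading of the window (from the code only, I have not
confirmed it with a trace) is that the dying CPU is cleared from
cpu_online_mask before percpu_counter_cpu_dead() (the CPUHP_PERCPU_CNT_DEAD
teardown callback, if I read the hotplug ordering right) folds its per-cpu
delta back into fbc->count, so an online-mask sum running in between misses
that delta, roughly:

    CPU going offline                     CPU running check_mm()
    -----------------                     ----------------------
    cleared from cpu_online_mask
                                          __percpu_counter_sum()
                                            for_each_online_cpu() skips the
                                            dying CPU's not-yet-folded delta
    percpu_counter_cpu_dead()
      folds that delta into fbc->count
      (too late for the sum above)

Summing over cpu_possible_mask, as your __percpu_counter_sum_all() does,
always visits that per-cpu slot, so the transient delta is not lost.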
Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>

> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index bde6c4c1f405..3070c1043acf 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
>  			      s32 batch);
>  s64 __percpu_counter_sum(struct percpu_counter *fbc);
> +s64 __percpu_counter_sum_all(struct percpu_counter *fbc);
>  int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
>  void percpu_counter_sync(struct percpu_counter *fbc);
>
> @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>  	return __percpu_counter_sum(fbc);
>  }
>
> +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_all(fbc);
> +}
> +
>  static inline s64 percpu_counter_read(struct percpu_counter *fbc)
>  {
>  	return fbc->count;
> @@ -193,6 +199,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>  	return percpu_counter_read(fbc);
>  }
>
> +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return percpu_counter_read(fbc);
> +}
> +
>  static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
>  {
>  	return true;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9c32f593ef11..7d6f510cf397 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm)
>  			 "Please make sure 'struct resident_page_types[]' is updated as well");
>
>  	for (i = 0; i < NR_MM_COUNTERS; i++) {
> -		long x = percpu_counter_sum(&mm->rss_stat[i]);
> +		long x = percpu_counter_sum_all(&mm->rss_stat[i]);
>
>  		if (unlikely(x))
>  			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index ed610b75dc32..f26a1a5df399 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -117,11 +117,8 @@ void percpu_counter_sync(struct percpu_counter *fbc)
>  }
>  EXPORT_SYMBOL(percpu_counter_sync);
>
> -/*
> - * Add up all the per-cpu counts, return the result.  This is a more accurate
> - * but much slower version of percpu_counter_read_positive()
> - */
> -s64 __percpu_counter_sum(struct percpu_counter *fbc)
> +static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
> +				     const struct cpumask *cpu_mask)
>  {
>  	s64 ret;
>  	int cpu;
> @@ -129,15 +126,30 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
>
>  	raw_spin_lock_irqsave(&fbc->lock, flags);
>  	ret = fbc->count;
> -	for_each_online_cpu(cpu) {
> +	for_each_cpu(cpu, cpu_mask) {
>  		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
>  		ret += *pcount;
>  	}
>  	raw_spin_unlock_irqrestore(&fbc->lock, flags);
>  	return ret;
>  }
> +
> +/*
> + * Add up all the per-cpu counts, return the result.  This is a more accurate
> + * but much slower version of percpu_counter_read_positive()
> + */
> +s64 __percpu_counter_sum(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_mask(fbc, cpu_online_mask);
> +}
>  EXPORT_SYMBOL(__percpu_counter_sum);
>
> +s64 __percpu_counter_sum_all(struct percpu_counter *fbc)
> +{
> +	return __percpu_counter_sum_mask(fbc, cpu_possible_mask);
> +}
> +EXPORT_SYMBOL(__percpu_counter_sum_all);
> +
>  int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
>  			   struct lock_class_key *key)
>  {

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland