On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote:
> Hi
>
> On 24.10.2022 07:28, Shakeel Butt wrote:
> > Currently mm_struct maintains rss_stats which are updated on page fault
> > and the unmapping codepaths. For the page fault codepath the updates are
> > cached per thread, with a batch of TASK_RSS_EVENTS_THRESH, which is 64.
> > The reason for the caching is performance for multithreaded applications;
> > otherwise the rss_stats updates may become a hotspot for such
> > applications.
> >
> > However, this optimization comes at the cost of an error margin in the
> > rss stats. The rss_stats for applications with a large number of threads
> > can be very skewed. At worst the error margin is (nr_threads * 64), and
> > we have a lot of applications with 100s of threads, so the error margin
> > can be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
> >
> > Recently we started seeing unbounded errors in the rss_stats for
> > specific applications which use TCP rx0cp. It seems like the
> > vm_insert_pages() codepath does not sync rss_stats at all.
> >
> > This patch converts the rss_stats into percpu_counter, which reduces the
> > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
> > Moreover, this conversion enables us to get accurate stats in situations
> > where accuracy is more important than the cpu cost, though this patch
> > does not make that tradeoff.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
>
> This patch landed recently in linux-next as commit d59f19a7a068 ("mm:
> convert mm's rss stats into percpu_counter"). Unfortunately it causes a
> regression on my test systems. I've noticed that it triggers a 'BUG: Bad
> rss-counter state' warning from time to time for random processes. This
> is somehow related to CPU hot-plug and/or system suspend/resume. The
> easiest way to reproduce this issue (although not always) on my test
> systems (ARM or ARM64 based) is to run the following commands:
>
> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online;
> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1
> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2
> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15
> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2
> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15
>
> Let me know if I can help debugging this somehow or testing a fix.
>

Hi Marek,

Thanks for the report. It seems like there is a race between
for_each_online_cpu() in __percpu_counter_sum() and
percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for
percpu_counter users, but check_mm() is not happy with it.
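To make the window concrete, here is a minimal userspace C sketch of the
race (illustrative names only: global_count, pcpu[] and online[] stand in
for fbc->count, the per-cpu slots and cpu_online_mask; this is not kernel
code):

#include <stdio.h>

#define NR_SLOTS 4

static long global_count;                    /* fbc->count analogue */
static long pcpu[NR_SLOTS];                  /* per-cpu counter slots */
static int online[NR_SLOTS] = { 1, 1, 1, 1 };

/* __percpu_counter_sum() analogue: walks only "online" slots */
static long sum_online(void)
{
	long ret = global_count;

	for (int cpu = 0; cpu < NR_SLOTS; cpu++)
		if (online[cpu])
			ret += pcpu[cpu];
	return ret;
}

/* analogue of the percpu_counter_sum_all() proposed below: every slot */
static long sum_all(void)
{
	long ret = global_count;

	for (int cpu = 0; cpu < NR_SLOTS; cpu++)
		ret += pcpu[cpu];
	return ret;
}

int main(void)
{
	pcpu[0] = 1;	/* page mapped, counted on CPU 0 */
	pcpu[3] = -1;	/* same page unmapped on CPU 3: true total is 0 */

	/*
	 * CPU 3 drops out of the online mask, but the hotplug callback
	 * (percpu_counter_cpu_dead() in the kernel) has not folded its
	 * slot into global_count yet.
	 */
	online[3] = 0;

	printf("online-mask sum: %ld\n", sum_online());	/* 1: bogus residue */
	printf("all-slots sum:   %ld\n", sum_all());	/* 0: correct */

	/* once the callback runs, the slot is folded and zeroed */
	global_count += pcpu[3];
	pcpu[3] = 0;
	printf("after fold:      %ld\n", sum_online());	/* 0: correct again */
	return 0;
}

The nonzero val:N residues in the report above look like exactly this kind
of not-yet-folded delta.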
Can you please try the following patch:

From: Shakeel Butt <shakeelb@xxxxxxxxxx>
Date: Thu, 3 Nov 2022 06:05:13 +0000
Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum interface

percpu_counter_sum can race with cpu offlining. Add a new interface
which does not race with it and use that for check_mm().
---
 include/linux/percpu_counter.h | 11 +++++++++++
 kernel/fork.c                  |  2 +-
 lib/percpu_counter.c           | 24 ++++++++++++++++++------
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index bde6c4c1f405..3070c1043acf 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
+s64 __percpu_counter_sum_all(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
 void percpu_counter_sync(struct percpu_counter *fbc);
 
@@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 	return __percpu_counter_sum(fbc);
 }
 
+static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum_all(fbc);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -193,6 +199,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 	return percpu_counter_read(fbc);
 }
 
+static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc)
+{
+	return percpu_counter_read(fbc);
+}
+
 static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
 {
 	return true;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9c32f593ef11..7d6f510cf397 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = percpu_counter_sum_all(&mm->rss_stat[i]);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ed610b75dc32..f26a1a5df399 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -117,11 +117,8 @@ void percpu_counter_sync(struct percpu_counter *fbc)
 }
 EXPORT_SYMBOL(percpu_counter_sync);
 
-/*
- * Add up all the per-cpu counts, return the result. This is a more accurate
- * but much slower version of percpu_counter_read_positive()
- */
-s64 __percpu_counter_sum(struct percpu_counter *fbc)
+static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc,
+				     const struct cpumask *cpu_mask)
 {
 	s64 ret;
 	int cpu;
@@ -129,15 +126,30 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 
 	raw_spin_lock_irqsave(&fbc->lock, flags);
 	ret = fbc->count;
-	for_each_online_cpu(cpu) {
+	for_each_cpu(cpu, cpu_mask) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		ret += *pcount;
 	}
 	raw_spin_unlock_irqrestore(&fbc->lock, flags);
 	return ret;
 }
+
+/*
+ * Add up all the per-cpu counts, return the result. This is a more accurate
+ * but much slower version of percpu_counter_read_positive()
+ */
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum_mask(fbc, cpu_online_mask);
+}
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+s64 __percpu_counter_sum_all(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum_mask(fbc, cpu_possible_mask);
+}
+EXPORT_SYMBOL(__percpu_counter_sum_all);
+
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp,
 			  struct lock_class_key *key)
 {
-- 
2.38.1.431.g37b22c650d-goog
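A note on why summing over cpu_possible_mask is expected to be safe here:
percpu_counter_cpu_dead() folds a dead CPU's slot into fbc->count and then
zeroes the slot, and slots of never-onlined CPUs start out zeroed. So an
offline CPU's slot either contributes zero (the callback already ran) or
holds exactly the pending delta that the online-mask walk misses (the
callback has not run yet); either way the sum is complete. The only cost
is walking nr_possible rather than nr_online CPUs, which should be
acceptable in an exit-time sanity check. With the patch applied,
re-running the hotplug loop from the report above should no longer
trigger the 'BUG: Bad rss-counter state' warnings.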