On Tue, Apr 04, 2023 at 02:54:25PM +0800, yebin (H) wrote:
> 
> 
> On 2023/4/4 10:50, Yury Norov wrote:
> > On Tue, Apr 04, 2023 at 09:42:06AM +0800, Ye Bin wrote:
> > > From: Ye Bin <yebin10@xxxxxxxxxx>
> > > 
> > > In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> > > condition between a cpu dying and percpu_counter_sum() iterating online CPUs
> > > was identified.
> > > Actually, there's the same race condition between a cpu dying and
> > > __percpu_counter_compare(). Here, use 'num_online_cpus()' for quick judgment.
> > > But 'num_online_cpus()' will be decreased before call 'percpu_counter_cpu_dead()',
> > > then maybe return incorrect result.
> > > To solve above issue, also need to add dying CPUs count when do quick judgment
> > > in __percpu_counter_compare().
> > Not sure I completely understood the race you are describing. All CPU
> > accounting is protected with percpu_counters_lock. Is it a real race
> > that you've faced, or hypothetical? If it's real, can you share stack
> > traces?
> > > Signed-off-by: Ye Bin <yebin10@xxxxxxxxxx>
> > > ---
> > >   lib/percpu_counter.c | 11 ++++++++++-
> > >   1 file changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> > > index 5004463c4f9f..399840cb0012 100644
> > > --- a/lib/percpu_counter.c
> > > +++ b/lib/percpu_counter.c
> > > @@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
> > >   	return 0;
> > >   }
> > > +static __always_inline unsigned int num_count_cpus(void)
> > This doesn't look like a good name. Maybe num_offline_cpus?
> > 
> > > +{
> > > +#ifdef CONFIG_HOTPLUG_CPU
> > > +       return (num_online_cpus() + num_dying_cpus());
> >                  ^                                    ^
> > 'return' is not a function. Braces are not needed
> > 
> > Generally speaking, a sequence of atomic operations is not an atomic
> > operation, so the above doesn't look correct. I don't think that it
> > would be possible to implement raceless accounting based on 2 separate
> > counters.
> Yes, there is indeed a concurrency issue with doing so here. But I saw that
> the process was first set up dying_mask and then reduce the number of
> online CPUs. The total quantity maybe is larger than the actual value and
> may fall back to a slow path. But this won't cause any problems.

This sounds like an implementation detail. If it changes in the future,
your accounting will break. If you think this behavior is consistent and
will be preserved, then it must be properly commented in your patch.

Thanks,
Yury
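
To make the ordering argument above concrete, here is a minimal userspace
sketch (not kernel code) of why the sum of two separately updated counters
depends on the order of the updates. All names in it (online_cpus,
dying_cpus, count_cpus, and the two offline_* helpers) are hypothetical
stand-ins for illustration; they are not the kernel's num_online_cpus() or
num_dying_cpus(), and the "racing reader" is simulated deterministically by
sampling the sum between the two updates rather than with real threads.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the two counters discussed in the thread. */
static atomic_int online_cpus;
static atomic_int dying_cpus;

static void reset_counts(int online)
{
    atomic_store(&online_cpus, online);
    atomic_store(&dying_cpus, 0);
}

/* Two atomic loads: each is atomic, but the pair is not. */
static int count_cpus(void)
{
    return atomic_load(&online_cpus) + atomic_load(&dying_cpus);
}

/*
 * Ordering described in the reply: mark the CPU dying first, then drop
 * the online count. Returns what a reader would see inside the window.
 */
static int offline_dying_first(void)
{
    atomic_fetch_add(&dying_cpus, 1);
    int seen = count_cpus();           /* reader samples mid-transition */
    atomic_fetch_sub(&online_cpus, 1);
    return seen;
}

/* Assumed reversed ordering, to show what breaks if the order changes. */
static int offline_online_first(void)
{
    atomic_fetch_sub(&online_cpus, 1);
    int seen = count_cpus();           /* reader samples mid-transition */
    atomic_fetch_add(&dying_cpus, 1);
    return seen;
}

static void demo(void)
{
    reset_counts(4);
    assert(offline_dying_first() == 5);   /* transient overestimate */
    reset_counts(4);
    assert(offline_online_first() == 3);  /* transient underestimate */
}
```

With four CPUs to account for, the dying-first ordering lets a reader see 5
in the window, an overestimate that would only push a caller toward the slow
path. The reversed ordering lets a reader see 3, which is the kind of
undercount the patch is trying to avoid. That is why the result relies on
the update order staying as it is.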