Re: [PATCH v3 1/9] cpumask: Make cpumask_full() check for nr_cpu_ids bits

Yury Norov <yury.norov@xxxxxxxxx> · Sun, 28 Aug 2022 09:38:35 -0700

On Sun, Aug 28, 2022 at 10:35:38AM +0200, Sander Vanheule wrote:

 ...

> > It's really a puzzle, and some of my thoughts are below. So. 
> > 
> > This is a question what for we need nr_cpumask_bits while we already
> > have nr_cpu_ids. When OFFSTACK is ON, they are obviously the same.
> > When it's of - the nr_cpumask_bits is an alias to NR_CPUS.
> > 
> > I tried to wire the nr_cpumask_bits to nr_cpu_ids unconditionally, and
> > it works even when OFFSTACK is OFF, no surprises.
> > 
> > I didn't find any discussions describing what for we need nr_cpumask_bits,
> > and the code adding it dates to a very long ago.
> > 
> > If I alias nr_cpumask_bits to nr_cpu_ids unconditionally on my VM with
> > NR_CPUs == 256 and nr_cpu_ids == 4, there's obviously a clear win in
> > performance, but the Image size gets 2.5K bigger. Probably that's the
> > reason for what nr_cpumask_bits was needed...
> 
> I think it makes sense to have a compile-time-constant value for nr_cpumask_bits
> in some cases. For example on embedded platforms, where every opportunity to
> save a few kB should be used, or cases where NR_CPUS <= BITS_PER_LONG.
> 
> > 
> > There's also a very old misleading comment in cpumask.h:
> > 
> >  *  If HOTPLUG is enabled, then cpu_possible_mask is forced to have
> >  *  all NR_CPUS bits set, otherwise it is just the set of CPUs that
> >  *  ACPI reports present at boot.
> > 
> > It lies, and I checked with x86_64 that cpu_possible_mask is populated
> > during boot time with 0b1111, if I create a 4-cpu VM. Hence, the
> > nr_cpu_ids is 4, while NR_CPUS == 256.
> > 
> > Interestingly, there's no a single user of the cpumask_full(),
> > obviously, because it's broken. This is really a broken dead code.
> > 
> > Now that we have a test that checks sanity of cpumasks, this mess
> > popped up.
> > 
> > Your fix doesn't look correct, because it fixes one function, and
> > doesn't touch others. For example, the cpumask subset() may fail
> > if src1p will have set bits after nr_cpu_ids, while cpumask_full()
> > will be returning true.
> 
> It appears the documentation for cpumask_full() is also incorrect, because it
> claims to check if all CPUs < nr_cpu_ids are set. Meanwhile, the implementation
> checks if all CPUs < nr_cpumask_bits are set.
> 
> cpumask_weight() has a similar issue, and maybe also other cpumask_*() functions
> (I didn't check in detail yet).
> 
> > 
> > In -next, there is an update from Sander for the cpumask test that
> > removes this check, and probably if you rebase on top of -next, you
> > can drop this and 2nd patch of your series.
> > 
> > What about proper fix? I think that a long time ago we didn't have
> > ACPI tables for possible cpus, and didn't populate cpumask_possible
> > from that, so the
> > 
> >         #define nr_cpumask_bits NR_CPUS
> > 
> > worked well. Now that we have cpumask_possible partially filled,
> > we have to always
> > 
> >         #define nr_cpumask_bits nr_cpu_ids
> > 
> > and pay +2.5K price in size even if OFFSTACK is OFF. At least, it wins
> > at runtime...
> > 
> > Any thoughts?
> 
> It looks like both nr_cpumask_bits and nr_cpu_ids are used in a number of places
> outside of lib/cpumask.c. Documentation for cpumask_*() functions almost always
> refers to nr_cpu_ids as a highest valid value.
> 
> Perhaps nr_cpumask_bits should become an variable for internal cpumask usage,
> and external users should only use nr_cpu_ids? The changes in 6.0 are my first
> real interaction with cpumask, so it's possible that there are things I'm
> missing here.
> 
> That being said, some of the cpumask tests compare results to nr_cpumask_bits,
> so those should then probably be fixed to compare against nr_cpu_ids instead.

Aha, and it kills me how we have such a mess in a very core subsystem.

We have 3 problems here:
 - mess with nr_cpumask_bits and nr_cpu_ids;
 - ineffectiveness of cpumask routines when nr_cpumask_bits > nr_cpu_ids;
 - runtime nature of nr_cpu_ids, even for those embedded systems with
   taught memory constraints. So that if we just drop nr_cpumask_bits,
   it will add 2.5K to the Image.

I think that dropping nr_cpumask_bits is our only choice, and to avoid
Image bloating for embedded users, we can hint the kernel that NR_CPUS
is an exact number, so that it will skip setting it in runtime.

I added a EXACT_NR_CPUS option for this, which works like this:

  #if (NR_CPUS == 1) || defined(CONFIG_EXACT_NR_CPUS)
  #define nr_cpu_ids      ((unsigned int)NR_CPUS)
  #else
  extern unsigned int nr_cpu_ids;
  #endif

  /* Deprecated */ 
  #define nr_cpumask_bits nr_cpu_ids

I tried it with arm64 4-CPU build. When the EXACT_NR_CPUS is enabled,
the difference is:
  add/remove: 3/4 grow/shrink: 46/729 up/down: 652/-46952 (-46300)
  Total: Before=25670945, After=25624645, chg -0.18%

Looks quite impressive to me. I'll send a patch soon.

Thanks,
Yury