On Fri, Dec 23, 2022 at 10:44:07AM -0800, Linus Torvalds wrote:
> On Thu, Dec 15, 2022 at 3:11 PM Yury Norov <yury.norov@xxxxxxxxx> wrote:
> >
> > Please pull bitmap patches for v6.2. They spent in -next for more than
> > a week without any issues. The branch consists of:
>
> So I've been holding off on this because these bitmap pulls have
> always scared me, and I wanted to have the time to actually look
> through them in detail before pulling.
>
> I'm back home, over the travel chaos, and while I have other pulls
> pending, they seem to be benign fixes so I started looking at this.
>
> And when looking at it, I did indeed finx what I think is a
> fundamental arithmetic bug.
>
> That small_const_nbits_off() is simply buggy.
>
> Try this:
>
>         small_const_nbits_off(64,-1);
>
> and see it return true.

Hi Linus,

Sorry for the delayed reply.

small_const_nbits{_off}() is used only for the bitmap and find_bit
functions, where both offset and size are unsigned types. A -1 there
turns into UINT_MAX or ULONG_MAX, and small_const_nbits_off(64, -1)
returns false.

The bitops.h functions (set_bit et al.) also use an unsigned type for
'nr'. Negative offsets are not used in bit operations at the most basic
level. Notice that the '(nbits) > 0' part in small_const_nbits() is
there to exclude 0, not negative numbers.

So support for negative offsets/sizes looks irrelevant for all existing
users of small_const_nbits(), and I doubt there will be new non-bitmap
users of the macro anytime soon.

small_const_nbits() and the proposed small_const_nbits_off() are in fact
very bitmap-related macros. 'Small' in this context refers to a
single-word bitmap. 0 is definitely a small and constant number, but
small_const_nbits(0) returns false - only because the inline versions of
the bitmap functions don't handle 0 correctly.
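For reference, here is the existing macro from
include/asm-generic/bitsperlong.h together with a tiny user-space sketch
of the unsigned-argument point. The BITS_PER_LONG stand-in and the test
harness are only for illustration, not kernel code:

#include <limits.h>
#include <stdio.h>

/* user-space stand-in for the kernel's BITS_PER_LONG */
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* as defined in include/asm-generic/bitsperlong.h */
#define small_const_nbits(nbits) \
        (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

/* builds with gcc or clang (uses __builtin_constant_p) */
int main(void)
{
        /*
         * In the bitmap/find_bit API both size and offset are unsigned,
         * so a -1 arrives as ULONG_MAX and fails the "<= BITS_PER_LONG"
         * check instead of being treated as a small constant.
         */
        printf("%d\n", small_const_nbits((unsigned long)-1)); /* prints 0 */
        printf("%d\n", small_const_nbits(64));                /* prints 1 */

        return 0;
}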
I think small_const_nbits() confuses people because it sounds too
generic and lives in a very generic place. There's a historical reason
for that. Originally the macro was hosted in include/linux/bitmap.h, and
at that time find.h lived in include/asm-generic/bitops/. In commit
586eaebea5988 ("lib: extend the scope of small_const_nbits() macro") I
moved the macro to include/asm-generic/bitsperlong.h to optimize the
find_bit functions too. After that, while working on other parts, I
found that having bitmap.h and find.h in different include paths is a
permanent headache due to things like circular dependencies, so I moved
find.h to include/linux, where it should be, and even made it an
internal header for bitmap.h. But I didn't move small_const_nbits().
Looks like I have to move it somewhere in include/linux/bitops.h.

[...]

> So convince me not only that the optimizations are obviously correct,
> but also that they actually matter.

There are no existing users of small_const_nbits_off(). I've been
reworking bitmap_find_free_region(), added a pile of tests, and found
that some quite trivial cases are not inlined, for example
find_next_bit(addr, 128, 124). Let's ignore this patch unless we have
real users.

Regarding the rest of the series - can you please take a look? It
includes an optimization for CPU allocation. With quite a simple rework
of cpumask_local_spread() we gain a measurable and significant
improvement for many subsystems on NUMA machines. Tariq measured the
impact of NUMA-based locality on networking in his environment; from
his commit message:

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with
aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

The bottleneck on the RX side is released: we reach line rate (~1.8x
speedup), with ~30% less CPU utilization on the TX side.

We'd really like to have this work in the next kernel release.

Thanks,
Yury