On Fri, Dec 23, 2022 at 10:44:07AM -0800, Linus Torvalds wrote:
> On Thu, Dec 15, 2022 at 3:11 PM Yury Norov <yury.norov@xxxxxxxxx> wrote:
> >
> > Please pull bitmap patches for v6.2. They spent in -next for more than
> > a week without any issues. The branch consists of:
>
> So I've been holding off on this because these bitmap pulls have
> always scared me, and I wanted to have the time to actually look
> through them in detail before pulling.
>
> I'm back home, over the travel chaos, and while I have other pulls
> pending, they seem to be benign fixes so I started looking at this.
>
> And when looking at it, I did indeed finx what I think is a
> fundamental arithmetic bug.
>
> That small_const_nbits_off() is simply buggy.
>
> Try this:
>
>         small_const_nbits_off(64,-1);
>
> and see it return true.

Hi Linus,

Sorry for the delayed reply.

small_const_nbits{_off}() is used only for the bitmap and find_bit
functions, where both offset and size are unsigned types. A -1 there
turns into UINT_MAX or ULONG_MAX, and small_const_nbits_off(64, -1)
returns false.

The bitops.h functions (set_bit et al.) also use an unsigned type for
'nr'. Negative offsets are not used in bit operations at the most basic
level. Notice that the '(nbits) > 0' part in small_const_nbits() is
there to exclude 0, not negative numbers.

So support for negative offsets/sizes looks irrelevant for all existing
users of small_const_nbits(), and I doubt there will be new non-bitmap
users of the macro anytime soon.

small_const_nbits() and the proposed small_const_nbits_off() are in fact
very bitmap-related macros. 'Small' in this context refers to a
single-word bitmap. 0 is definitely a small and constant number, but
small_const_nbits(0) returns false - only because the inline versions of
the bitmap functions don't handle 0 correctly.
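For reference, here is the existing macro from
include/asm-generic/bitsperlong.h together with a tiny user-space sketch
of the unsigned-argument point. The BITS_PER_LONG stand-in and the test
harness are only for illustration, not kernel code:

#include <limits.h>
#include <stdio.h>

/* user-space stand-in for the kernel's BITS_PER_LONG */
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* as defined in include/asm-generic/bitsperlong.h */
#define small_const_nbits(nbits) \
        (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

/* builds with gcc or clang (uses __builtin_constant_p) */
int main(void)
{
        /*
         * In the bitmap/find_bit API both size and offset are unsigned,
         * so a -1 arrives as ULONG_MAX and fails the "<= BITS_PER_LONG"
         * check instead of being treated as a small constant.
         */
        printf("%d\n", small_const_nbits((unsigned long)-1)); /* prints 0 */
        printf("%d\n", small_const_nbits(64));                /* prints 1 */

        return 0;
}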
I think small_const_nbits() confuses people because it sounds too
generic and lives in a very generic place. There's a historical reason
for that. Originally the macro was hosted in include/linux/bitmap.h, and
at that time find.h lived in include/asm-generic/bitops/. In commit
586eaebea5988 ("lib: extend the scope of small_const_nbits() macro") I
moved the macro to include/asm-generic/bitsperlong.h to optimize the
find_bit functions too. After that, while working on other parts, I
found that having bitmap.h and find.h in different include paths is a
permanent headache due to things like circular dependencies, so I moved
find.h to include/linux, where it should be, and even made it an
internal header for bitmap.h. But I didn't move small_const_nbits().
Looks like I have to move it somewhere in include/linux/bitops.h.

[...]

> So convince me not only that the optimizations are obviously correct,
> but also that they actually matter.

There are no existing users of small_const_nbits_off(). I've been
reworking bitmap_find_free_region(), added a pile of tests, and found
that some quite trivial cases are not inlined, for example
find_next_bit(addr, 128, 124). Let's ignore this patch unless we have
real users.

Regarding the rest of the series - can you please take a look? It
includes an optimization for CPU allocation. With quite a simple rework
of cpumask_local_spread() we gain a measurable and significant
improvement for many subsystems on NUMA machines. Tariq measured the
impact of NUMA-based locality on networking in his environment; from
his commit message:

Performance tests:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with
aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
|                         | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline                | 52.3      | 6.4 %            | 17.9 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
+-------------------------+-----------+------------------+------------------+

The bottleneck on the RX side is released: we reach line rate (~1.8x
speedup), with ~30% less CPU utilization on the TX side.

We'd really like to have this work in the next kernel release.

Thanks,
Yury