On Mon, Dec 11, 2023 at 06:27:14PM -0800, Yury Norov wrote:
> Add helpers around test_and_{set,clear}_bit() that allow searching for
> clear or set bits and flipping them atomically.
>
> The target patterns may look like this:
>
> 	for (idx = 0; idx < nbits; idx++)
> 		if (test_and_clear_bit(idx, bitmap))
> 			do_something(idx);
>
> Or like this:
>
> 	do {
> 		bit = find_first_bit(bitmap, nbits);
> 		if (bit >= nbits)
> 			return nbits;
> 	} while (!test_and_clear_bit(bit, bitmap));
> 	return bit;
>
> In both cases, the opencoded loop may be converted to a single function
> or iterator call. Correspondingly:
>
> 	for_each_test_and_clear_bit(idx, bitmap, nbits)
> 		do_something(idx);
>
> Or:
>
> 	return find_and_clear_bit(bitmap, nbits);
>
> Obviously, the less routine code people have to write themselves, the
> lower the probability of making a mistake.
>
> These are not only handy helpers; they also resolve a non-trivial issue
> of using the non-atomic find_bit() together with the atomic
> test_and_{set,clear}_bit().
>
> The trick is that find_bit() implies that the bitmap is a regular
> non-volatile piece of memory, so the compiler is allowed to use
> optimization techniques such as re-fetching memory instead of caching it.
>
> For example, find_first_bit() is implemented like this:
>
> 	for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {
> 		val = addr[idx];
> 		if (val) {
> 			sz = min(idx * BITS_PER_LONG + __ffs(val), sz);
> 			break;
> 		}
> 	}
>
> On register-memory architectures, like x86, the compiler may decide to
> access memory twice: first to compare the word against 0, and again to
> fetch its value to pass to __ffs().
>
> When running find_first_bit() on volatile memory, the memory may get
> changed in between, which may, for instance, lead to passing 0 to
> __ffs(), whose result is undefined. This is a potentially dangerous call.
>
> find_and_clear_bit(), as a wrapper around test_and_clear_bit(),
> naturally treats the underlying bitmap as volatile memory and prevents
> the compiler from applying such optimizations.
>
> KCSAN now catches exactly this type of situation and warns on concurrent
> memory modifications. We can use it to reveal improper usage of
> find_bit() and convert such usage to the atomic find_and_*_bit() as
> appropriate.
>
> In some cases, concurrent operations with plain find_bit() are acceptable.
> For example:
>
>  - two threads running find_*_bit(): safe wrt the ffs(0) issue, and
>    returns the correct value, because the underlying bitmap is unchanged;
>  - find_next_bit() in parallel with set_bit() or clear_bit(), when the
>    modified bit is prior to the start bit of the search: safe and correct;
>  - find_first_bit() in parallel with set_bit(): safe, but may return the
>    wrong bit number;
>  - find_first_zero_bit() in parallel with clear_bit(): same as above.
>
> In the last two cases, find_bit() may not return the correct bit number,
> but that may be OK if the caller requires any (not necessarily the first)
> set or clear bit, respectively.
>
> In such cases, KCSAN may be safely silenced with data_race(). But in most
> cases where KCSAN detects concurrency, people should carefully review
> their code and likely protect critical sections or switch to the atomic
> find_and_bit(), as appropriate.
>
> The 1st patch of the series adds the following atomic primitives:
>
> 	find_and_set_bit(addr, nbits);
> 	find_and_set_next_bit(addr, nbits, start);
> 	...
>
> Here, the find_and_{set,clear} part refers to the corresponding
> test_and_{set,clear}_bit function. Suffixes like _wrap or _lock derive
> their semantics from the corresponding find() or test() functions.
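>
> As an illustration, find_and_set_bit() can be thought of as a wrapper
> around the retry loop shown earlier, mirrored for the set case. This is
> a simplified sketch built from the existing find_first_zero_bit() and
> test_and_set_bit() helpers, not necessarily the exact implementation
> from the patch:
>
> 	/*
> 	 * Illustrative sketch: find a zero bit and set it atomically.
> 	 * Returns the bit number on success, or >= nbits if no zero
> 	 * bit was found.
> 	 */
> 	static inline unsigned long find_and_set_bit(volatile unsigned long *addr,
> 						     unsigned long nbits)
> 	{
> 		unsigned long bit;
>
> 		do {
> 			/* Search for a zero bit in the bitmap... */
> 			bit = find_first_zero_bit((const unsigned long *)addr, nbits);
> 			if (bit >= nbits)
> 				return nbits;
> 			/*
> 			 * ...and try to set it atomically. If another
> 			 * thread set it in between, test_and_set_bit()
> 			 * returns true, and we retry the search.
> 			 */
> 		} while (test_and_set_bit(bit, addr));
>
> 		return bit;
> 	}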
>
> For brevity, the naming omits the fact that we search for a zero bit in
> the find_and_set functions, and correspondingly search for a set bit in
> the find_and_clear functions.
>
> The patch also adds iterators with atomic semantics, like
> for_each_test_and_set_bit(). Here, the naming rule is to simply prefix
> the corresponding atomic operation with 'for_each'.
>
> In [1], Jan reported a 2% slowdown in a single-thread search test when
> switching the find_bit() functions to treat bitmaps as volatile arrays.
> On the other hand, the kernel robot in the same thread reported a +3.7%
> improvement in the will-it-scale.per_thread_ops test.
>
> Assuming that our compilers are sane and generate better code against
> properly annotated data, the above discrepancy is not surprising: when
> running on non-volatile bitmaps, plain find_bit() outperforms the atomic
> find_and_bit(), and vice versa.
>
> So, all users of the find_bit() API where heavy concurrency is expected
> are encouraged to switch to the atomic find_and_bit() as appropriate.
>
> The 1st patch of this series adds the atomic find_and_bit() API, the
> 2nd adds a basic test for the new API, and all the following patches
> spread it over the kernel.
>
> They can be applied separately from each other on a per-subsystem
> basis, or I can pull them into the bitmap tree, as appropriate.
>
> [1] https://lore.kernel.org/lkml/634f5fdf-e236-42cf-be8d-48a581c21660@xxxxxxxxxxxx/T/#m3e7341eb3571753f3acf8fe166f3fb5b2c12e615

Thank you all for reviews and comments. Now moving the series to
bitmap-for-next for testing.

Thanks,
Yury