On Sun, Dec 03, 2023 at 11:23:47AM -0800, Yury Norov wrote: > Add helpers around test_and_{set,clear}_bit() that allow to search for > clear or set bits and flip them atomically. > > The target patterns may look like this: > > for (idx = 0; idx < nbits; idx++) > if (test_and_clear_bit(idx, bitmap)) > do_something(idx); > > Or like this: > > do { > bit = find_first_bit(bitmap, nbits); > if (bit >= nbits) > return nbits; > } while (!test_and_clear_bit(bit, bitmap)); > return bit; > > In both cases, the opencoded loop may be converted to a single function > or iterator call. Correspondingly: > > for_each_test_and_clear_bit(idx, bitmap, nbits) > do_something(idx); > > Or: > return find_and_clear_bit(bitmap, nbits); > > Obviously, the less routine code people have to write themself, the > less probability to make a mistake. > > Those are not only handy helpers but also resolve a non-trivial > issue of using non-atomic find_bit() together with atomic > test_and_{set,clear)_bit(). > > The trick is that find_bit() implies that the bitmap is a regular > non-volatile piece of memory, and compiler is allowed to use such > optimization techniques like re-fetching memory instead of caching it. > > For example, find_first_bit() is implemented like this: > > for (idx = 0; idx * BITS_PER_LONG < sz; idx++) { > val = addr[idx]; > if (val) { > sz = min(idx * BITS_PER_LONG + __ffs(val), sz); > break; > } > } > > On register-memory architectures, like x86, compiler may decide to > access memory twice - first time to compare against 0, and second time > to fetch its value to pass it to __ffs(). > > When running find_first_bit() on volatile memory, the memory may get > changed in-between, and for instance, it may lead to passing 0 to > __ffs(), which is undefined. This is a potentially dangerous call. > > find_and_clear_bit() as a wrapper around test_and_clear_bit() > naturally treats underlying bitmap as a volatile memory and prevents > compiler from such optimizations. > > Now that KCSAN is catching exactly this type of situations and warns on > undercover memory modifications. We can use it to reveal improper usage > of find_bit(), and convert it to atomic find_and_*_bit() as appropriate. > > The 1st patch of the series adds the following atomic primitives: > > find_and_set_bit(addr, nbits); > find_and_set_next_bit(addr, nbits, start); > ... > > Here find_and_{set,clear} part refers to the corresponding > test_and_{set,clear}_bit function. Suffixes like _wrap or _lock > derive their semantics from corresponding find() or test() functions. > > For brevity, the naming omits the fact that we search for zero bit in > find_and_set, and correspondingly search for set bit in find_and_clear > functions. > > The patch also adds iterators with atomic semantics, like > for_each_test_and_set_bit(). Here, the naming rule is to simply prefix > corresponding atomic operation with 'for_each'. > > This series is a result of discussion [1]. All find_bit() functions imply > exclusive access to the bitmaps. However, KCSAN reports quite a number > of warnings related to find_bit() API. Some of them are not pointing > to real bugs because in many situations people intentionally allow > concurrent bitmap operations. > > If so, find_bit() can be annotated such that KCSAN will ignore it: > > bit = data_race(find_first_bit(bitmap, nbits)); > > This series addresses the other important case where people really need > atomic find ops. As the following patches show, the resulting code > looks safer and more verbose comparing to opencoded loops followed by > atomic bit flips. > > In [1] Mirsad reported 2% slowdown in a single-thread search test when > switching find_bit() function to treat bitmaps as volatile arrays. On > the other hand, kernel robot in the same thread reported +3.7% to the > performance of will-it-scale.per_thread_ops test. > > Assuming that our compilers are sane and generate better code against > properly annotated data, the above discrepancy doesn't look weird. When > running on non-volatile bitmaps, plain find_bit() outperforms atomic > find_and_bit(), and vice-versa. ... In some cases the better improvements can be achieved by switching the (very) old code to utilise IDA framework. -- With Best Regards, Andy Shevchenko