On Fri, Aug 26, 2022 at 6:17 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
> I wouldn't do this for regular test_bit because if you read memory with
> different size/alignment from what you wrote, various CPUs suffer from
> store->load forwarding penalties.

All of the half-way modern CPUs do ok with store->load forwarding as long as the load is fully contained in the store, so I suspect the 'testb' model is pretty much universally better than loading a word into a register and testing it there.

So narrowing the load is fine (but you generally never want to narrow a *store*, because that results in huge problems with subsequent wider loads).

But it's not a huge deal, and this way, if somebody actually runs the numbers and does any comparisons, we have both versions, and if the 'testb' is better, we can just rename the x86 constant_test_bit_acquire() to just constant_test_bit() and use it for both cases.

> But for test_bit_acquire this optimization is likely harmless because the
> bit will not be tested a few instructions after writing it.

Note that if we really do that, then we've already lost because of the volatile access, ie if we cared about a "write bit, test bit" pattern, we should use other operations.

Now, the new "const_test_bit()" logic (commits bb7379bfa680 "bitops: define const_*() versions of the non-atomics" and 0e862838f290 "bitops: unify non-atomic bitops prototypes across architectures") means that as long as you are setting and testing a bit in a local variable, it gets elided entirely.

But in general use you're going to see that load from memory, and then the wider load is likely worse (because of bigger constants, and because it requires a register).

So maybe there is room to tweak it further, but this version of the patch looks good to me, and I've applied it.

Thanks,

          Linus
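
[For reference, a minimal userspace sketch of the two load shapes being compared above: a byte-sized access that a compiler can turn into a 'testb'-style instruction, versus a full word load that needs a wider constant and a register. This is illustrative only, not the kernel's constant_test_bit()/constant_test_bit_acquire() code; the function names are made up for the example.]

	#include <stdbool.h>
	#include <limits.h>

	/* Narrow load: touch only the byte that holds bit 'nr' (the "testb" shape). */
	static inline bool test_bit_byte(unsigned long nr, const volatile void *addr)
	{
		const volatile unsigned char *p = addr;

		return (p[nr / 8] >> (nr % 8)) & 1;
	}

	/* Word load: read the whole unsigned long, then shift and mask in a register. */
	static inline bool test_bit_word(unsigned long nr, const volatile unsigned long *addr)
	{
		unsigned long bits_per_long = sizeof(unsigned long) * CHAR_BIT;

		return (addr[nr / bits_per_long] >> (nr % bits_per_long)) & 1;
	}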
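
[And a small sketch of the "local variable" case mentioned for the const_*() bitops: with a purely local bitmap and no volatile access, the compiler can constant-fold the whole set-then-test sequence. Again illustrative names only, not the kernel's const_test_bit() itself.]

	#include <stdbool.h>

	/* The compiler can fold this entire function to "return true", since the
	 * bitmap is a local variable and no volatile access is involved. */
	static bool set_then_test(void)
	{
		unsigned long bitmap = 0;

		bitmap |= 1UL << 5;		/* set bit 5 in a local variable */
		return (bitmap >> 5) & 1;	/* folds to a constant at compile time */
	}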