On Fri, Aug 26, 2022 at 6:17 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
> I wouldn't do this for regular test_bit because if you read memory with
> different size/alignment from what you wrote, various CPUs suffer from
> store->load forwarding penalties.

All of the half-way modern CPUs do ok with store->load forwarding as long as the load is fully contained in the store, so I suspect the 'testb' model is pretty much universally better than loading a word into a register and testing it there.

So narrowing the load is fine (but you generally never want to narrow a *store*, because that results in huge problems with subsequent wider loads).

But it's not a huge deal, and this way, if somebody actually runs the numbers and does any comparisons, we have both versions, and if the 'testb' is better, we can just rename the x86 constant_test_bit_acquire() to just constant_test_bit() and use it for both cases.

> But for test_bit_acquire this optimization is likely harmless because the
> bit will not be tested a few instructions after writing it.

Note that if we really do that, then we've already lost because of the volatile access, ie if we cared about a "write bit, test bit" pattern, we should use other operations.

Now, the new "const_test_bit()" logic (commits bb7379bfa680 "bitops: define const_*() versions of the non-atomics" and 0e862838f290 "bitops: unify non-atomic bitops prototypes across architectures") means that as long as you are setting and testing a bit in a local variable, it gets elided entirely.

But in general use you're going to see that load from memory, and then the wider load is likely worse (because of bigger constants, and because it requires a register).

So maybe there is room to tweak it further, but this version of the patch looks good to me, and I've applied it.

Thanks,

          Linus
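
[For reference, a minimal userspace sketch of the two load shapes being compared above: a byte-sized access that a compiler can turn into a 'testb'-style instruction, versus a full word load that needs a wider constant and a register. This is illustrative only, not the kernel's constant_test_bit()/constant_test_bit_acquire() code; the function names are made up for the example.]

	#include <stdbool.h>
	#include <limits.h>

	/* Narrow load: touch only the byte that holds bit 'nr' (the "testb" shape). */
	static inline bool test_bit_byte(unsigned long nr, const volatile void *addr)
	{
		const volatile unsigned char *p = addr;

		return (p[nr / 8] >> (nr % 8)) & 1;
	}

	/* Word load: read the whole unsigned long, then shift and mask in a register. */
	static inline bool test_bit_word(unsigned long nr, const volatile unsigned long *addr)
	{
		unsigned long bits_per_long = sizeof(unsigned long) * CHAR_BIT;

		return (addr[nr / bits_per_long] >> (nr % bits_per_long)) & 1;
	}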
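
[And a small sketch of the "local variable" case mentioned for the const_*() bitops: with a purely local bitmap and no volatile access, the compiler can constant-fold the whole set-then-test sequence. Again illustrative names only, not the kernel's const_test_bit() itself.]

	#include <stdbool.h>

	/* The compiler can fold this entire function to "return true", since the
	 * bitmap is a local variable and no volatile access is involved. */
	static bool set_then_test(void)
	{
		unsigned long bitmap = 0;

		bitmap |= 1UL << 5;		/* set bit 5 in a local variable */
		return (bitmap >> 5) & 1;	/* folds to a constant at compile time */
	}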