On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@xxxxxxxxx> wrote:
>
> On reviewing the documentation and code for smp_load_acquire() it
> occurred to me that implementing something similar for CPU <-> device
> interaction would be worthwhile. This commit provides just the
> load/read side of this in the form of read_acquire().

So I don't hate the concept, but there are a couple of reasons to think
this is broken.

One is just the name. Why do we have "smp_load_acquire()", but then call
the non-smp version "read_acquire()"? That makes very little sense to me.
Why did "load" become "read"?

The other is more of a technical issue. Namely the fact that being
*device* ordered is completely and totally different from being *CPU*
ordered. All sane modern architectures do memory ordering as part of the
cache coherency protocol. But part of that is that it actually requires
all the CPUs to follow said cache coherency protocol. And bus-master
devices don't necessarily follow the same ordering rules.

Yes, any sane DMA will be cache-coherent, although sadly the insane kind
still exists. But even when DMA is cache _coherent_, that does not
necessarily mean that the coherency follows the *ordering* guarantees.

Now, in practice, I think that DMA tends to be more strictly ordered than
CPU memory ordering is, and the above all "just works". But I'd really
want a lot of acks from architecture maintainers. Particularly PowerPC
and ARM64. I'm not 100% sure that "smp_load_acquire()" will necessarily
order the read wrt DMA traffic. PPC in particular has some really odd IO
ordering rules, but I *think* all the problems come up with just MMIO,
not with DMA.

But we do have a very real difference between "smp_rmb()" (inter-CPU
cache coherency read barrier) and "rmb()" (full memory barrier that
synchronizes with IO), and your patch is very confused about this. In
*some* places you use "rmb()", and in other places you just use
"smp_load_acquire()".
Have you done extensive verification to check that this is actually ok?
Because the performance difference you quote very much seems to come from
your x86 testing now skipping the IO-synchronizing "rmb()", and depending
on DMA being ordered even without it.

And I'm pretty sure that's actually fine on x86. The real IO-synchronizing
rmb() (which translates into an lfence) is only needed when you have
uncached accesses (ie MMIO) on x86. So I don't think your code is wrong, I
just want to verify that everybody understands the issues.

I'm not even sure DMA can ever really have weaker memory ordering (I
really don't see how you'd be able to do a read barrier without DMA stores
being ordered natively), so maybe I worry too much, but the ppc people in
particular should look at this, because the ppc memory ordering rules and
serialization are some completely odd ad-hoc black magic... But anything
with non-cache-coherent DMA is obviously very suspect too.

                  Linus