Re: I/O and memory barriers

Pei Lin wrote:
2010/6/7 luca ellero <lroluk@xxxxxxxxx>:
Thanks again for your reply; anyway, I'm still confused. See inline comments.

Pei Lin wrote:
2010/5/31 luca ellero <lroluk@xxxxxxxxx>:

Pei Lin wrote:

2010/5/17 luca ellero <lroluk@xxxxxxxxx>:


Hi list,
I have some (maybe stupid) questions which I can't answer even after reading lots of documentation.
Suppose I have a PCI device which has some I/O registers mapped to memory (here I mean accesses are made through memory, not I/O space).
As far as I know, the right way to access them is through functions such as iowrite8 and friends:

spin_lock(Q);
iowrite8(some_address, ADDR);
iowrite8(some_data, DATA);
spin_unlock(Q);

My questions are:

1) Do I need a write memory barrier (wmb) between the two iowrite8 calls?
I think I need it, because I've read the implementation of iowrite8 and (in kernel 2.6.30.6) it expands to:

void iowrite8(u8 val, void *addr)
{
	do {
		unsigned long port = (unsigned long)addr;
		if (port >= 0x40000UL) {
			writeb(val, addr);
		} else if (port > 0x10000UL) {
			port &= 0x0ffffUL;
			outb(val, port);
		} else
			bad_io_access(port, "outb(val,port)");
	} while (0);
}

where writeb is:

static inline void writeb(unsigned char val, volatile void *addr)
{
	asm volatile("movb %0,%1" :
		: "q" (val), "m" (*(volatile unsigned char *)addr)
		: "memory");
}

which contains only a compiler barrier (the "memory" clobber in the asm statement) but no CPU barrier. So, without wmb(), the CPU could reorder the two iowrite8 calls, with disastrous effects. Am I right?


2) Do I need mmiowb() before spin_unlock()?
The documentation about mmiowb() really confuses me, so any explanation of its use is welcome.


See the documentation, which explains it clearly:
http://lxr.linux.no/linux+v2.6.27.46/Documentation/memory-barriers.txt

LOCKS VS I/O ACCESSES
---------------------

Under certain circumstances (especially involving NUMA), I/O accesses within
two spinlocked sections on two different CPUs may be seen as interleaved by the
PCI bridge, because the PCI bridge does not necessarily participate in the
cache-coherence protocol, and is therefore incapable of issuing the required
read memory barriers.

For example:

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	writel(1, DATA);
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					writel(5, DATA);
					spin_unlock(Q);

may be seen by the PCI bridge as follows:

	STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5

which would probably cause the hardware to malfunction.

What is necessary here is to intervene with an mmiowb() before dropping the
spinlock, for example:

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	writel(1, DATA);
	mmiowb();
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					writel(5, DATA);
					mmiowb();
					spin_unlock(Q);

this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
before either of the stores issued on CPU 2.

Furthermore, following a store by a load from the same device obviates the need
for the mmiowb(), because the load forces the store to complete before the load
is performed:

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	a = readl(DATA);
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					b = readl(DATA);
					spin_unlock(Q);

See Documentation/DocBook/deviceiobook.tmpl for more information.


Thanks for your reply.
I've already read the documentation; anyway, what surprises me is the fact that mmiowb() (at least on x86) is defined as a compiler barrier (barrier()) and nothing else. I would expect it to do something more than that: some specific PCI command, or at least a dummy read from some PCI register (since a read forces the store to complete).

As for MIPS, it is defined as:

/* Depends on MIPS II instruction set */
#define mmiowb() asm volatile ("sync" ::: "memory")

For x86:

#define mb()    asm volatile("mfence":::"memory")
#define rmb()   asm volatile("lfence":::"memory")
#define wmb()   asm volatile("sfence" ::: "memory")

For x86, the mfence/lfence/sfence instructions are used to guarantee it.

That's not true. I confirm my previous assertion: on x86, mmiowb() doesn't use any mfence/lfence/sfence; it's only a compiler barrier:

Look at the e-mail I provided:
"Now, on x86, the CPU actually tends to order IO writes *more* than it
orders any other writes (they are mostly entirely synchronous, unless the
area has been marked as write merging), but at least on PPC, it's the
other way around: without the cache as a serialization entry, you end up
having a totally separate queue to serialize, and a regular-memory write
barrier does nothing at all to the IO queue."
So on x86, mmiowb() is only defined as asm volatile ("" ::: "memory").
In other words, x86 can guarantee the order of I/O writes, I think.
I think this is true too. But where can I find official documentation about this? No document in "Documentation" in the kernel source says x86 doesn't reorder I/O writes, and no Intel x86 architecture document talks about this.
The only document I found that mentions it is Documentation/PCI/pci.txt (paragraph "MMIO Space and Write Posting"), which specifically says: "Writes to MMIO space allow the CPU to continue before the transaction reaches the PCI device", and suggests adding a dummy read to be sure the write really reaches the hardware.
x86 is definitely a PCI architecture. So, who can we trust?
Am I missing something?

See arch/x86/include/asm/io.h:
#define mmiowb() barrier()


I found an old mail discussion about mmiowb() usage:
http://www.gelato.unsw.edu.au/archives/linux-ia64/0708/21056.html
http://www.gelato.unsw.edu.au/archives/linux-ia64/0708/21096.html
From: Nick Piggin <npiggin_at_suse.de>
Date: 2007-08-24 12:59:16
On Thu, Aug 23, 2007 at 09:16:42AM -0700, Linus Torvalds wrote:

On Thu, 23 Aug 2007, Nick Piggin wrote:

Also, FWIW, there are some advantages of deferring the mmiowb thingy
until the point of unlock.

And that is exactly what ppc64 does.

But you're missing a big point: for 99.9% of all hardware, mmiowb() is a total no-op. So when you talk about "advantages", you're not talking about any *real* advantage, are you?


Furthermore, a lot of PCI drivers seem to ignore it.
Can you explain that to me?

I only found one link which may explain why many drivers removed mmiowb():
http://lwn.net/Articles/283776/


As far as I can see in the 2.6.33 code, this patch was not applied to the vanilla kernel source. So that's not the point.
Regards
Luca

--
To unsubscribe from this list: send an email with
"unsubscribe kernelnewbies" to ecartis@xxxxxxxxxxxx
Please read the FAQ at http://kernelnewbies.org/FAQ


