On Mon, Feb 26, 2018 at 6:01 PM, James Smart <jsmart2021@xxxxxxxxx> wrote: > > > On 2/26/2018 12:36 AM, Arnd Bergmann wrote: >> >> Unfortunately, this is still broken on all big-endian architectures. You >> could >> use __raw_writeq() here to fix it, or change the if() clause at the >> beginning >> to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that. >> >> Arnd > > > Please explain. There doesn't seem to be a definitive guide on which of the > several routines for bar mapping and the several routines for memory copying > should be used. We've gone with what we tested and measured. > > Also, this code is disabled on everything but X86 right now - so the patch > should be fine. We settled on the combination which our testing showed the > best results. We tried other architectures, but there wasn't a combination > that showed great results. If we figure a combination out, we will update > the code. For the endianess, the key to understanding this is that readl/writel and readq/writeq follow the convention of accessing data as little-endian because that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit register value that gets copied between a register and the CPU, and it's up to the device manufacturer to implement that convention in hardware. The Linux implementation of writeq() on all architectures therefore puts the LSB into the first byte of the output. In contrast, accessing memory using a pointer dereference is defined by the CPU manufacturer, so in a big-endian CPU the first byte corresponds to the MSB of the CPU register. In your loop for (i = 0; i < q->entry_size; i += sizeof(uint64_t)) writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i); that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr, which makes no sense as the data in tmp is (presumably) just a stream of bytes that we want to end up on a disk in the same order, regardless of which mode the CPU happens to be in at the time we copy the data. > Note: when we used ioremap_wc() both PPC/BE and X86 dropped in overall > performance as it seems to make things cacheable and the speculation is the > caching aspects delayed the results just enough to make the overall gain go > down. ioremap_wc should never make the access cacheable, but architectures can fall back to a regular uncached mapping without write-combining when there is no hardware support for write-combining. However, when you use writeq() on powerpc, you get a very expensive barriers ("sync") before each 8 byte chunk, which purposely defeats the write-combining, plus you might get an indirect function call, depending on the kernel configuration and the hardware you are running on. Using __raw_writeq() should get both the endianess right and avoid all those barriers you really don't want for what is essentially a memcpy(). Arnd