Re: [PATCH] lpfc: correct writeq failures on 32-bit architectures

Arnd Bergmann <arnd@xxxxxxxx> · Mon, 26 Feb 2018 21:04:51 +0100

On Mon, Feb 26, 2018 at 6:01 PM, James Smart <jsmart2021@xxxxxxxxx> wrote:
>
>
> On 2/26/2018 12:36 AM, Arnd Bergmann wrote:
>>
>> Unfortunately, this is still broken on all big-endian architectures. You
>> could
>> use __raw_writeq() here to fix it, or change the if() clause at the
>> beginning
>> to include '!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)' to avoid that.
>>
>>          Arnd
>
>
> Please explain.  There doesn't seem to be a definitive guide on which of the
> several routines for bar mapping and the several routines for memory copying
> should be used.  We've gone with what we tested and measured.
>
> Also, this code is disabled on everything but X86 right now - so the patch
> should be fine. We settled on the combination which our testing showed the
> best results. We tried other architectures, but there wasn't a combination
> that showed great results. If we figure a combination out, we will update
> the code.

For the endianess, the key to understanding this is that readl/writel and
readq/writeq follow the convention of accessing data as little-endian because
that is what 99% of MMIO accesses on PCI are: you have a 32-bit or 64-bit
register value that gets copied between a register and the CPU, and it's
up to the device manufacturer to implement that convention in hardware.
The Linux implementation of writeq() on all architectures therefore puts the
LSB into the first byte of the output.

In contrast, accessing memory using a pointer dereference is defined by
the CPU manufacturer, so in a big-endian CPU the first byte corresponds
to the MSB of the CPU register.

In your loop

                for (i = 0; i < q->entry_size; i += sizeof(uint64_t))
                        writeq(*((uint64_t *)(tmp + i)), q->dpp_regaddr + i);

that ends up copying the first byte of 'tmp[]' to byte 7 of dpp_regaddr,
which makes no sense as the data in tmp is (presumably) just a stream of
bytes that we want to end up on a disk in the same order, regardless
of which mode the CPU happens to be in at the time we copy the data.

> Note: when we used ioremap_wc() both PPC/BE and X86 dropped in overall
> performance as it seems to make things cacheable and the speculation is the
> caching aspects delayed the results just enough to make the overall gain go
> down.

ioremap_wc should never make the access cacheable, but architectures
can fall back to a regular uncached mapping without write-combining when
there is no hardware support for write-combining.

However, when you use writeq() on powerpc, you get a very expensive
barriers ("sync") before each 8 byte chunk, which purposely defeats the
write-combining, plus you might get an indirect function call, depending on
the kernel configuration and the hardware you are running on.

Using __raw_writeq() should get both the endianess right and avoid all
those barriers you really don't want for what is essentially a memcpy().

       Arnd