Re: [RFC PATCH 5/8] qaic: Implement data path

Arnd Bergmann <arnd@xxxxxxxx> · Fri, 15 May 2020 00:20:55 +0200

On Fri, May 15, 2020 at 12:06 AM Jeffrey Hugo <jhugo@xxxxxxxxxxxxxx> wrote:

> >> +       u16     req_id;
> >> +       u8      seq_id;
> >> +       u8      cmd;
> >> +       u32     resv;
> >> +       u64     src_addr;
> >> +       u64     dest_addr;
> >> +       u32     len;
> >> +       u32     resv2;
> >> +       u64     db_addr; /* doorbell address */
> >> +       u8      db_len; /* not a raw value, special encoding */
> >> +       u8      resv3;
> >> +       u16     resv4;
> >> +       u32     db_data;
> >> +       u32     sem_cmd0;
> >> +       u32     sem_cmd1;
> >> +       u32     sem_cmd2;
> >> +       u32     sem_cmd3;
> >> +} __packed;
> >
> > All members are naturally aligned, so better drop the __packed
> > annotation get better code, unless the structure itself is
> > at an unaligned offset in memory.
>
> I'm going to have to disagree.  While most "sane" compilers would not
> add extra padding, I've debugged enough issues in the past when
> sending/receiving data with foreign environments to never trust anything
> that isn't "packed".
>
> Unless I missed something in the C spec that requires naturally aligned
> structures to have an identical layout in memory, I'll take safety and
> functional correctness over performance.

The C standard does not mandate the layout and individual ABIs can
be different, but Linux does only runs on ABIs that have the correct
layout for the structure above

The problem is that the compiler will split up word accesses into bytes
on some architectures such as older ARM, to avoid unaligned
loads and stores. It should be fine if you add both __packed and
__aligned(8) to make the structure aligned again.

> >> +       init_completion(&mem->xfer_done);
> >> +       list_add_tail(&mem->list, &dbc->xfer_list);
> >> +       tail = (tail + mem->nents) % dbc->nelem;
> >> +       __raw_writel(cpu_to_le32(tail), dbc->dbc_base + REQTP_OFF);
> >
> > What is this __raw accessor for? This generally results in non-portable
> > code that should be replaced with writel() or something specific to
> > the bus on the architecture you deal with.
>
> The barrier(s) that comes with writel are unnecessary in this case.
> Since this is part of our critical path, we are sensitive to its
> performance.
>
> What are the portability issues around the __raw variant?

The access may not be atomic on all architectures but get either
split up or combined with adjacent accesses.

There is no guarantee about endianess: while most architectures
treat __raw access as a simple load/store, some might have an
implied byteswap.

__raw accesses might be completely reordered around other
mmio accesses or spinlocks.

If you want to avoid the barrier, there is the
writel_relaxed()/readl_relaxed() that skips most barriers,
so they can be reordered against MSI interrupts and
DMA accesses on architectures that allow (e.g. ARM
or PowerPC) but not against each other.

If you do use them, I'd still expect a comment that explains
why this instance is performance critical and how it is
synchronized against DMA or MSI if necessary.

    Arnd