Re: [PATCH 3/3] spi: bcm2835: add module parameter to configure minimum length for dma

Lukas Wunner <lukas@xxxxxxxxx> · Sun, 24 Mar 2019 11:15:52 +0100

On Sun, Mar 24, 2019 at 09:52:15AM +0100, kernel@xxxxxxxxxxxxxxxx wrote:
> > On 24.02.2019, at 20:10, Stefan Wahren <stefan.wahren@xxxxxxxx> wrote:
> > > On 24 February 2019 at 17:23, kernel@xxxxxxxxxxxxxxxx wrote:
> > > Allow configuring, via a module parameter, the minimum transfer
> > > length at which DMA is used.
> > 
> > Please provide the motivation for this change.
> 
> As we provide control over the selection of polling vs. interrupt mode,
> we should, for consistency, also provide control over the selection of
> DMA mode.
> 
> DMA mapping is quite expensive, and at higher SPI clock speeds it
> may be more economical CPU-wise to run in polling mode instead of
> DMA mode.

The problem is that making the DMA minimum length configurable
by itself impacts performance because a memory read is necessary
to retrieve the limit, instead of a hardcoded immediate in the
machine code.  Ultimately this feature is only of interest to
developers optimizing the code, not really to end users.
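To illustrate the point about the immediate, here is a user-space sketch (not driver code) of the two variants. The threshold value, the names and the setter are hypothetical; whether the compiler really emits the constant as an immediate depends on the architecture and on the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold, chosen only for illustration. */
#define HYPOTHETICAL_DMA_MIN_LEN 96u

/* Variant A: compile-time constant, so the compare can use an
 * immediate operand and no memory load is needed. */
static bool use_dma_const(size_t len)
{
	return len >= HYPOTHETICAL_DMA_MIN_LEN;
}

/* Variant B: the limit lives in a variable, as a module parameter
 * would, so every decision first has to load it from memory. */
static unsigned int dma_min_len = HYPOTHETICAL_DMA_MIN_LEN;

/* Hypothetical setter standing in for writing the module parameter. */
static void set_dma_min_len(unsigned int v)
{
	assert(v > 0);
	dma_min_len = v;
}

static bool use_dma_param(size_t len)
{
	return len >= dma_min_len;
}
```

Both variants make the same decision; the difference is only the extra load in variant B on every transfer.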

> Also, DMA mode has one specific difference from polling mode:
> there is no idle clock cycle between transferred bytes.

Seriously?  If that's true it should be documented in the driver.
That seems like a major advantage of DMA mode.

> This may have a negative impact when transferring lots of bytes at
> the fastest possible clock speed to some MCUs without SPI buffers,
> where a gap after each byte helps.

Hm, wouldn't a slower SPI clock speed achieve the same?
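The byte-rate side of that question is easy to check. The sketch below assumes, as claimed in the quoted text but not verified here, that polling mode inserts exactly one idle SCLK cycle after each 8-bit byte; the function names are hypothetical.

```c
#include <assert.h>

/* Time for one byte on the wire in nanoseconds, given the SPI clock
 * and the number of idle clock cycles inserted after each byte. */
static double byte_period_ns(double sclk_hz, unsigned int idle_cycles)
{
	assert(sclk_hz > 0.0);
	return (8.0 + idle_cycles) * 1e9 / sclk_hz;
}

/* Clock at which a gap-free transfer matches the byte rate of a
 * transfer with idle_cycles idle cycles per byte at sclk_hz. */
static double equivalent_gapless_clock_hz(double sclk_hz,
					  unsigned int idle_cycles)
{
	assert(sclk_hz > 0.0);
	return sclk_hz * 8.0 / (8.0 + idle_cycles);
}
```

For example, 9 MHz with one idle cycle and 8 MHz without gaps both give a 1000 ns byte period. Whether the two are equivalent for a given MCU depends on where it needs the slack; the arithmetic only shows that the average byte rate can be matched.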

As a general remark, the interrupt mode is currently suboptimal
because when the TX FIFO becomes empty, there's a latency until
it is filled again.  Instead, we should try to keep it non-empty
at all times.  This can be achieved with the RXR interrupt:
It signals that >= 48 bytes are in the RX FIFO, so in theory if
we receive that interrupt, we could write 48 bytes to the TX FIFO.

The problem is, this clashes with your algorithm which tries to
stuff as many bytes as possible in the TX FIFO.  Only if we give
that FIFO stuffing algorithm up do we know for sure that 48 bytes
are free in the TX FIFO.
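A user-space sketch of the accounting this would need, assuming a 64-byte FIFO (consistent with the 16-dword figure below) and RXR firing at 48 bytes. If the bytes in flight (TX FIFO, shift register and RX FIFO together) never exceed the FIFO depth, an RX level of at least 48 bounds the TX level to at most 16, so 48 bytes are free. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 64u /* bytes, assumed */
#define RXR_LEVEL  48u /* RXR signals >= 48 bytes in the RX FIFO */

struct spi_acct {
	size_t tx_written; /* bytes written to the TX FIFO so far */
	size_t rx_read;    /* bytes read from the RX FIFO so far */
};

/* Bytes that may be written now without exceeding FIFO_DEPTH bytes
 * in flight. */
static size_t tx_budget(const struct spi_acct *a)
{
	size_t in_flight = a->tx_written - a->rx_read;

	assert(in_flight <= FIFO_DEPTH);
	return FIFO_DEPTH - in_flight;
}

/* Model of an RXR interrupt: drain rx_level bytes from the RX FIFO,
 * then return how many bytes may be written to the TX FIFO. */
static size_t on_rxr(struct spi_acct *a, size_t rx_level)
{
	assert(rx_level >= RXR_LEVEL);
	assert(rx_level <= a->tx_written - a->rx_read);
	a->rx_read += rx_level;
	return tx_budget(a);
}
```

Greedy FIFO stuffing, as I understand it, writes whenever the TX FIFO reports free space, so the in-flight count can exceed the FIFO depth and the RX level no longer bounds the TX level.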

Also, both poll mode and interrupt mode could be sped up by
switching to pseudo-DMA mode, as I've done in 3bd7f6589f67,
i.e. switch to DMA mode but access the chip with programmed I/O.
That way, the number of MMIO accesses would be reduced by a
factor of 4.  So if the TX FIFO is empty, perform 16 writes
to fill it.  Write another 12 dwords once RXR is signaled.
Read 16 dwords upon RXF or 12 dwords upon RXR.
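The dword arithmetic can be checked with a small user-space model. The 32-bit register write is a hypothetical stub that only counts accesses, the byte order within a dword is assumed to be little-endian, and the hardware semantics of pseudo-DMA mode are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit MMIO write to the FIFO register. */
static size_t mmio_writes;
static uint32_t last_word;

static void fake_fifo_write32(uint32_t val)
{
	last_word = val;
	mmio_writes++;
}

/* Pack len bytes into 32-bit words (assumed little-endian, tail
 * zero-padded) and issue one MMIO write per word. Returns the number
 * of words written. */
static size_t write_fifo_dwords(const uint8_t *buf, size_t len)
{
	size_t words = 0;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i += 4) {
		uint32_t w = 0;

		for (size_t b = 0; b < 4 && i + b < len; b++)
			w |= (uint32_t)buf[i + b] << (8 * b);
		fake_fifo_write32(w);
		words++;
	}
	return words;
}
```

Filling a 64-byte FIFO takes 16 writes instead of 64, and the 48 bytes signaled by RXR take 12 instead of 48.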

This would make the time spent in the IRQ handler super short,
but at the expense of receiving more interrupts.

Poll mode could function the same way: precalculate the time it
takes for the TX FIFO to empty or the RX FIFO to fill, and
usleep_range() for that long to yield the CPU to other tasks.
Again, this means more wakeups for the thread.  I'm not sure
which one is the lesser evil, but your FIFO stuffing algorithm
forces us to leave optimization potential on the table and that
bothers me.
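The precalculation for poll mode could look roughly like this user-space sketch; the function name is hypothetical, and clocks_per_byte is 8, or 9 if the idle cycle mentioned earlier in the thread really exists. The result is rounded up so the sleep never ends before the FIFO has drained.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds needed to clock `bytes` bytes at `sclk_hz`, assuming
 * `clocks_per_byte` SCLK cycles per byte. Rounded up. */
static uint64_t fifo_drain_us(uint32_t bytes, uint32_t sclk_hz,
			      uint32_t clocks_per_byte)
{
	uint64_t cycles;

	assert(sclk_hz > 0);
	cycles = (uint64_t)bytes * clocks_per_byte;
	return (cycles * 1000000u + sclk_hz - 1) / sclk_hz;
}
```

In the driver, the result would serve as the lower bound passed to usleep_range(). At 125 MHz a full 64-byte FIFO drains in about 4 us (rounded up to 5 here), which shows why the number of wakeups is the open question.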

Let me know what you think.

Thanks,

Lukas