Hi Lukas!

> On 24.03.2019, at 11:15, Lukas Wunner <lukas@xxxxxxxxx> wrote:
>
> On Sun, Mar 24, 2019 at 09:52:15AM +0100, kernel@xxxxxxxxxxxxxxxx wrote:
>>> On 24.02.2019, at 20:10, Stefan Wahren <stefan.wahren@xxxxxxxx> wrote:
>>>> kernel@xxxxxxxxxxxxxxxx wrote on 24 February 2019 at 17:23:
>>>> Allow setting the length of the transfer at which dma is used by
>>>> setting a module parameter.
>>>
>>> please provide the motivation of this change.
>>
>> As we provide control over the selection of polling vs. interrupt mode
>> we should - for consistency - also provide control over selection of
>> DMA mode.
>>
>> DMA mapping is quite expensive and with higher SPI clock speeds it
>> may be more economic CPU wise to run in polling mode instead of DMA
>> mode.
>
> The problem is that making the DMA minimum length configurable
> by itself impacts performance because a memory read is necessary
> to retrieve the limit, instead of a hardcoded immediate in the
> machine code. Ultimately this feature is only of interest to
> developers optimizing the code, not really to end users.

The host path is IMO not so hot that one additional memory read is that
expensive. For all practical purposes - at least in DMA or IRQ driven
mode - the “hot” code path will typically no longer be in the cache by
the time it executes again, as other processes/threads have probably
been scheduled on the CPU in the meantime and the cache lines for that
code will be gone anyway.
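For reference, what the patch boils down to is roughly the following
shape (a minimal sketch, not the actual patch - the parameter name, the
default and the way it is consulted from the can_dma hook are only
illustrative):

    #include <linux/module.h>
    #include <linux/spi/spi.h>

    /* sketch only: name and default are illustrative */
    static unsigned int dma_min_len = 96;
    module_param(dma_min_len, uint, 0644);
    MODULE_PARM_DESC(dma_min_len,
                     "minimum transfer length in bytes before DMA is used");

    /* a can_dma()-style hook deciding per transfer whether to map for DMA */
    static bool bcm2835_spi_can_dma(struct spi_controller *ctlr,
                                    struct spi_device *spi,
                                    struct spi_transfer *tfr)
    {
            /* this is the extra memory read referred to above */
            return tfr->len >= dma_min_len;
    }

Whether that one read of dma_min_len is measurable next to the cost of
the DMA mapping itself is exactly the question here.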
>
>> Also DMA mode has one specific difference to Polling mode:
>> there is no idle clock cycle between bytes transferred.
>
> Seriously? If that's true it should be documented in the driver.
> That seems like a major advantage of DMA mode.

It is true - and that it is not documented explicitly is also true (I
believe it was mentioned in a cover letter). But as said, it is NOT
always an advantage. I also find it surprising that you have not looked
at SPI traces on a logic analyzer, where this is immediately visible.

>
>> This may have negative impact when transferring lots of bytes to
>> some mcus without SPI buffers at the fastest possible clock speed,
>> where it helps when there is a gap after each byte.
>
> Hm, wouldn't a slower SPI clock speed achieve the same?

Yes, it would, but then you would need to make the SPI clock cycle
possibly 3-4 times as long to communicate with an ATmega MCU in slave
mode, which essentially wastes possible transfer rate.

>
> As a general remark, the interrupt mode is currently suboptimal
> because when the TX FIFO becomes empty, there's a latency until
> it is filled again. Instead, we should try to keep it non-empty
> at all times. This can be achieved with the RXR interrupt:
> It signals that >= 48 bytes are in the RX FIFO, so in theory if
> we receive that interrupt, we could write 48 bytes to the TX FIFO.
>
> The problem is, this clashes with your algorithm which tries to
> stuff as many bytes as possible in the TX FIFO. Only if we give
> that FIFO stuffing algorithm up do we know for sure that 48 bytes
> are free in the TX FIFO.
>
> Also, both poll mode and interrupt mode could be sped up by
> switching to pseudo-DMA mode, as I've done in 3bd7f6589f67,
> i.e. switch to DMA mode but access the chip with programmed I/O.
> That way, the number of MMIO accesses would be reduced by a
> factor of 4. So if the TX FIFO is empty, perform 16 writes
> to fill it. Write another 12 dwords once RXR is signaled.
> Read 16 dwords upon RXF or 12 dwords upon RXR.

How would you speed up poll mode, really? It is polling and consuming
CPU cycles anyway!

>
> This would make the time spent in the IRQ handler super short,
> but at the expense of receiving more interrupts.

In my experience minimizing interrupts should be the main goal, because
interrupts add long “stalls” at high SPI clock speeds. I do not have the
exact numbers at hand, but there is a latency that typically produces
gaps in the order of 2-10us (depending on clock frequencies). So my
original modifications of the driver were focused on avoiding those in
the first place.

Also, on single-core machines (like the original RPi or the RPi Zero)
every interrupt means interrupting normal processes for quite some time,
so this should be avoided in the first place. And that is where the
actual limit of 96 bytes for DMA came from: at 96 (= 2 * 48) bytes we
would take 2 interrupts in interrupt mode and also 2 interrupts in DMA
mode (one for RX, one for TX).

>
> Poll mode could function the same and precalculate the time it
> takes for the TX FIFO to empty or the RX FIFO to become filled,
> and usleep_range() as long to yield the CPU to other tasks.
> Again, this means more wakeups for the thread. I'm not sure
> which one is the lesser evil but your FIFO stuffing algorithm
> forces us to leave optimization potential on the table and that
> bothers me.

Poll mode deliberately spends CPU time to be as efficient as possible
from the throughput perspective: it avoids unnecessary interrupts and
thread wakeups, which are not guaranteed to happen as fast as possible.
Right now the implementation spends a bit more time filling the FIFO,
but even if that were faster you would still be reading the status
register in a tight loop, which does not make a huge difference on the
internal bus - either way it is a bus transfer.

This is especially efficient when running the SPI pump “inline” in
spi_sync - no wakeups, no interrupts, no context switches… That is the
only way to get SPI transfer latencies down to ~6us between 2 transfers
(that is the time between stopping the clock of the first transfer and
starting the clock of the second transfer). Otherwise you can get long
latencies in the order of 1ms, which is prohibitively long and just
wastes resources on scheduling, context switches, …
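To make the timing trade-off concrete, the logic we are arguing about
looks roughly like this (a sketch only: the helper and the thresholds
are made up, the 9 cycles/byte estimate reflects the extra idle clock
cycle per byte mentioned above, and the actual FIFO shuffling loop is
elided):

    #include <linux/delay.h>
    #include <linux/math64.h>
    #include <linux/time.h>

    /* made-up tuning knobs, for illustration only */
    #define SLEEP_THRESHOLD_US      30
    #define SLEEP_SLACK_US          10

    /* estimate how long a PIO transfer occupies the bus:
     * 8 data bits plus roughly one idle clock cycle per byte */
    static unsigned long spi_xfer_time_us(unsigned int len, u32 spi_hz)
    {
            u64 us = (u64)len * 9 * USEC_PER_SEC;

            do_div(us, spi_hz);
            return us;
    }

    static void wait_for_xfer(unsigned int len, u32 spi_hz)
    {
            unsigned long budget = spi_xfer_time_us(len, spi_hz);

            /*
             * The suggestion above: yield the CPU for most of the
             * expected transfer time instead of spinning, at the
             * price of an extra wakeup.
             */
            if (budget > SLEEP_THRESHOLD_US)
                    usleep_range(budget - SLEEP_SLACK_US, budget);

            /*
             * ...then finish with the existing tight loop that shuffles
             * the FIFOs and checks the DONE bit (elided here).
             */
    }

For transfers shorter than the typical wakeup latency the sleep variant
cannot win, which is the reason short transfers are polled in the first
place.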
At one point I was wondering whether a “shared” polling infrastructure
(where multiple buses poll in a single shared thread) would help. But
that led to the realization that for this to work we would need to
schedule at least 2 thread wakeups: one for the “shared” polling thread
(even if filling the FIFO could get done before scheduling that wakeup)
and another one to wake spi_sync again. Assuming the polling thread is
already running, its wakeup would not really be needed, but this would
still leave us with one thread wakeup.

As a general principle there is other low-hanging fruit that would
improve CPU efficiency much more! E.g. the unnecessary wakeups of the
SPI pump thread when running in sync mode, where we consume about 30%
of a single CPU just for these wakeups when running 22k SPI
messages/second. In total this resulted in 130% CPU utilization spread
over 2 threads, plus sometimes async scheduling...

If you want to give optimizing things a try (e.g. pseudo-DMA mode), then
please go ahead and post patches, and I will give them a try with
workloads that trigger those paths.

Finally: maybe this “use DMA/polling” decision should be handled by the
SPI core instead, so that these settings can be made in the device tree
per SPI device and not as a general bus policy.

But I guess we have now moved from discussing a specific patch to a
discussion of “visions” for individual drivers and the spi-core.

Martin