Re: [RFC] [PATCH v2 00/11] Optimize spi_sync path

On Thu, 16 Jun 2022 17:30:03 +0200
David Jander <david@xxxxxxxxxxx> wrote:
> [...]
> > > Ideally this would get as much different hardware testing as possible before
> > > going further upstream. Do you have access to some platforms suitable for
> > > stressing SPI with multiple clients simultaneously? Known "problematic"
> > > controllers maybe?    
> > 
> > Well, the fastest way to get it into a wide range of CI is for me to
> > apply it so people who test -next will start covering it...  I was going
> > to kick it into my test branch KernelCI once I've got it reviewed
> > properly which will get at least some boot testing on a bunch of
> > platforms.  
> 
> Ah, great. I will see if I can get it tested on some more platforms from
> our side.

Ok, I have managed to test kernel 5.19-rc2 with this patch set on an i.MX6DL
machine. This is a 32-bit dual-core ARM Cortex-A9, but it has the same spi-imx
controller as my 64-bit board. It has two SPI peripherals, unfortunately each
on a separate bus: a TSC2046E touchscreen controller using the IIO ADC backend
driver, and a SPI NOR flash chip.
Everything works fine. Hammering the NOR flash and the touchscreen
simultaneously works as expected. I also have some interesting performance
figures for several different kernel configurations.

First of all, the spi-imx driver seems to perform really, really badly in DMA
mode in all cases. Probably something is wrong with SDMA on this board, or the
transfers are simply too small for DMA to pay off. The driver has two module
parameters: use_dma and polling_limit_us. Disabling DMA brings the driver back
to "normal" performance levels, and setting polling_limit_us to a very large
value forces the driver to work in polling mode, i.e. to not use IRQs at all.
Since the system isn't doing anything apart from these tests, forced polling
mode is expected to give the highest throughput, but that could be considered
cheating.
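
For reference, the knobs boil down to something like this (a sketch only: the
exact spelling, and whether a runtime change of use_dma takes effect without
re-binding the controller, are assumptions and may depend on the kernel
version). With spi-imx built in, DMA can be disabled on the kernel command
line with:

  spi_imx.use_dma=0

and polling mode can be forced by additionally passing something like:

  spi_imx.polling_limit_us=10000000

or the same can be attempted at runtime via sysfs:

$ echo 0 > /sys/module/spi_imx/parameters/use_dma
$ echo 10000000 > /sys/module/spi_imx/parameters/polling_limit_us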

Test procedure:

First, create some continuous stress on the SPI bus:

$ (while true; do dd if=/dev/mtdblock2 of=/dev/null bs=1024; done) >/dev/null 2>&1 &

The histogram shows this:

$ for t in /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_*; do printf "%s:\t%d\n" $t `cat $t`; done
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_0-1:        228583
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_1024-2047:  0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_128-255:    1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16-31:      1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16384-32767:        0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2-3:        228579
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2048-4095:  0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_256-511:    0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32-63:      1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32768-65535:        0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4-7:        1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4096-8191:  0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_512-1023:   228576
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_64-127:     0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_65536+:     0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8-15:       0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8192-16383: 0

There are no spi_async() transfers involved; the spi-nor driver apparently
only issues sync calls in this scenario, which makes sense. Note that even
though I specified "bs=1024" on the dd command line, the spi-nor driver seems
to do 512-byte transfers anyway (the matching counts in the 0-1 and 2-3 byte
buckets are presumably the read opcode and 3-byte address phases of each of
those reads). Making "bs" very small doesn't change this either, so
unfortunately we cannot simulate many very small transfers this way. This also
means that the performance impact is likely very limited in this scenario,
since the per-transfer overhead is small compared to the size of each
transfer.
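
For completeness, the same statistics directory also has per-call counters
(spi_sync, spi_sync_immediate and spi_async), which is how the sync/async
split can be double-checked; spi_sync_immediate should be the interesting one
for this patch set, as I believe it counts the sync messages that were
completed directly in the calling context:

$ for t in /sys/bus/spi/devices/spi0.0/statistics/spi_*; do printf "%s:\t%d\n" $t `cat $t`; done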

The data below shows the CPU load distribution while running the above command
(this is the topmost line of the "top" tool I have on this board) and the
throughput of a single call to dd.
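
The single measurement is essentially one iteration of the stress loop above
run in the foreground, i.e. roughly (exact invocation assumed):

$ dd if=/dev/mtdblock2 of=/dev/null bs=1024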

Tested kernel builds:

 1. Kernel 5.19-rc2 vanilla
 1.1. use_dma = true (default):

CPU:  0.4% usr 20.4% sys  0.0% nic  0.0% idle 77.5% io  0.0% irq  1.5% sirq
3604480 bytes (3.4MB) copied, 3.685375 seconds, 955.1KB/s

 1.2. use_dma = false:

CPU:  0.5% usr 55.9% sys  0.0% nic  0.0% idle 43.2% io  0.0% irq  0.2% sirq
3604480 bytes (3.4MB) copied, 2.017914 seconds, 1.7MB/s

 1.3. forced polling:

CPU:  0.3% usr 99.4% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.1% sirq
3604480 bytes (3.4MB) copied, 1.330517 seconds, 2.6MB/s

 2. Kernel 5.19-rc2 with only the per-cpu stat patch:
 2.1. use_dma = true (default):

CPU:  0.2% usr 20.1% sys  0.0% nic  0.0% idle 78.3% io  0.0% irq  1.3% sirq
3604480 bytes (3.4MB) copied, 3.682393 seconds, 955.9KB/s

 2.2. use_dma = false:

CPU:  0.5% usr 56.8% sys  0.0% nic  0.0% idle 42.3% io  0.0% irq  0.2% sirq
3604480 bytes (3.4MB) copied, 2.015509 seconds, 1.7MB/s

 2.3. forced polling:

CPU:  0.3% usr 99.4% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.1% sirq
3604480 bytes (3.4MB) copied, 1.319247 seconds, 2.6MB/s

 3. Kernel 5.19-rc2 with both per-cpu stats and optimized sync path patches:
 3.1. use_dma = true (default):

CPU:  0.0% usr 17.8% sys  0.0% nic  0.0% idle 81.0% io  0.0% irq  1.1% sirq
3604480 bytes (3.4MB) copied, 3.650646 seconds, 964.2KB/s

 3.2. use_dma = false:

CPU:  0.6% usr 43.8% sys  0.0% nic  0.0% idle 55.2% io  0.0% irq  0.3% sirq
3604480 bytes (3.4MB) copied, 1.971478 seconds, 1.7MB/s

 3.3. forced polling:

CPU:  0.3% usr 99.6% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.0% sirq
3604480 bytes (3.4MB) copied, 1.314353 seconds, 2.6MB/s

Of course I measured each value several times. The results were very
consistent, and I picked the median result for each of the cases above.
The differences are small, as expected given that the transfers are rather
large, but especially if you look at the "seconds" value reported by "dd",
there is an incremental improvement for each of the patches. The total
improvement seems to be around 1.5-2% less CPU overhead... which IMHO is still
remarkable given how insensitive this test is to per-transfer overhead.
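
For reference, from the dd times above: the non-DMA (PIO/IRQ) case goes from
2.017914 s to 1.971478 s, i.e. roughly (2.017914 - 1.971478) / 2.017914 ~= 2.3%,
and forced polling goes from 1.330517 s to 1.314353 s, i.e. roughly 1.2%, which
brackets that 1.5-2% estimate.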

Testing the impact on the touchscreen driver doesn't make much sense though,
since it runs at a 1 MHz clock rate and is already optimized to only do very
large single SPI transfers, so any overhead reduction would be lost in the
noise.

Will report more results if I can test on other platforms.

Best regards,

-- 
David Jander



