Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access

On Tue, 17 May 2022 12:57:18 +0100
Mark Brown <broonie@xxxxxxxxxx> wrote:

> On Tue, May 17, 2022 at 12:24:39PM +0200, David Jander wrote:
> 
> > (mainly in spi.c for now). Time interrupt line stays low:  
> 
> >  1. Kernel 5.18-rc1 with only polling patches from spi-next: 135us  
> 
> >  2. #if 0 around all stats and accounting calls: 100us  
> 
> >  3. The _fast API of my original RFC: 55us  
> 
> > This shows that the accounting code is a bit less than half of the dispensable
> > overhead for my use-case. Indeed an easy target.  
> 
> Good.
> 
> > on, so I wonder whether there is something to gain if one could just call
> > spi_bus_lock() at the start of several such small sync transfers and use
> > non-locking calls (skip the queue lock and io_mutex)? Not sure that would have
> > a meaningful impact, but to get an idea, I replaced the bus_lock_spinlock and
> > queue_lock in __spi_sync() and __spi_queued_transfer() with the bare code in
> > __spi_queued_transfer(), since it won't submit work to the queue in this case
> > anyway. The resulting interrupt-active time decreased by another 4us, which is
> > approximately 5% of the dispensable overhead. For the record, that's 2us per
> > spinlock lock/unlock pair.  
> 
> I do worry about how this might perform under different loads where
> there are things coming in from more than one thread.

I understand your concern. The biggest issue is probably that client drivers
can claim exclusive access to the bus this way, and there is no guarantee
they won't hold it for too long.
In the end, if multiple latency-sensitive client drivers share the same SPI
bus, that is questionable HW design to begin with and is asking for trouble
anyway. But I agree that use of spi_bus_lock() by client drivers is probably
not a good idea.
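
For reference, with the existing API that pattern would look roughly like
this from a client driver. spi_bus_lock(), spi_sync_locked() and
spi_bus_unlock() are existing API; the opcode and buffer layout below are
made up for illustration:

#include <linux/spi/spi.h>

/* Batch a few small sync transfers while holding the bus lock. */
static int example_read_status(struct spi_device *spi, u8 *status)
{
	u8 tx[2] = { 0x05, 0x00 };	/* hypothetical "read status" opcode */
	u8 rx[2];
	struct spi_transfer xfer = {
		.tx_buf = tx,
		.rx_buf = rx,
		.len = sizeof(tx),
	};
	struct spi_message msg;
	int ret;

	spi_bus_lock(spi->controller);		/* exclusive bus access */

	spi_message_init_with_transfers(&msg, &xfer, 1);
	ret = spi_sync_locked(spi, &msg);	/* doesn't take the bus lock again */

	/* ... further spi_sync_locked() calls for the rest of the burst ... */

	spi_bus_unlock(spi->controller);
	if (!ret)
		*status = rx[1];
	return ret;
}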

> > > One thing that might be useful would be if we could start the initial
> > > status read message from within the hard interrupt handler of the client
> > > driver with the goal that by the time it's threaded interrupt handler
> > > runs we might have the data available.  That could go wrong on a lightly
> > > loaded system where we might end up running the threaded handler while
> > > the transfer is still running, OTOH if it's lightly loaded that might
> > > not matter.  Or perhaps just use a completion from the SPI operation and
> > > not bother with the threaded handler at all.  
> 
> > You mean ("ctx" == context switch):  
> 
> >  1. hard-IRQ, queue msg --ctx--> SPI worker, call msg->complete() which does
> >  thread IRQ work (but can only do additional sync xfers from this context).  
> 
> > vs.  
> 
> >  2. hard-IRQ, queue msg --ctx--> SPI worker, call completion --ctx--> IRQ
> >  thread wait for completion and does more xfers...  
> 
> > vs (and this was my idea).  
> 
> >  3. hard-IRQ, pump FIFO (if available) --ctx--> IRQ thread, poll FIFO, do more
> >  sync xfers...  
> 
> Roughly 1, but with a lot of overlap with option 3.  I'm unclear what
> you mean by "queue message" here.

In the above, I meant:

 "queue message": 
	list_add_tail(&msg->queue, &ctlr->queue);
	kthread_queue_work(ctlr->kworker, &ctlr->pump_messages);

 "pump FIFO":
	A special function in the SPI controller driver that asserts CS and
	fills the TX FIFO so the transfer starts, then returns immediately.

 "poll FIFO": 
	Busy-loop wait until TX-FIFO is empty and RX-FIFO has the
	complete RX message.
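
To make that a bit more concrete, here is a rough controller-driver-side
sketch of what "pump FIFO" and "poll FIFO" could look like. The register
offsets, the foo_spi state and the function split are all made up; nothing
of this exists in the SPI core today:

#include <linux/bits.h>
#include <linux/gpio/consumer.h>
#include <linux/io.h>

#define FOO_TXDATA	0x00		/* hypothetical TX FIFO register */
#define FOO_RXDATA	0x04		/* hypothetical RX FIFO register */
#define FOO_STAT	0x08		/* hypothetical status register */
#define FOO_STAT_RXRDY	BIT(0)		/* at least one RX word available */

struct foo_spi {			/* hypothetical driver state */
	void __iomem *base;
	struct gpio_desc *cs_gpio;
};

/* "pump FIFO": runs in the client's hard-IRQ handler. Assert CS with the
 * non-sleeping gpiod_set_value(), stuff the whole message into the TX FIFO
 * and return immediately; the transfer proceeds in hardware. */
static void foo_spi_fast_pump(struct foo_spi *fs, const u8 *tx, unsigned int len)
{
	unsigned int i;

	gpiod_set_value(fs->cs_gpio, 1);
	for (i = 0; i < len; i++)
		writel(tx[i], fs->base + FOO_TXDATA);
}

/* "poll FIFO": runs later in the client's IRQ thread. Busy-wait until all
 * RX words have arrived, then deassert CS. */
static void foo_spi_fast_poll(struct foo_spi *fs, u8 *rx, unsigned int len)
{
	unsigned int i;

	for (i = 0; i < len; i++) {
		while (!(readl(fs->base + FOO_STAT) & FOO_STAT_RXRDY))
			cpu_relax();
		rx[i] = readl(fs->base + FOO_RXDATA);
	}
	gpiod_set_value(fs->cs_gpio, 0);
}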

> > Option 3 would require a separation of spi_sync_transfer into two halves. One
> > half just activates CS (non-sleep GPIO api!) and fills the FIFO. The second
> > half polls the FIFO for transfer completion. This path could only be chosen if
> > the SPI controller has a FIFO that can hold the whole message. In other words a
> > lot of spacial case handling for what it is worth probably... but still
> > interesting.  
> 
> Yes, that's the whole point.  This also flows nicely when you've got a
> queue since you can restart the hardware from the interrupt context
> without waiting to complete the transfer that just finished.

Ack. The only caveat I see is the requirement that CS be settable from a
non-sleepable context. Not every GPIO can be driven with gpiod_set_value();
some, e.g. pins on I2C GPIO expanders, only support
gpiod_set_value_cansleep(). Such a pin is a poor choice for a CS signal
anyway, so I doubt it is found in the wild very often. In that (very
unlikely) case the SPI subsystem would need to fall back to the worker
queue, so probably not a big deal.
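
FWIW, gpiolib already has gpiod_cansleep() to detect exactly that, so the
core could decide per device whether the hard-IRQ fast path is usable.
Something along these lines (the helper itself and the assumption that a
native, non-GPIO CS is always atomic-safe are made up for illustration):

#include <linux/gpio/consumer.h>
#include <linux/spi/spi.h>

/* Sketch: can this device's chip-select be driven from hard-IRQ context?
 * gpiod_cansleep() is existing gpiolib API. */
static bool example_cs_usable_in_hardirq(struct spi_device *spi)
{
	/* Native chip-select handled by the controller hardware. */
	if (!spi->cs_gpiod)
		return true;

	/* A CS on e.g. an I2C GPIO expander can only sleep ->
	 * fall back to the worker queue for this device. */
	return !gpiod_cansleep(spi->cs_gpiod);
}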

> > Option 2 is probably not that bad if the SPI worker can run on another core?  
> 
> Pretty much anything benefits with another core.

Ack. But there are still quite a few single-core SoCs out there, and for
those, Option 2 is the worst.

Best regards,

-- 
David Jander


