On Mon, 23 May 2022 16:59:35 +0200
Marc Kleine-Budde <mkl@xxxxxxxxxxxxxx> wrote:

> On 23.05.2022 16:48:32, David Jander wrote:
> > Btw, I just discovered that there was indeed another often unnecessary
> > context switch happening in spi_sync(). When spi_finalize_current_message()
> > is called, it will always queue work, even if we just managed to do
> > everything in the calling context:
> >
> > https://elixir.bootlin.com/linux/v5.18-rc7/source/drivers/spi/spi.c#L1909
> >
> > This is needed to have the SPI worker thread unprepare transfer
> > hardware and free the dummy buffers if required. My question is why
> > this needs to be done from the thread. Putting the relevant code after
> > the call to ctlr->transfer_one_message() in __spi_pump_messages() saves
> > this extra bounce to the worker thread if no more messages are queued,
> > saving ca. 10-12us between consecutive spi_sync messages.
>
> Martin Sperl tried to do a delayed teardown, see:
>
> | 412e60373245 spi: core: avoid waking pump thread from spi_sync instead run teardown delayed
>
> But that turned out to not work properly:
>
> | https://lore.kernel.org/all/f86eaebb-0359-13be-f4a2-4f2b8832252e@xxxxxxxxxx/

Ah, thanks for sharing. Added Martin to CC here.

I have been struggling with this too. There are definitely dragons somewhere.
I have tried to do the teardown in the same context when possible, similar to
this:

@@ -1718,6 +1718,21 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
                 goto out;
         }

+        spin_lock_irqsave(&ctlr->queue_lock, flags);
+        /* Check if we can shutdown the controller now */
+        if ((list_empty(&ctlr->queue) || !ctlr->running) && (!in_kthread)) {
+                if (!ctlr->dummy_rx && !ctlr->dummy_tx &&
+                    !ctlr->unprepare_transfer_hardware) {
+                        spi_idle_runtime_pm(ctlr);
+                        ctlr->busy = false;
+                        trace_spi_controller_idle(ctlr);
+                } else {
+                        /* Ensure non-atomic teardown is done in the thread */
+                        kthread_queue_work(ctlr->kworker,
+                                           &ctlr->pump_messages);
+                }
+        }
+        spin_unlock_irqrestore(&ctlr->queue_lock, flags);
 out:
         mutex_unlock(&ctlr->io_mutex);

This is almost a copy/paste of the same piece of code further up. The idea was
that, in the case where the whole transfer was done in the calling context
anyway, one could simply check after calling ctlr->transfer_one_message()
whether the controller can be made idle immediately. This turned out to be
racy for some reason, and I have not yet discovered exactly why. On the
occasions the code didn't hit the race it did seem to have a notable
performance impact, so removing this context switch may be worth the effort.

Unfortunately, besides being racy, this change also introduces an extra
acquisition of the queue_lock spinlock, which (as my previous patch shows)
negates part of the performance gain.

This made me think about ways to reduce these lock acquisitions even further,
so I mapped out the different code paths that take the queue spinlock, the bus
spinlock, the bus mutex and the io mutex. It appears the code is optimized for
concurrency, on the assumption that the locking overhead is negligible.
Unfortunately we now know that this isn't the case when dealing with small
sync transfers at medium to high clock speeds.

Currently there are ctlr->queue, ctlr->cur_msg and some other state variables
and flags that are accessed from three different places: spi_sync(),
spi_async() and the worker thread. The locking paths look approximately like
this (for one single message):

1. spi_sync()
   --> bus_lock_mutex
   --> __spi_sync()
   --> bus_lock_spinlock
   --> queue_lock --> list_add_tail()
   --> __spi_pump_messages() (also entered here from WQ)
   --> queue_lock --> list_first_entry()
   --> io_mutex
   --> transfer_one_message()
       --> spi_finalize_current_message()
           --> queue_lock --> get cur_msg
           --> queue_lock --> kthread_queue_work() --> reschedule
           --> complete() **
   (from thread) __spi_pump_messages() --> queue_lock --> teardown --> done.
   (from main context) wait_for_completion() ** --> internal spinlock

2. worker thread
   --> __spi_pump_messages()
       ... same as above from the first __spi_pump_messages()

3. spi_async()
   --> bus_lock_spinlock
   --> __spi_async()
   --> queue_lock --> list_add_tail()
   --> kthread_queue_work()

From what I understand of this, bus_lock_mutex is used to serialize spi_sync()
calls on a given bus, so there cannot be any concurrency from different
threads doing spi_sync() calls to the same bus. This means that if spi_sync()
were the only path in existence, bus_lock_mutex alone would suffice, and all
the other spinlocks and mutexes could go. Big win. But the async path is what
complicates everything.

So I have been thinking... what if we made the sync and the async cases two
separate paths, with two separate message queues? In fact, the sync path
doesn't really need a queue at all, since it just handles one message from
beginning to end and doesn't return until the message is done. It doesn't need
the worker thread either, AFAICS. Or am I missing something? In the end, both
paths would converge at the io_mutex.
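Just to illustrate the direction I have in mind, here is a very rough,
untested sketch of such a direct sync path (the function name is made up, and
message preparation/mapping, statistics, cur_msg bookkeeping and the fact that
spi_finalize_current_message() would have to stop bouncing to the worker are
all glossed over):

#include <linux/mutex.h>
#include <linux/spi/spi.h>

/*
 * Hypothetical direct sync path: no ctlr->queue, no queue_lock, no worker
 * thread. Serialized against other spi_sync() callers by bus_lock_mutex and
 * against the async/queued path by io_mutex only.
 */
static int __spi_sync_direct(struct spi_device *spi, struct spi_message *msg)
{
        struct spi_controller *ctlr = spi->controller;
        int ret;

        msg->spi = spi;

        /* Serialize sync users of this bus. */
        mutex_lock(&ctlr->bus_lock_mutex);

        /* This is where the sync and async paths would converge. */
        mutex_lock(&ctlr->io_mutex);

        /* Transfer and teardown happen entirely in the calling context. */
        ret = ctlr->transfer_one_message(ctlr, msg);

        mutex_unlock(&ctlr->io_mutex);
        mutex_unlock(&ctlr->bus_lock_mutex);

        return ret;
}

The async path would keep its queue and worker thread as-is, and the two
paths would then only contend on the io_mutex.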
I am tempted to try this route. Any thoughts?

Best regards,

--
David Jander