On Mon, 23 May 2022 16:59:35 +0200
Marc Kleine-Budde <mkl@xxxxxxxxxxxxxx> wrote:

> On 23.05.2022 16:48:32, David Jander wrote:
> > Btw, I just discovered that there was indeed another often unnecessary
> > context switch happening in spi_sync(). When spi_finalize_current_message()
> > is called, it will always queue work, even if we just managed to do
> > everything in the calling context:
> >
> > https://elixir.bootlin.com/linux/v5.18-rc7/source/drivers/spi/spi.c#L1909
> >
> > This is needed to have the SPI worker thread unprepare transfer
> > hardware and free the dummy buffers if required. My question is why
> > this needs to be done from the thread. Putting the relevant code after
> > the call to ctlr->transfer_one_message() in __spi_pump_messages() saves
> > this extra bounce to the worker thread if no more messages are queued,
> > saving ca. 10-12us between consecutive spi_sync messages.
>
> Martin Sperl tried to do a delayed teardown, see:
>
> | 412e60373245 spi: core: avoid waking pump thread from spi_sync instead run teardown delayed
>
> But that turned out to not work properly:
>
> | https://lore.kernel.org/all/f86eaebb-0359-13be-f4a2-4f2b8832252e@xxxxxxxxxx/

Ah, thanks for sharing. Added Martin to CC here.

I have been struggling with this too. There are definitely dragons somewhere.
I have tried to do the teardown in the same context when possible, similar to
this:

@@ -1718,6 +1718,21 @@ static void __spi_pump_messages(struct spi_controller *ctlr, bool in_kthread)
                 goto out;
         }

+        spin_lock_irqsave(&ctlr->queue_lock, flags);
+        /* Check if we can shutdown the controller now */
+        if ((list_empty(&ctlr->queue) || !ctlr->running) && (!in_kthread)) {
+                if (!ctlr->dummy_rx && !ctlr->dummy_tx &&
+                    !ctlr->unprepare_transfer_hardware) {
+                        spi_idle_runtime_pm(ctlr);
+                        ctlr->busy = false;
+                        trace_spi_controller_idle(ctlr);
+                } else {
+                        /* Ensure non-atomic teardown is done in the thread */
+                        kthread_queue_work(ctlr->kworker,
+                                           &ctlr->pump_messages);
+                }
+        }
+        spin_unlock_irqrestore(&ctlr->queue_lock, flags);
 out:
         mutex_unlock(&ctlr->io_mutex);

This is almost a copy/paste of the same piece of code further up. The idea was
that, in the case where the whole transfer was done in the calling context
anyway, one could simply check after calling ctlr->transfer_one_message()
whether the controller can be made idle immediately. This turned out to be
racy for some reason, and I have not yet discovered exactly why. On the
occasions the code didn't hit the race it did seem to have a notable
performance impact, so removing this context switch may be worth the effort.

Unfortunately, besides being racy, this change also introduces an extra
acquisition of the queue_lock spinlock, which (as my previous patch shows)
negates part of the performance gain.

This made me think about ways to reduce these lock acquisitions even further,
so I mapped out the different code paths that take the queue spinlock, the bus
spinlock, the bus mutex and the io mutex. It appears the code is optimized for
concurrency, on the assumption that the locking overhead is negligible.
Unfortunately we now know that this isn't the case when dealing with small
sync transfers at medium to high clock speeds.

Currently there are ctlr->queue, ctlr->cur_msg and some other state variables
and flags that are accessed from three different places: spi_sync(),
spi_async() and the worker thread. The locking paths look approximately like
this (for one single message):

1. spi_sync()
   --> bus_lock_mutex
   --> __spi_sync()
   --> bus_lock_spinlock
   --> queue_lock --> list_add_tail()
   --> __spi_pump_messages() (also entered here from WQ)
   --> queue_lock --> list_first_entry()
   --> io_mutex
   --> transfer_one_message()
       --> spi_finalize_current_message()
           --> queue_lock --> get cur_msg
           --> queue_lock --> kthread_queue_work() --> reschedule
           --> complete() **
   (from thread) __spi_pump_messages() --> queue_lock --> teardown --> done.
   (from main context) wait_for_completion() ** --> internal spinlock

2. worker thread
   --> __spi_pump_messages()
       ... same as above from the first __spi_pump_messages()

3. spi_async()
   --> bus_lock_spinlock
   --> __spi_async()
   --> queue_lock --> list_add_tail()
   --> kthread_queue_work()

From what I understand of this, bus_lock_mutex is used to serialize spi_sync()
calls on a given bus, so there cannot be any concurrency from different
threads doing spi_sync() calls to the same bus. This means that if spi_sync()
were the only path in existence, bus_lock_mutex alone would suffice, and all
the other spinlocks and mutexes could go. Big win. But the async path is what
complicates everything.

So I have been thinking... what if we made the sync and the async cases two
separate paths, with two separate message queues? In fact, the sync path
doesn't really need a queue at all, since it just handles one message from
beginning to end and doesn't return until the message is done. It doesn't need
the worker thread either, AFAICS. Or am I missing something? In the end, both
paths would converge at the io_mutex.
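Just to illustrate the direction I have in mind, here is a very rough,
untested sketch of such a direct sync path (the function name is made up, and
message preparation/mapping, statistics, cur_msg bookkeeping and the fact that
spi_finalize_current_message() would have to stop bouncing to the worker are
all glossed over):

#include <linux/mutex.h>
#include <linux/spi/spi.h>

/*
 * Hypothetical direct sync path: no ctlr->queue, no queue_lock, no worker
 * thread. Serialized against other spi_sync() callers by bus_lock_mutex and
 * against the async/queued path by io_mutex only.
 */
static int __spi_sync_direct(struct spi_device *spi, struct spi_message *msg)
{
        struct spi_controller *ctlr = spi->controller;
        int ret;

        msg->spi = spi;

        /* Serialize sync users of this bus. */
        mutex_lock(&ctlr->bus_lock_mutex);

        /* This is where the sync and async paths would converge. */
        mutex_lock(&ctlr->io_mutex);

        /* Transfer and teardown happen entirely in the calling context. */
        ret = ctlr->transfer_one_message(ctlr, msg);

        mutex_unlock(&ctlr->io_mutex);
        mutex_unlock(&ctlr->bus_lock_mutex);

        return ret;
}

The async path would keep its queue and worker thread as-is, and the two
paths would then only contend on the io_mutex.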
I am tempted to try this route. Any thoughts?

Best regards,

--
David Jander