Quoting Douglas Anderson (2020-06-16 03:40:47) > If you added a bit of a delay (like a trace_printk) into the ISR for > the spi-geni-qcom driver, you would suddenly start seeing some errors > spit out. The problem was that, though the ISR itself held a lock, > other parts of the driver didn't always grab the lock. > > One example race was this: > a) Driver queues off a command to set a Chip Select (CS). > b) ISR fires indicating the CS is done. > c) Done bit is set, so we complete(). > d) Second CPU gallops off and starts a transfer. > e) Second CPU starts messing with hardware / software state (not under > spinlock). > f) ISR now does things like set "mas->cur_mcmd" to CMD_NONE, prints > errors if "tx_rem_bytes" / "rx_rem_bytes" have been set, and also > Acks all interrupts it handled. Can we get a CPU0/CPU1 diagram here? At point e) I got sort of lost. And maybe it's not even a dual CPU problem? i.e. it can happen on one CPU? CPU0 ---- a) spi_geni_set_cs() mas->cur_mcmd = CMD_CS wait_for_completion_timeout(&xfer_done) b) <INTERRUPT> geni_spi_isr() c) complete(&xfer_done); <END INTERRUPT> pm_runtime_put(mas->dev); d) galloping? I got lost... Sorry! > > Let's fix this. Before we start messing with hardware, we'll grab the > lock to make sure that the IRQ handler from some previous command has > really finished. We don't need to hold the lock unless we're in a > state where more interrupts can come in, but we at least need to make > sure the previous IRQ is done. This lock is used exclusively to > prevent the IRQ handler and non-IRQ from stomping on each other. The > SPI core handles all other mutual exclusion. Ok maybe the problem is the completion at c) never happens until the wait_for_completion_timeout() times out? > > As part of this, we change the way that the IRQ handler detects > spurious interrupts. Previously we checked for our state variable > being set to IRQ_NONE, but that was done outside the spinlock. We > could move it into the spinlock, but instead let's just change it to > look for the lack of any IRQ status bits being set. This can be done > outside the lock--the hardware certainly isn't grabbing or looking at > the spinlock when it updates its status register. > > It's possible that this will fix real (but very rare) errors seen in > the field that look like: > irq ...: nobody cared (try booting with the "irqpoll" option) > > NOTE: an alternate strategy considered here was to always make the > complete() / spi_finalize_current_transfer() the very last thing in > our IRQ handler. With such a change you could consider that we could > be "lockless". In that case, though, we'd have to be very careful w/ > memory barriers so we made sure we didn't have any bugs with weakly > ordered memory. Using spinlocks makes the driver much easier to > understand. > > Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP") > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx> > --- > > Changes in v3: > - Split out some lock cleanup to previous patch. > - Don't need to read IRQ status register inside spinlock. > - Don't check for state CMD_NONE; later patch is removing state var. > - Don't hold the lock for all of setup_fifo_xfer(). > - Comment about why it's safe to Ack interrupts at the end. > - Subject/desc changed since race is definitely there. > > Changes in v2: > - Detect true spurious interrupt. > - Still return IRQ_NONE for state machine mismatch, but print warn. > > drivers/spi/spi-geni-qcom.c | 45 ++++++++++++++++++++++++++++++++++--- > 1 file changed, 42 insertions(+), 3 deletions(-) > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > index c7d2c7e45b3f..e0f0e5c241f3 100644 > --- a/drivers/spi/spi-geni-qcom.c > +++ b/drivers/spi/spi-geni-qcom.c > @@ -367,6 +384,12 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > } > writel(spi_tx_cfg, se->base + SE_SPI_TRANS_CFG); > mas->cur_mcmd = CMD_XFER; > + > + /* > + * Lock around right before we start the transfer since our > + * interrupt controller could come in at any time now. drop 'controller'? > + */ > + spin_lock_irq(&mas->lock); > geni_se_setup_m_cmd(se, m_cmd, FRAGMENTATION); > > /*