Hi, On Wed, Jun 17, 2020 at 1:53 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote: > > Quoting Douglas Anderson (2020-06-16 03:40:47) > > If you added a bit of a delay (like a trace_printk) into the ISR for > > the spi-geni-qcom driver, you would suddenly start seeing some errors > > spit out. The problem was that, though the ISR itself held a lock, > > other parts of the driver didn't always grab the lock. > > > > One example race was this: > > a) Driver queues off a command to set a Chip Select (CS). > > b) ISR fires indicating the CS is done. > > c) Done bit is set, so we complete(). > > d) Second CPU gallops off and starts a transfer. > > e) Second CPU starts messing with hardware / software state (not under > > spinlock). > > f) ISR now does things like set "mas->cur_mcmd" to CMD_NONE, prints > > errors if "tx_rem_bytes" / "rx_rem_bytes" have been set, and also > > Acks all interrupts it handled. > > Can we get a CPU0/CPU1 diagram here? At point e) I got sort of lost. And > maybe it's not even a dual CPU problem? i.e. it can happen on one CPU? > > CPU0 > ---- > a) spi_geni_set_cs() > mas->cur_mcmd = CMD_CS > wait_for_completion_timeout(&xfer_done) > b) <INTERRUPT> > geni_spi_isr() > c) complete(&xfer_done); > <END INTERRUPT> > pm_runtime_put(mas->dev); > d) galloping? > > I got lost... Sorry! I think you need two CPUs, at least for the race I'm thinking of. Maybe this is clearer? CPU1: => spi_geni_set_cs() starts => spi_geni_set_cs() calls wait_for_completion_timeout(&xfer_done) CPU0: => geni_spi_isr() starts => geni_spi_isr() calls complete(&xfer_done) => geni_spi_isr() stalls CPU1: => spi_geni_set_cs() call to wait_for_completion_timeout() finishes => spi_geni_set_cs() exits. => spi_geni_transfer_one() is called => spi_geni_transfer_one() calls setup_fifo_xfer() => setup_fifo_xfer() sets "cur_mcmd" => setup_fifo_xfer() sets "tx_rem_bytes" => setup_fifo_xfer() sets "rx_rem_bytes" => setup_fifo_xfer() kicks off a transfer CPU0: => geni_spi_isr() finishes stalling => geni_spi_isr() sets "cur_mcmd" to NULL => geni_spi_isr() checks "tx_rem_bytes" to confirm it's 0. => geni_spi_isr() checks "rx_rem_bytes" to confirm it's 0. => geni_spi_isr() clears any "DONE" interrupt that is pending I can update the commit message to have that if it's helpful and makes sense. In the above example I have a fake "stall" that wouldn't really happen, but in general if adding a delay somewhere creates a race condition then the race condition was there anyway. Also, with weakly ordered memory it's possible that a write on one CPU could clobber a write made by another CPU even if they happened in opposite orders. The race is fixed by my patch because when CPU1 starts setup_fifo_xfer() it won't be able to grab the spinlock until the ISR is totally done. > > Let's fix this. Before we start messing with hardware, we'll grab the > > lock to make sure that the IRQ handler from some previous command has > > really finished. We don't need to hold the lock unless we're in a > > state where more interrupts can come in, but we at least need to make > > sure the previous IRQ is done. This lock is used exclusively to > > prevent the IRQ handler and non-IRQ from stomping on each other. The > > SPI core handles all other mutual exclusion. > > Ok maybe the problem is the completion at c) never happens until the > wait_for_completion_timeout() times out? No need for a timeout to happen. Just putting a few "trace_printk" calls in the ISR at strategic places was enough for me to see the race. > > As part of this, we change the way that the IRQ handler detects > > spurious interrupts. Previously we checked for our state variable > > being set to IRQ_NONE, but that was done outside the spinlock. We > > could move it into the spinlock, but instead let's just change it to > > look for the lack of any IRQ status bits being set. This can be done > > outside the lock--the hardware certainly isn't grabbing or looking at > > the spinlock when it updates its status register. > > > > It's possible that this will fix real (but very rare) errors seen in > > the field that look like: > > irq ...: nobody cared (try booting with the "irqpoll" option) > > > > NOTE: an alternate strategy considered here was to always make the > > complete() / spi_finalize_current_transfer() the very last thing in > > our IRQ handler. With such a change you could consider that we could > > be "lockless". In that case, though, we'd have to be very careful w/ > > memory barriers so we made sure we didn't have any bugs with weakly > > ordered memory. Using spinlocks makes the driver much easier to > > understand. > > > > Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP") > > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx> > > --- > > > > Changes in v3: > > - Split out some lock cleanup to previous patch. > > - Don't need to read IRQ status register inside spinlock. > > - Don't check for state CMD_NONE; later patch is removing state var. > > - Don't hold the lock for all of setup_fifo_xfer(). > > - Comment about why it's safe to Ack interrupts at the end. > > - Subject/desc changed since race is definitely there. > > > > Changes in v2: > > - Detect true spurious interrupt. > > - Still return IRQ_NONE for state machine mismatch, but print warn. > > > > drivers/spi/spi-geni-qcom.c | 45 ++++++++++++++++++++++++++++++++++--- > > 1 file changed, 42 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > > index c7d2c7e45b3f..e0f0e5c241f3 100644 > > --- a/drivers/spi/spi-geni-qcom.c > > +++ b/drivers/spi/spi-geni-qcom.c > > @@ -367,6 +384,12 @@ static void setup_fifo_xfer(struct spi_transfer *xfer, > > } > > writel(spi_tx_cfg, se->base + SE_SPI_TRANS_CFG); > > mas->cur_mcmd = CMD_XFER; > > + > > + /* > > + * Lock around right before we start the transfer since our > > + * interrupt controller could come in at any time now. > > drop 'controller'? Sure, I'll fix in the next spin. -Doug