Quoting Doug Anderson (2020-06-17 14:19:29)
> On Wed, Jun 17, 2020 at 1:53 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
> >
> > Quoting Douglas Anderson (2020-06-16 03:40:47)
> > > If you added a bit of a delay (like a trace_printk) into the ISR for
> > > the spi-geni-qcom driver, you would suddenly start seeing some errors
> > > spit out.  The problem was that, though the ISR itself held a lock,
> > > other parts of the driver didn't always grab the lock.
> > >
> > > One example race was this:
> > >  a) Driver queues off a command to set a Chip Select (CS).
> > >  b) ISR fires indicating the CS is done.
> > >  c) Done bit is set, so we complete().
> > >  d) Second CPU gallops off and starts a transfer.
> > >  e) Second CPU starts messing with hardware / software state (not under
> > >     spinlock).
> > >  f) ISR now does things like set "mas->cur_mcmd" to CMD_NONE, prints
> > >     errors if "tx_rem_bytes" / "rx_rem_bytes" have been set, and also
> > >     Acks all interrupts it handled.
> >
> > Can we get a CPU0/CPU1 diagram here? At point e) I got sort of lost. And
> > maybe it's not even a dual CPU problem? i.e. it can happen on one CPU?
> >
> >  CPU0
> >  ----
> > a) spi_geni_set_cs()
> >      mas->cur_mcmd = CMD_CS
> >      wait_for_completion_timeout(&xfer_done)
> > b)   <INTERRUPT>
> >      geni_spi_isr()
> > c)     complete(&xfer_done);
> >      <END INTERRUPT>
> >    pm_runtime_put(mas->dev);
> > d) galloping?
> >
> > I got lost... Sorry!
>
> I think you need two CPUs, at least for the race I'm thinking of.
> Maybe this is clearer?

With threaded irqs I think you only need one CPU, but that's just a
minor detail. Drawing it with two CPUs is clearer and easier to
understand.

>
> CPU1:
>  => spi_geni_set_cs() starts
>  => spi_geni_set_cs() calls wait_for_completion_timeout(&xfer_done)
> CPU0:
>  => geni_spi_isr() starts
>  => geni_spi_isr() calls complete(&xfer_done)
>  => geni_spi_isr() stalls
> CPU1:
>  => spi_geni_set_cs() call to wait_for_completion_timeout() finishes
>  => spi_geni_set_cs() exits.
>  => spi_geni_transfer_one() is called
>  => spi_geni_transfer_one() calls setup_fifo_xfer()
>  => setup_fifo_xfer() sets "cur_mcmd"
>  => setup_fifo_xfer() sets "tx_rem_bytes"
>  => setup_fifo_xfer() sets "rx_rem_bytes"
>  => setup_fifo_xfer() kicks off a transfer
> CPU0:
>  => geni_spi_isr() finishes stalling
>  => geni_spi_isr() sets "cur_mcmd" to NULL
>  => geni_spi_isr() checks "tx_rem_bytes" to confirm it's 0.
>  => geni_spi_isr() checks "rx_rem_bytes" to confirm it's 0.
>  => geni_spi_isr() clears any "DONE" interrupt that is pending
>
> I can update the commit message to have that if it's helpful and makes
> sense.  In the above example I have a fake "stall" that wouldn't
> really happen, but in general if adding a delay somewhere creates a
> race condition then the race condition was there anyway.  Also, with
> weakly ordered memory it's possible that a write on one CPU could
> clobber a write made by another CPU even if they happened in opposite
> orders.
>
> The race is fixed by my patch because when CPU1 starts
> setup_fifo_xfer() it won't be able to grab the spinlock until the ISR
> is totally done.

Ok. Then the diagram would look like this:

 CPU0                                      CPU1
 ----                                      ----
 spi_geni_set_cs()
  mas->cur_mcmd = CMD_CS;
  geni_se_setup_m_cmd(...)
  wait_for_completion_timeout(&xfer_done);
                                           <INTERRUPT>
                                           geni_spi_isr()
                                            complete(&xfer_done);
  <wakeup>
  pm_runtime_put(mas->dev);
  ... // back to SPI core
 spi_geni_transfer_one()
  setup_fifo_xfer()
   mas->cur_mcmd = CMD_XFER;
                                            mas->cur_mcmd = CMD_NONE; // bad!
                                            return IRQ_HANDLED;

Time flows down as it usually does in these diagrams. No need to put
stalls in. Reading all those lines and keeping track of which CPU
they're running on makes it harder for me to see the two running in
parallel the way it's shown above.
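[Editor's note: to make the fix under discussion concrete, here is a minimal
sketch of the locking pattern being described: both the ISR and the
transfer-setup path take the same spinlock, so setup_fifo_xfer() on one CPU
cannot start touching cur_mcmd / tx_rem_bytes / rx_rem_bytes until
geni_spi_isr() on the other CPU has completely finished and acked its
interrupts. This is an illustration of the idea only, not the actual patch;
the struct fields, enum values, and function signatures are assumed from the
thread above, and the real bodies are heavily trimmed.]

#include <linux/interrupt.h>
#include <linux/spinlock.h>
#include <linux/spi/spi.h>

/*
 * Sketch only -- not the actual spi-geni-qcom patch.  "struct
 * spi_geni_master", "CMD_NONE"/"CMD_XFER" and the field names are
 * taken from the discussion above and defined by the driver itself.
 */

static irqreturn_t geni_spi_isr(int irq, void *data)
{
        struct spi_geni_master *mas = data;

        /* Hardirq context: local IRQs are already off, plain spin_lock(). */
        spin_lock(&mas->lock);

        /* ... read and ack status, push/pull FIFO words ... */

        mas->cur_mcmd = CMD_NONE;
        complete(&mas->xfer_done);

        spin_unlock(&mas->lock);
        return IRQ_HANDLED;
}

static void setup_fifo_xfer(struct spi_transfer *xfer,
                            struct spi_geni_master *mas)
{
        unsigned long flags;

        /*
         * Taking the same lock (with local IRQs off) means we cannot race
         * with geni_spi_isr(): on another CPU it must finish and drop the
         * lock before we proceed, and on this CPU it cannot fire at all
         * until we unlock.
         */
        spin_lock_irqsave(&mas->lock, flags);

        mas->cur_mcmd = CMD_XFER;
        mas->tx_rem_bytes = xfer->len;  /* simplified: ignores bits_per_word */
        mas->rx_rem_bytes = xfer->len;

        /* ... program the sequencer and kick off the transfer ... */

        spin_unlock_irqrestore(&mas->lock, flags);
}

[Either way the lock is what serializes things: on a second CPU (or in a
threaded handler) the ISR spins until setup_fifo_xfer() unlocks, and
disabling local interrupts in the setup path keeps a hardirq handler from
preempting it on the same CPU, which also covers the single-CPU case
mentioned above.]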