Hi, On Wed, Dec 2, 2020 at 11:45 PM Roja Rani Yarubandi <rojay@xxxxxxxxxxxxxx> wrote: > > Here, there is a chance of race condition occurrence which leads to > NULL pointer dereference with struct spi_geni_master member 'cur_xfer' > between setup_fifo_xfer() and handle_fifo_timeout() functions. > > Fix this race condition with guarding the 'cur_xfer' where it gets updated, > with spin_lock_irq/spin_unlock_irq in setup_fifo_xfer() as we do in > handle_fifo_timeout() function. > > Call trace: > geni_spi_isr+0x114/0x34c > __handle_irq_event_percpu+0xe0/0x23c > handle_irq_event_percpu+0x34/0x8c > handle_irq_event+0x48/0x94 > handle_fasteoi_irq+0xd0/0x140 > __handle_domain_irq+0x8c/0xcc > gic_handle_irq+0x114/0x1dc > el1_irq+0xcc/0x180 > geni_spi a80000.spi: Failed to cancel/abort m_cmd > dev_watchdog+0x348/0x354 > call_timer_fn+0xc4/0x220 > __run_timers+0x228/0x2d4 > spi_master spi6: failed to transfer one message from queue > run_timer_softirq+0x24/0x44 > __do_softirq+0x16c/0x344 > irq_exit+0xa8/0xac > __handle_domain_irq+0x94/0xcc > gic_handle_irq+0x114/0x1dc > el1_irq+0xcc/0x180 > cpuidle_enter_state+0xf8/0x204 > cpuidle_enter+0x38/0x4c > cros-ec-spi spi6.0: spi transfer failed: -110 Thanks for the patch! I'm not convinced about your explanation, though. The overall assumption in setup_fifo_xfer(), as indicated by the big comment at the start of the function, is that once the spin_lock_irq() / spin_unlock_irq() dance happens at the start of that function that no more interrupts will come in until the later lock. If that assumption is not true then we need to look at the whole function, not just bandaid where we're setting "cur_xfer". Specifically if transfers are still happening all the other things that setup_fifo_xfer() is doing are also all bad ideas. I guess the first question is: why are you even getting a timeout? SPI is clocked by the master and there's nothing the slave can do to delay a transfer. If you're getting a timeout in the first place it means there's something pretty wrong. Maybe that would be the first thing to look at. I guess the most likely case is that something else in the system is somehow blocking our interrupt handler from running? The second question is: what can we do about the "Failed to cancel/abort m_cmd" message above. I guess if our interrupt handler is blocked that would explain why the cancel and abort didn't work. It's pretty hard to figure out what to do to recover at this point. When the function exits it assumes that the transfer has been canceled / aborted, but it hasn't. This seems like it has the ability to throw everything for a loop. If an interrupt can come in for the previous transfer after handle_fifo_timeout() exits it could cause all sorts of problems, right? It seems like one thing that might help would be to add this to the start of handle_fifo_timeout(): dev_warn(mas->dev, "Timeout; synchronizing IRQ\n"); synchronize_irq(mas->irq); Maybe also add a synchronize_irq() at the end of the function too? If my theory is correct and the whole problem here is that our interrupt is blocked, I think that will make us block, which is probably the best we can do and better than just returning and getting the interrupt later. That all being said, looking at all this makes me believe that the NULL dereference is because of commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state variable"). There, I added "mas->cur_xfer = NULL" into handle_fifo_timeout(). That should be the right thing to do (without that we might have been reading bogus data / writing to memory we shouldn't), but I think geni_spi_handle_rx() and geni_spi_handle_tx() don't handle it properly. I think we'll dereference it without checking. :( It would probably be good to fix this too? I would guess that if "mas->cur_xfer" is NULL then geni_spi_handle_rx() should read all data in the FIFO and throw it away and geni_spi_handle_tx() should set SE_GENI_TX_WATERMARK_REG to 0. NOTE: I _think_ that with the synchronize_irq() I'm suggesting above we'll avoid this case, but it never hurts to be defensive. Does that all make sense? So the summary is that instead of your patch: 1. Add synchronize_irq() at the start and end of handle_fifo_timeout(). Not under lock. 2. In geni_spi_handle_rx(), check for NULL "mas->cur_xfer". Read all data in the FIFO (don't cap at rx_rem_bytes), but throw it away. 3. In geni_spi_handle_tx(), check for NULL "mas->cur_xfer". Don't write any data. Just write 0 to SE_GENI_TX_WATERMARK_REG. I think #1 is the real fix, but #2 and #3 will avoid crashes in case there's another bug somewhere. -Doug