Hi, On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote: > > Quoting Doug Anderson (2020-12-10 09:14:15) > > > > This is my untested belief of what's happening > > > > CPU0 CPU1 > > ---- ---- > > setup_fifo_xfer() > > ... > > geni_se_setup_m_cmd() > > <hardware starts transfer> > > <unrelated interrupt storm> spin_unlock_irq() > > <continued interrupt storm> <time passes> > > <continued interrupt storm> <transfer complets in hardware> > > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> > > <continued interrupt storm> <time passes> > > <continued interrupt storm> handle_fifo_timeout() > > <continued interrupt storm> spin_lock_irq() > > <continued interrupt storm> mas->cur_xfer = NULL > > <continued interrupt storm> geni_se_cancel_m_cmd() > > <continued interrupt storm> spin_unlock_irq() > > <continued interrupt storm> wait_for_completion_timeout() => timeout > > <continued interrupt storm> spin_lock_irq() > > <continued interrupt storm> geni_se_abort_m_cmd() > > <continued interrupt storm> spin_unlock_irq() > > <continued interrupt storm> wait_for_completion_timeout() => timeout > > <interrupt storm ends> > > geni_spi_isr() > > spin_lock() > > if (m_irq & M_RX_FIFO_WATERMARK_EN) > > geni_spi_handle_rx() > > mas->cur_xfer NULL derefrence > > Ok so the one line summary is "If geni_spi_isr() is sufficiently delayed > then we may deref NULL in the irq handler because the handler tries to > deref mas->cur_xfer after the timeout handling code has set it to NULL". Sure. > CPU0 CPU1 > ---- ---- > setup_fifo_xfer() > ... > geni_se_setup_m_cmd() > <hardware starts transfer> > unrelated_irq_handler() spin_unlock_irq() > ... > <transfer completes in hardware> > <hardware sets M_RX_FIFO_WATERMARK_EN> > > handle_fifo_timeout() > spin_lock_irq() > mas->cur_xfer = NULL > geni_se_cancel_m_cmd() > spin_unlock_irq() > wait_for_completion_timeout() => timeout > spin_lock_irq() > geni_se_abort_m_cmd() > spin_unlock_irq() > wait_for_completion_timeout() => timeout > return IRQ_HANDLED; > gic_handle_irq() > geni_spi_isr() > spin_lock() > if (m_irq & M_RX_FIFO_WATERMARK_EN) > geni_spi_handle_rx() > rx_buf = mas->cur_xfer->rx_buf <--- OOPS! > > > With my proposed fix, I believe that would transform into: > > > > CPU0 CPU1 > > ---- ---- > > setup_fifo_xfer() > > ... > > geni_se_setup_m_cmd() > > <hardware starts transfer> > > <unrelated interrupt storm> spin_unlock_irq() > > <continued interrupt storm> <time passes> > > <continued interrupt storm> <transfer complets in hardware> > > <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> > > <continued interrupt storm> <time passes> > > <continued interrupt storm> handle_fifo_timeout() > > <continued interrupt storm> synchronize_irq() > > <continued interrupt storm> <time passes> > > <interrupt storm ends> > > geni_spi_isr() > > ... > > <synchronize_irq() finishes> > > spin_lock_irq() > > mas->cur_xfer = NULL > > geni_se_cancel_m_cmd() > > spin_unlock_irq() > > geni_spi_isr() > > ... > > wait_for_completion_timeout() => success > > > > The extra synchronize_irq() I was suggesting at the end of the > > function would be an extra bit of paranoia. Maybe a new storm showed > > up while we were processing the timeout? > > Shouldn't we check in the timeout logic to see if m_irq has > M_RX_FIFO_WATERMARK_EN or M_TX_FIFO_WATERMARK_EN set instead? Similarly > for the CS assert/deassert stuff. If the timeout hits but one of those > bits are set then it seems we've run into some poor irqsoff section but > the hardware is still working. Calling synchronize_irq() wouldn't help > if the CPU handling the irqs (i.e. CPU0) had irqs off for a long time, > right? It will only ensure that other irq handlers have completed, which > may be a problem, but not the only one. > > TL;DR: Peek at the irq status register in the timeout logic and skip it > if the irq is pending? I don't have tons of experience with synchronize_irq(), but the function comment seems to indicate that as long as the interrupt is pending synchronize_irq() will do what we want even if the CPU that should handle the interrupt is in an irqsoff section. Digging a little bit I guess it relies upon the interrupt controller being able to read this state, but (hopefully) the GIC can? If it doesn't work like I think it does, I'd be OK with peeking in the IRQ status register, but we shouldn't _skip_ the logic IMO. As long as we believe that an interrupt could happen in the future we shouldn't return from handle_fifo_timeout(). It's impossible to reason about how future transfers would work if the pending interrupt from the previous transfer could fire at any point. -Doug