Hi, On Wed, Dec 9, 2020 at 7:17 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote: > > Quoting Doug Anderson (2020-12-03 08:40:46) > > > I would guess that if "mas->cur_xfer" is NULL then > > geni_spi_handle_rx() should read all data in the FIFO and throw it > > away and geni_spi_handle_tx() should set SE_GENI_TX_WATERMARK_REG to > > 0. NOTE: I _think_ that with the synchronize_irq() I'm suggesting > > above we'll avoid this case, but it never hurts to be defensive. > > > > > > Does that all make sense? So the summary is that instead of your patch: > > Can we get a CPU diagram describing the race and scenario where this > happens? Something like: > > CPU0 CPU1 > ---- ---- > setup_fifo_xfer() > spin_lock_irq(&mas->lock); > spin_unlock_irq(&mas->lock); > mas->cur_xfer = xfer > ... > <IRQ> > geni_spi_isr() > geni_spi_handle_rx() > <NULL deref boom explosion!> > > But obviously this example diagram is incorrect and some timeout happens > instead? Sorry, I'm super lazy and don't want to read many paragraphs of > text. :) I'd rather have a diagram like above that clearly points out > the steps taken to the NULL pointer deref. This is my untested belief of what's happening CPU0 CPU1 ---- ---- setup_fifo_xfer() ... geni_se_setup_m_cmd() <hardware starts transfer> <unrelated interrupt storm> spin_unlock_irq() <continued interrupt storm> <time passes> <continued interrupt storm> <transfer complets in hardware> <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> <continued interrupt storm> <time passes> <continued interrupt storm> handle_fifo_timeout() <continued interrupt storm> spin_lock_irq() <continued interrupt storm> mas->cur_xfer = NULL <continued interrupt storm> geni_se_cancel_m_cmd() <continued interrupt storm> spin_unlock_irq() <continued interrupt storm> wait_for_completion_timeout() => timeout <continued interrupt storm> spin_lock_irq() <continued interrupt storm> geni_se_abort_m_cmd() <continued interrupt storm> spin_unlock_irq() <continued interrupt storm> wait_for_completion_timeout() => timeout <interrupt storm ends> geni_spi_isr() spin_lock() if (m_irq & M_RX_FIFO_WATERMARK_EN) geni_spi_handle_rx() mas->cur_xfer NULL derefrence With my proposed fix, I believe that would transform into: CPU0 CPU1 ---- ---- setup_fifo_xfer() ... geni_se_setup_m_cmd() <hardware starts transfer> <unrelated interrupt storm> spin_unlock_irq() <continued interrupt storm> <time passes> <continued interrupt storm> <transfer complets in hardware> <continued interrupt storm> <hardware sets M_RX_FIFO_WATERMARK_EN> <continued interrupt storm> <time passes> <continued interrupt storm> handle_fifo_timeout() <continued interrupt storm> synchronize_irq() <continued interrupt storm> <time passes> <interrupt storm ends> geni_spi_isr() ... <synchronize_irq() finishes> spin_lock_irq() mas->cur_xfer = NULL geni_se_cancel_m_cmd() spin_unlock_irq() geni_spi_isr() ... wait_for_completion_timeout() => success The extra synchronize_irq() I was suggesting at the end of the function would be an extra bit of paranoia. Maybe a new storm showed up while we were processing the timeout? > > 1. Add synchronize_irq() at the start and end of > > handle_fifo_timeout(). Not under lock. > > > > 2. In geni_spi_handle_rx(), check for NULL "mas->cur_xfer". Read all > > data in the FIFO (don't cap at rx_rem_bytes), but throw it away. > > > > 3. In geni_spi_handle_tx(), check for NULL "mas->cur_xfer". Don't > > write any data. Just write 0 to SE_GENI_TX_WATERMARK_REG. > > > > I think #1 is the real fix, but #2 and #3 will avoid crashes in case > > there's another bug somewhere. > > Aren't 2 and 3 papering over some weird problem though where irqs are > coming in unexpectedly? I think that's what I said but in different words? #1 is the real fix but #2 and #3 will keep us from crashing (AKA paper over) if we have some other (unexpected) bug. We'll already have an error in the log in this case "Failed to cancel/abort m_cmd" so it doesn't feel necessary to crash with a NULL dereference... -Doug