Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr

Doug Anderson <dianders@xxxxxxxxxxxx> · Thu, 10 Dec 2020 15:07:39 -0800

Hi,

On Thu, Dec 10, 2020 at 2:58 PM Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
>
> Quoting Doug Anderson (2020-12-10 09:14:15)
> >
> > This is my untested belief of what's happening
> >
> >  CPU0                                CPU1
> >  ----                                ----
> >                                      setup_fifo_xfer()
> >                                       ...
> >                                       geni_se_setup_m_cmd()
> >                                       <hardware starts transfer>
> >  <unrelated interrupt storm>          spin_unlock_irq()
> >  <continued interrupt storm>         <time passes>
> >  <continued interrupt storm>         <transfer complets in hardware>
> >  <continued interrupt storm>         <hardware sets M_RX_FIFO_WATERMARK_EN>
> >  <continued interrupt storm>         <time passes>
> >  <continued interrupt storm>         handle_fifo_timeout()
> >  <continued interrupt storm>          spin_lock_irq()
> >  <continued interrupt storm>          mas->cur_xfer = NULL
> >  <continued interrupt storm>          geni_se_cancel_m_cmd()
> >  <continued interrupt storm>          spin_unlock_irq()
> >  <continued interrupt storm>          wait_for_completion_timeout() => timeout
> >  <continued interrupt storm>          spin_lock_irq()
> >  <continued interrupt storm>          geni_se_abort_m_cmd()
> >  <continued interrupt storm>          spin_unlock_irq()
> >  <continued interrupt storm>          wait_for_completion_timeout() => timeout
> >  <interrupt storm ends>
> >  geni_spi_isr()
> >   spin_lock()
> >   if (m_irq & M_RX_FIFO_WATERMARK_EN)
> >    geni_spi_handle_rx()
> >     mas->cur_xfer NULL derefrence
>
> Ok so the one line summary is "If geni_spi_isr() is sufficiently delayed
> then we may deref NULL in the irq handler because the handler tries to
> deref mas->cur_xfer after the timeout handling code has set it to NULL".

Sure.

>   CPU0                                CPU1
>   ----                                ----
>                                       setup_fifo_xfer()
>                                        ...
>                                        geni_se_setup_m_cmd()
>                                        <hardware starts transfer>
>   unrelated_irq_handler()              spin_unlock_irq()
>    ...
>                                         <transfer completes in hardware>
>                                         <hardware sets M_RX_FIFO_WATERMARK_EN>
>
>                                        handle_fifo_timeout()
>                                         spin_lock_irq()
>                                         mas->cur_xfer = NULL
>                                         geni_se_cancel_m_cmd()
>                                         spin_unlock_irq()
>                                         wait_for_completion_timeout() => timeout
>                                         spin_lock_irq()
>                                         geni_se_abort_m_cmd()
>                                         spin_unlock_irq()
>                                         wait_for_completion_timeout() => timeout
>    return IRQ_HANDLED;
>   gic_handle_irq()
>    geni_spi_isr()
>     spin_lock()
>     if (m_irq & M_RX_FIFO_WATERMARK_EN)
>      geni_spi_handle_rx()
>       rx_buf = mas->cur_xfer->rx_buf <--- OOPS!
>
> > With my proposed fix, I believe that would transform into:
> >
> >  CPU0                                CPU1
> >  ----                                ----
> >                                      setup_fifo_xfer()
> >                                       ...
> >                                       geni_se_setup_m_cmd()
> >                                       <hardware starts transfer>
> >  <unrelated interrupt storm>          spin_unlock_irq()
> >  <continued interrupt storm>         <time passes>
> >  <continued interrupt storm>         <transfer complets in hardware>
> >  <continued interrupt storm>         <hardware sets M_RX_FIFO_WATERMARK_EN>
> >  <continued interrupt storm>         <time passes>
> >  <continued interrupt storm>         handle_fifo_timeout()
> >  <continued interrupt storm>          synchronize_irq()
> >  <continued interrupt storm>           <time passes>
> >  <interrupt storm ends>
> >  geni_spi_isr()
> >   ...
> >                                        <synchronize_irq() finishes>
> >                                       spin_lock_irq()
> >                                       mas->cur_xfer = NULL
> >                                       geni_se_cancel_m_cmd()
> >                                       spin_unlock_irq()
> >  geni_spi_isr()
> >    ...
> >                                       wait_for_completion_timeout() => success
> >
> > The extra synchronize_irq() I was suggesting at the end of the
> > function would be an extra bit of paranoia.  Maybe a new storm showed
> > up while we were processing the timeout?
>
> Shouldn't we check in the timeout logic to see if m_irq has
> M_RX_FIFO_WATERMARK_EN or M_TX_FIFO_WATERMARK_EN set instead? Similarly
> for the CS assert/deassert stuff. If the timeout hits but one of those
> bits are set then it seems we've run into some poor irqsoff section but
> the hardware is still working. Calling synchronize_irq() wouldn't help
> if the CPU handling the irqs (i.e. CPU0) had irqs off for a long time,
> right? It will only ensure that other irq handlers have completed, which
> may be a problem, but not the only one.
>
> TL;DR: Peek at the irq status register in the timeout logic and skip it
> if the irq is pending?

I don't have tons of experience with synchronize_irq(), but the
function comment seems to indicate that as long as the interrupt is
pending synchronize_irq() will do what we want even if the CPU that
should handle the interrupt is in an irqsoff section.  Digging a
little bit I guess it relies upon the interrupt controller being able
to read this state, but (hopefully) the GIC can?

If it doesn't work like I think it does, I'd be OK with peeking in the
IRQ status register, but we shouldn't _skip_ the logic IMO.  As long
as we believe that an interrupt could happen in the future we
shouldn't return from handle_fifo_timeout().  It's impossible to
reason about how future transfers would work if the pending interrupt
from the previous transfer could fire at any point.

-Doug