Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr

Doug Anderson <dianders@xxxxxxxxxxxx> · Thu, 3 Dec 2020 08:40:46 -0800

Hi,

On Wed, Dec 2, 2020 at 11:45 PM Roja Rani Yarubandi
<rojay@xxxxxxxxxxxxxx> wrote:
>
> Here, there is a chance of race condition occurrence which leads to
> NULL pointer dereference with struct spi_geni_master member 'cur_xfer'
> between setup_fifo_xfer() and handle_fifo_timeout() functions.
>
> Fix this race condition with guarding the 'cur_xfer' where it gets updated,
> with spin_lock_irq/spin_unlock_irq in setup_fifo_xfer() as we do in
> handle_fifo_timeout() function.
>
> Call trace:
>  geni_spi_isr+0x114/0x34c
>  __handle_irq_event_percpu+0xe0/0x23c
>  handle_irq_event_percpu+0x34/0x8c
>  handle_irq_event+0x48/0x94
>  handle_fasteoi_irq+0xd0/0x140
>  __handle_domain_irq+0x8c/0xcc
>  gic_handle_irq+0x114/0x1dc
>  el1_irq+0xcc/0x180
>  geni_spi a80000.spi: Failed to cancel/abort m_cmd
>  dev_watchdog+0x348/0x354
>  call_timer_fn+0xc4/0x220
>  __run_timers+0x228/0x2d4
>  spi_master spi6: failed to transfer one message from queue
>  run_timer_softirq+0x24/0x44
>  __do_softirq+0x16c/0x344
>  irq_exit+0xa8/0xac
>  __handle_domain_irq+0x94/0xcc
>  gic_handle_irq+0x114/0x1dc
>  el1_irq+0xcc/0x180
>  cpuidle_enter_state+0xf8/0x204
>  cpuidle_enter+0x38/0x4c
>  cros-ec-spi spi6.0: spi transfer failed: -110

Thanks for the patch!

I'm not convinced about your explanation, though.

The overall assumption in setup_fifo_xfer(), as indicated by the big
comment at the start of the function, is that once the spin_lock_irq()
/ spin_unlock_irq() dance happens at the start of that function that
no more interrupts will come in until the later lock.  If that
assumption is not true then we need to look at the whole function, not
just bandaid where we're setting "cur_xfer".  Specifically if
transfers are still happening all the other things that
setup_fifo_xfer() is doing are also all bad ideas.

I guess the first question is: why are you even getting a timeout?
SPI is clocked by the master and there's nothing the slave can do to
delay a transfer.  If you're getting a timeout in the first place it
means there's something pretty wrong.  Maybe that would be the first
thing to look at.  I guess the most likely case is that something else
in the system is somehow blocking our interrupt handler from running?

The second question is: what can we do about the "Failed to
cancel/abort m_cmd" message above.  I guess if our interrupt handler
is blocked that would explain why the cancel and abort didn't work.
It's pretty hard to figure out what to do to recover at this point.
When the function exits it assumes that the transfer has been canceled
/ aborted, but it hasn't.  This seems like it has the ability to throw
everything for a loop.  If an interrupt can come in for the previous
transfer after handle_fifo_timeout() exits it could cause all sorts of
problems, right?

It seems like one thing that might help would be to add this to the
start of handle_fifo_timeout():
  dev_warn(mas->dev, "Timeout; synchronizing IRQ\n");
  synchronize_irq(mas->irq);

Maybe also add a synchronize_irq() at the end of the function too?  If
my theory is correct and the whole problem here is that our interrupt
is blocked, I think that will make us block, which is probably the
best we can do and better than just returning and getting the
interrupt later.

That all being said, looking at all this makes me believe that the
NULL dereference is because of commit 7ba9bdcb91f6 ("spi:
spi-geni-qcom: Don't keep a local state variable").  There, I added
"mas->cur_xfer = NULL" into handle_fifo_timeout().  That should be the
right thing to do (without that we might have been reading bogus data
/ writing to memory we shouldn't), but I think geni_spi_handle_rx()
and geni_spi_handle_tx() don't handle it properly.  I think we'll
dereference it without checking.  :(  It would probably be good to fix
this too?  I would guess that if "mas->cur_xfer" is NULL then
geni_spi_handle_rx() should read all data in the FIFO and throw it
away and geni_spi_handle_tx() should set SE_GENI_TX_WATERMARK_REG to
0.  NOTE: I _think_ that with the synchronize_irq() I'm suggesting
above we'll avoid this case, but it never hurts to be defensive.

Does that all make sense?  So the summary is that instead of your patch:

1. Add synchronize_irq() at the start and end of
handle_fifo_timeout().  Not under lock.

2. In geni_spi_handle_rx(), check for NULL "mas->cur_xfer".  Read all
data in the FIFO (don't cap at rx_rem_bytes), but throw it away.

3. In geni_spi_handle_tx(), check for NULL "mas->cur_xfer".  Don't
write any data.  Just write 0 to SE_GENI_TX_WATERMARK_REG.

I think #1 is the real fix, but #2 and #3 will avoid crashes in case
there's another bug somewhere.

-Doug