Re: [PATCH] ad_sigma_delta: fix race between IRQ and completion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, 24 Dec 2022 15:31:58 +1300
Daniel Beer <dlbeer@xxxxxxxxx> wrote:

> On Fri, Dec 23, 2022 at 04:16:59PM +0000, Jonathan Cameron wrote:
> > > ad_sigma_delta waits for a conversion which terminates with the firing
> > > of a one-shot IRQ handler. In this handler, the interrupt is disabled
> > > and a completion is set.
> > > 
> > > Meanwhile, the thread that initiated the conversion is waiting on the
> > > completion to know when the conversion happened. If this wait times out,
> > > the conversion is aborted and IRQs are disabled. But the IRQ may fire
> > > anyway between the time the completion wait times out and the disabling
> > > of interrupts. If this occurs, we get a double-disabled interrupt.  
> > 
> > Ouch and good work tracking it down.  just to check, did you see this
> > bug happen in the wild or spotted by code inspection?  
> 
> Hi Jonathan,
> 
> Thanks for reviewing. It was by inspection -- I'd originally thought
> about it and fixed in in a similar way in this patch:
> 
>     https://lore.kernel.org/all/61dd3e0c.1c69fb81.cea15.8d98@xxxxxxxxxxxxx/
> 
> But since that's not applied, I thought I'd better put together a
> separate fix for the time being.
> 
> > Given that timeout generally indicates hardware failure, I'm not sure
> > how critical this is to fix.  
> 
> Probably not very critical. I think you'd have to be pretty unlucky to
> encounter it.
> 
> > Is this fix sufficient?  If the interrupt is being handled on a different
> > CPU to the caller of this function, I think we can still race enough that
> > this fails to fix it up.  Might need a spinlock to prevent that.
> > 
> >   CPU 0                                        CPU 1
> > ad_sd_data_rdy_trig_poll()               ad_sd_wait_and_disable()
> >                                        
> >                                          //wait_for_completion ends
> > 					
> > Interrupt
> >                                           disable_irq()
> > 					  if (sigma-delta->irq_dis) !true	
> > 					  else
> > 						sigma_delta->irq_dis = true
> > 
> > disable_irq_nosync(irq)
> > sigma_delta->irq_dis = true;
> > 
> > So we still end up with a doubly disabled irq.  Add a spinlock to make the
> > disable and the setting of sigma_delta->irq_dis atomic then it should all be fine.                  
> 
> My understanding is that the suffix-less version of disable_irq would
> wait for all running handlers on other CPUs (i.e.
> ad_sd_data_rdy_trig_poll) to finish before proceeding, which would
> prevent this from happening. Is that not the case?

Ah. I'd missed that - it is indeed documented as waiting for
pending irq handlers so that race doesn't exist.


> 
> But now that you mention it, there is another small problem: in the case
> where the conversion doesn't time out, the interrupt handler will call
> complete() and then perform some operations on the struct
> ad_sigma_delta.
> 
> This is always ok on a single processor, but if there are multiple CPUs
> there is possibly a brief period where both the interrupt handler and
> the waiting thread are accessing the ad_sigma_delta struct without
> synchronization between them.
> 
> Not sure if that's really a problem in practice, but I think an easy way
> to rule it out would just be to move the complete() call to the bottom
> of the handler and make sure it doesn't touch the structure again after
> that.

Hmm. At first glance, you are correct that moving completion later probably
makes sense. There may be some subtleties in that ordering though so I definitely want
some feedback from those more familiar with this driver than I am before
taking this patch or that further change.

Jonathan


> 
> Cheers,
> Daniel
> 




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Input]     [Linux Kernel]     [Linux SCSI]     [X.org]

  Powered by Linux