On Sat, 24 Dec 2022 15:31:58 +1300 Daniel Beer <dlbeer@xxxxxxxxx> wrote: > On Fri, Dec 23, 2022 at 04:16:59PM +0000, Jonathan Cameron wrote: > > > ad_sigma_delta waits for a conversion which terminates with the firing > > > of a one-shot IRQ handler. In this handler, the interrupt is disabled > > > and a completion is set. > > > > > > Meanwhile, the thread that initiated the conversion is waiting on the > > > completion to know when the conversion happened. If this wait times out, > > > the conversion is aborted and IRQs are disabled. But the IRQ may fire > > > anyway between the time the completion wait times out and the disabling > > > of interrupts. If this occurs, we get a double-disabled interrupt. > > > > Ouch and good work tracking it down. just to check, did you see this > > bug happen in the wild or spotted by code inspection? > > Hi Jonathan, > > Thanks for reviewing. It was by inspection -- I'd originally thought > about it and fixed in in a similar way in this patch: > > https://lore.kernel.org/all/61dd3e0c.1c69fb81.cea15.8d98@xxxxxxxxxxxxx/ > > But since that's not applied, I thought I'd better put together a > separate fix for the time being. > > > Given that timeout generally indicates hardware failure, I'm not sure > > how critical this is to fix. > > Probably not very critical. I think you'd have to be pretty unlucky to > encounter it. > > > Is this fix sufficient? If the interrupt is being handled on a different > > CPU to the caller of this function, I think we can still race enough that > > this fails to fix it up. Might need a spinlock to prevent that. > > > > CPU 0 CPU 1 > > ad_sd_data_rdy_trig_poll() ad_sd_wait_and_disable() > > > > //wait_for_completion ends > > > > Interrupt > > disable_irq() > > if (sigma-delta->irq_dis) !true > > else > > sigma_delta->irq_dis = true > > > > disable_irq_nosync(irq) > > sigma_delta->irq_dis = true; > > > > So we still end up with a doubly disabled irq. Add a spinlock to make the > > disable and the setting of sigma_delta->irq_dis atomic then it should all be fine. > > My understanding is that the suffix-less version of disable_irq would > wait for all running handlers on other CPUs (i.e. > ad_sd_data_rdy_trig_poll) to finish before proceeding, which would > prevent this from happening. Is that not the case? Ah. I'd missed that - it is indeed documented as waiting for pending irq handlers so that race doesn't exist. > > But now that you mention it, there is another small problem: in the case > where the conversion doesn't time out, the interrupt handler will call > complete() and then perform some operations on the struct > ad_sigma_delta. > > This is always ok on a single processor, but if there are multiple CPUs > there is possibly a brief period where both the interrupt handler and > the waiting thread are accessing the ad_sigma_delta struct without > synchronization between them. > > Not sure if that's really a problem in practice, but I think an easy way > to rule it out would just be to move the complete() call to the bottom > of the handler and make sure it doesn't touch the structure again after > that. Hmm. At first glance, you are correct that moving completion later probably makes sense. There may be some subtleties in that ordering though so I definitely want some feedback from those more familiar with this driver than I am before taking this patch or that further change. Jonathan > > Cheers, > Daniel >