On Mon, 25 Mar 2024 17:16:16 +0000 Liviu Dudau <liviu.dudau@xxxxxxx> wrote: > On Mon, Mar 25, 2024 at 02:57:04PM +0100, Boris Brezillon wrote: > > Make sure we set suspended=true last to avoid generating an irq storm > > in the unlikely case where an IRQ happens between the suspended=true > > assignment and the _INT_MASK update. > > > > v2: > > - New patch > > > > Reported-by: Steven Price <steven.price@xxxxxxx> > > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> > > --- > > drivers/gpu/drm/panthor/panthor_device.h | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h > > index 7ee4987a3796..3a930a368ae1 100644 > > --- a/drivers/gpu/drm/panthor/panthor_device.h > > +++ b/drivers/gpu/drm/panthor/panthor_device.h > > @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq) > > { \ > > int cookie; \ > > \ > > - atomic_set(&pirq->suspended, true); \ > > + pirq->mask = 0; \ > > I think you might still have a race between _irq_suspend() and _irq_threaded_handler() where the > status will be zero due to pirq->mask being zero, so no interrupt will be cleared but they will > be masked (kind of the opposite problem to patch 3/3). Right, but I'm trying to find a case where this is an issue. Yes, we might lose events, but at the same time, when _irq_suspend() is called, we are supposed to be idle, so all this mask=0 assignment does is speed-up the synchronization with the irq-thread. If there's anything we need to be done before suspending the IRQ, this should really use its own synchronization model. > > I'm starting to think that pirq->mask should be local to _irq_threaded_handler() and not be messed > with in the other functions. It kinda is, as we don't modify panthor_irq::mask outside the suspend/resume (and now unplug) path, and each of these accesses has a reason to exist: - in the resume path, we know all IRQs are masked, and we reset the SW-side mask to the interrupts we want to accept before updating _INT_MASK. No risk of race in that one - in the unplug path, I don't think we care about unhandled interrupts, because the device will become unusable after that point, so updating the panthor_irq::mask early and losing events should be okay. - the suspend case has been described above. As explained, I don't think it matters if we lose events there, because really, if there's any synchronization needed, it should have happened explicitly before _irq_suspend() is called. The synchronize_irq() we have is just here to make sure there's nothing accessing registers when we turn the device clk/power-domain off. > > > \ > > if (drm_dev_enter(&pirq->ptdev->base, &cookie)) { \ > > gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0); \ > > If you move the line above before the if condition, do you still need patch 3/3? The whole point of the drm_dev_enter/exit() section was to prevent access to registers after the device has been unplugged, so, if I move the gpu_write() outside of this block, I'd rather drop the entire drm_dev_enter/exit() section (both here and in _irq_resume()). That should be safe actually, as I don't expect the PM hooks or the reset handler to be called after the device and its resource have been removed, and those are the two only paths where _irq_suspend/resume() can be called.