On Tue, 26 Mar 2024 09:58:17 +0000 Liviu Dudau <liviu.dudau@xxxxxxx> wrote: > On Mon, Mar 25, 2024 at 07:02:13PM +0100, Boris Brezillon wrote: > > On Mon, 25 Mar 2024 17:16:16 +0000 > > Liviu Dudau <liviu.dudau@xxxxxxx> wrote: > > > > > On Mon, Mar 25, 2024 at 02:57:04PM +0100, Boris Brezillon wrote: > > > > Make sure we set suspended=true last to avoid generating an irq storm > > > > in the unlikely case where an IRQ happens between the suspended=true > > > > assignment and the _INT_MASK update. > > > > > > > > v2: > > > > - New patch > > > > > > > > Reported-by: Steven Price <steven.price@xxxxxxx> > > > > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> > > > > --- > > > > drivers/gpu/drm/panthor/panthor_device.h | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h > > > > index 7ee4987a3796..3a930a368ae1 100644 > > > > --- a/drivers/gpu/drm/panthor/panthor_device.h > > > > +++ b/drivers/gpu/drm/panthor/panthor_device.h > > > > @@ -325,7 +325,7 @@ static inline void panthor_ ## __name ## _irq_suspend(struct panthor_irq *pirq) > > > > { \ > > > > int cookie; \ > > > > \ > > > > - atomic_set(&pirq->suspended, true); \ > > > > + pirq->mask = 0; \ > > > > > > I think you might still have a race between _irq_suspend() and _irq_threaded_handler() where the > > > status will be zero due to pirq->mask being zero, so no interrupt will be cleared but they will > > > be masked (kind of the opposite problem to patch 3/3). > > > > Right, but I'm trying to find a case where this is an issue. Yes, we > > might lose events, but at the same time, when _irq_suspend() is called, > > we are supposed to be idle, so all this mask=0 assignment does is > > speed-up the synchronization with the irq-thread. If there's anything > > we need to be done before suspending the IRQ, this should really use > > its own synchronization model. > > Perf counter collection before going idle ;) That's typically a case of synchronization that should be done explicitly rather than relying on the synchronize_irq() we have in _irq_suspend(). > > I know that's not yet on your radar, but we need to keep it in mind to > test that scenario when we add support for keeping runtime PM alive > during perf counters dumping. It's definitely on our radar, but I don't think _irq_suspend() is the right way to handle perfcnt dump synchronization. IMHO, it deserves its own _perfcnt_suspend() function that's called before _fw_suspend() to make sure any in-flight perfcnt dumps get flushed before the JOB/FW irq suspend happens. > > > > > > > > > I'm starting to think that pirq->mask should be local to _irq_threaded_handler() and not be messed > > > with in the other functions. > > > > It kinda is, as we don't modify panthor_irq::mask outside the > > suspend/resume (and now unplug) path, and each of these accesses has a > > reason to exist: > > > > - in the resume path, we know all IRQs are masked, and we reset the > > SW-side mask to the interrupts we want to accept before updating > > _INT_MASK. No risk of race in that one > > - in the unplug path, I don't think we care about unhandled interrupts, > > because the device will become unusable after that point, so updating > > the panthor_irq::mask early and losing events should be okay. > > - the suspend case has been described above. As explained, I don't think > > it matters if we lose events there, because really, if there's any > > synchronization needed, it should have happened explicitly before > > _irq_suspend() is called. The synchronize_irq() we have is just here > > to make sure there's nothing accessing registers when we turn the > > device clk/power-domain off. > > > > > > > > > \ > > > > if (drm_dev_enter(&pirq->ptdev->base, &cookie)) { \ > > > > gpu_write(pirq->ptdev, __reg_prefix ## _INT_MASK, 0); \ > > > > > > If you move the line above before the if condition, do you still need patch 3/3? > > > > The whole point of the drm_dev_enter/exit() section was to prevent > > access to registers after the device has been unplugged, so, if I move > > the gpu_write() outside of this block, I'd rather drop the entire > > drm_dev_enter/exit() section (both here and in _irq_resume()). That > > should be safe actually, as I don't expect the PM hooks or the reset > > handler to be called after the device and its resource have been > > removed, and those are the two only paths where _irq_suspend/resume() > > can be called. > > Agree. It feels strange to care a lot about preventing access to registers > (as we should) and then going and adding the unplug function which replaces > the suspend in 3 out of 7 cases. New contributors might get confused about > which one to use in the future. Actually, irq_unplug() should be used in the _unplug() path, as the name implies, I don't think that's confusing :P. My point was more on the fact we probably don't need the dev_enter/exit() sections in the first place, so why introduce a new helper if we can make the _irq_suspend() helper simpler and re-usable for the _unplug() case? > > What we should do is rename suspend to unplug and move the drm_dev_enter/exit() > condition at the calling point for the cases where it makes sense. That'd be even more confusing IMO, why would you call an helper named _unplug() if all you want to do is temporarily suspend the IRQ?