Re: [PATCH v4] drm/i915: avoid processing spurious/shared interrupts in low-power states

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 24 Feb 2015 11:00:29 +0000

On Tue, Feb 24, 2015 at 11:42:20AM +0100, Daniel Vetter wrote:
> On Tue, Feb 24, 2015 at 09:29:15AM +0000, Chris Wilson wrote:
> > On Tue, Feb 24, 2015 at 11:14:30AM +0200, Imre Deak wrote:
> > > Atm, it's possible that the interrupt handler is called when the device
> > > is in D3 or some other low-power state. It can be due to another device
> > > that is still in D0 state and shares the interrupt line with i915, or on
> > > some platforms there could be spurious interrupts even without sharing
> > > the interrupt line. The latter case was reported by Klaus Ethgen using a
> > > Lenovo x61p machine (gen 4). He noticed this issue via a system
> > > suspend/resume hang and bisected it to the following commit:
> > > 
> > > commit e11aa362308f5de467ce355a2a2471321b15a35c
> > > Author: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>
> > > Date:   Wed Jun 18 09:52:55 2014 -0700
> > > 
> > >     drm/i915: use runtime irq suspend/resume in freeze/thaw
> > > 
> > > This is a problem, since in low-power states IIR will always read
> > > 0xffffffff resulting in an endless IRQ servicing loop.
> > > 
> > > Fix this by handling interrupts only when the driver explicitly enables
> > > them and so it's guaranteed that the interrupt registers return a valid
> > > value.
> > > 
> > > Note that this issue existed even before the above commit, since during
> > > runtime suspend/resume we never unregistered the handler.
> > > 
> > > v2:
> > > - clarify the purpose of smp_mb() vs. synchronize_irq() in the
> > >   code comment (Chris)
> > > 
> > > v3:
> > > - no need for an explicit smp_mb(), we can assume that synchronize_irq()
> > >   and the mmio read/writes in the install hooks provide for this (Daniel)
> > > - remove code comment as the remaining synchronize_irq() is self
> > >   explanatory (Daniel)
> > 
> > Why make the assumption though? I thought the comments documenting the
> > interaction between the irq enablements, the irq handler and shared
> > interrupts was beneficial. At the very least the assumption should be
> > made explicit through comments in the code - because I am not convinced
> > that a cached write will be flushed by an uncached write to another area
> > of memory. In particular, note that on the gen most troubled by rogue
> > irqs (gen4), we do not have any memory barriers in the mmio paths.
> 
> The synchronize_irq is a fairly massive barrier and I've figured the name
> is descriptive enough to make clear what's going on. At least I've felt
> any comment on top would be redundant.
> 
> Also the hard rule for adding comments to explicit barriers is mostly a
> reminder that you always need barriers on both sides, and the comment
> then must explain where the other side is in the code. Imo with
> synchronize_irq it's clear that the other side is the irq handler already.
> 
> What do you want to clarify on top of that in the comment?

The smp_mb() was being applied to the enable path, not the disable +
sync_irq path. Also we now have a large body of lore regarding interrupt
issues on gen4 but we haven't been documenting any of it.  At the bare
minimum we need to explain why we call sync_irq and how that interacts
with the intel_irqs_enabled() checks and shared interrupts.

Even perhaps adding intel_runtime_pm_enabled_irq() as a wrapper for the
irq handlers to cross-reference against the runtime_pm tricks -
otherwise, there is a good likelihood that in the years to come, we
might forget how we end up inside a disabled interrupt handler and why
we are not allowed to touch the hardware at that point.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx