On Tue, Feb 24, 2015 at 11:42:20AM +0100, Daniel Vetter wrote: > On Tue, Feb 24, 2015 at 09:29:15AM +0000, Chris Wilson wrote: > > On Tue, Feb 24, 2015 at 11:14:30AM +0200, Imre Deak wrote: > > > Atm, it's possible that the interrupt handler is called when the device > > > is in D3 or some other low-power state. It can be due to another device > > > that is still in D0 state and shares the interrupt line with i915, or on > > > some platforms there could be spurious interrupts even without sharing > > > the interrupt line. The latter case was reported by Klaus Ethgen using a > > > Lenovo x61p machine (gen 4). He noticed this issue via a system > > > suspend/resume hang and bisected it to the following commit: > > > > > > commit e11aa362308f5de467ce355a2a2471321b15a35c > > > Author: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx> > > > Date: Wed Jun 18 09:52:55 2014 -0700 > > > > > > drm/i915: use runtime irq suspend/resume in freeze/thaw > > > > > > This is a problem, since in low-power states IIR will always read > > > 0xffffffff resulting in an endless IRQ servicing loop. > > > > > > Fix this by handling interrupts only when the driver explicitly enables > > > them and so it's guaranteed that the interrupt registers return a valid > > > value. > > > > > > Note that this issue existed even before the above commit, since during > > > runtime suspend/resume we never unregistered the handler. > > > > > > v2: > > > - clarify the purpose of smp_mb() vs. synchronize_irq() in the > > > code comment (Chris) > > > > > > v3: > > > - no need for an explicit smp_mb(), we can assume that synchronize_irq() > > > and the mmio read/writes in the install hooks provide for this (Daniel) > > > - remove code comment as the remaining synchronize_irq() is self > > > explanatory (Daniel) > > > > Why make the assumption though? I thought the comments documenting the > > interaction between the irq enablements, the irq handler and shared > > interrupts was beneficial. At the very least the assumption should be > > made explicit through comments in the code - because I am not convinced > > that a cached write will be flushed by an uncached write to another area > > of memory. In particular, note that on the gen most troubled by rogue > > irqs (gen4), we do not have any memory barriers in the mmio paths. > > The synchronize_irq is a fairly massive barrier and I've figured the name > is descriptive enough to make clear what's going on. At least I've felt > any comment on top would be redundant. > > Also the hard rule for adding comments to explicit barriers is mostly a > reminder that you always need barriers on both sides, and the comment > then must explain where the other side is in the code. Imo with > synchronize_irq it's clear that the other side is the irq handler already. > > What do you want to clarify on top of that in the comment? The smp_mb() was being applied to the enable path, not the disable + sync_irq path. Also we now have a large body of lore regarding interrupt issues on gen4 but we haven't been documenting any of it. At the bare minimum we need to explain why we call sync_irq and how that interacts with the intel_irqs_enabled() checks and shared interrupts. Even perhaps adding intel_runtime_pm_enabled_irq() as a wrapper for the irq handlers to cross-reference against the runtime_pm tricks - otherwise, there is a good likelihood that in the years to come, we might forget how we end up inside a disabled interrupt handler and why we are not allowed to touch the hardware at that point. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/intel-gfx