On Wed, Nov 23, 2011 at 03:03:43PM +0000, David Woodhouse wrote:
> On Wed, 2011-11-23 at 15:39 +0100, Daniel Vetter wrote:
> > At least for the dmar+gfx+semaphores hang I can reproduce, just disabling
> > dmar with intel_iommu=igfx_off is not good enough and iirc the same holds
> > for the dmar+rc6 hangs reported.
>
> Um... let me restate that for clarity (and partly for Rajesh's benefit).
>
> The DMAR associated with the integrated graphics is *disabled*.
> Turned off. Not active. Ever.
>
> You have a problem when you enable the *other* DMAR units in the system,
> which should not be affecting the graphics device in any way.
>
> When you do this, you see 'hangs' with semaphores and RC6. Is there a
> better description of these 'hangs' somewhere? Is the hardware
> completely locked?
>
> These hangs go away when you disable the DMAR units. Again, that is the
> *other* DMAR units in the system that have nothing to do with graphics.
>
> While I'm getting quite used to DMAR-related errata, this one does make
> me stop and think 'wtf?'. It just seems so incongruous that disabling an
> *unrelated* IOMMU would make the problem go away, and it makes me wonder
> if it's actually a timing-related issue which is always there, but
> something about the use of DMAR for network/disk/etc. makes it more
> likely to trigger?
>
> We definitely need the hardware folks to get to the bottom of this one.

Ok, let me document the recipe I use to hang my box here. It's about the
dmar+semaphores hang I can reproduce, so it might differ slightly in its
actual cause from the dmar+rc6 bug (for that one we only have bug reports
talking about hard freezes requiring power cycling).

- Grab a GT2+ mobile snb (both my machine and the only other reporter's
  machine fit this, so maybe it matters). pci rev 09 (i.e. first
  production silicon).
- Install fc15 with the kde4 spin. I can't reproduce it with any other
  userspace than kde4.
- Grab latest d-i-f from Keith and latest userspace graphics code (to
  avoid hitting any other snb hangs we've tracked down meanwhile).
- Compile kernel with dmar and enable VT-d in the bios.
- Log into the system with gdm; the machine usually dies within a few
  seconds (while kde4 loads). If that's not good enough, a few minutes of
  light desktop usage will kill it.
- Wait 2 minutes for the stuck-in-atomic detection logic to kick in and
  grab the backtrace over netconsole. Notice that the kernel is stuck
  trying to flush the dmar tlb cache (that's how I managed to track it
  down to a dmar interaction). The backtrace is almost identical to the
  dmar issue on ilk. I've lost the backtrace; if you want I can regrab it.

Things I've tried that don't work around the issue:
- Disable dmar for the igfx with intel_iommu=igfx_off
- Apply the ilk workaround (i.e. synchronous dmar tlb flushes + gpu idling
  while flushing).

Things that work:
- Disabling semaphores.
- Disabling dmar in either the bios or on the cmdline with intel_iommu=off

All reporters that tried confirmed that igfx_off is not good enough, only
fully disabling dmar works (for both the semaphores and the rc6 related
hangs).

Things that look interesting:
- ppgtt support (i.e. using per-process pagetables on the gfx instead of
  the global gtt) seems to paper over the issue for the original reporter
  of the semaphore related hangs. Unfortunately not for me, the gpu still
  hangs (but doesn't take down the entire system with it). I've not yet
  investigated this one closely. Fyi, the windows driver uses ppgtt
  unconditionally on snb. Also, ppgtt seems to have no effect for at least
  one report of dmar related rc6 hangs.

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48
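
For anyone collecting reports against the recipe above, it may help to
record which intel_iommu= setting a machine actually booted with. The
following is only a rough sketch, not part of the original recipe: the
function names are made up for illustration, and it only looks at the
kernel command line, so a BIOS-level VT-d disable or a kernel built
without dmar will not show up here.

```python
def intel_iommu_options(cmdline):
    """Return the values given to every intel_iommu= token, in order.

    The kernel accepts a comma-separated list (e.g. "on,igfx_off"),
    so each token is split on commas.
    """
    options = []
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            value = token.split("=", 1)[1]
            options.extend(opt for opt in value.split(",") if opt)
    return options


def dmar_fully_disabled(cmdline):
    """True only if the command line explicitly ends up at intel_iommu=off.

    igfx_off alone does not count: per the reports above it is not enough
    to avoid the hangs. The last explicit on/off is assumed to win.
    """
    state = None
    for opt in intel_iommu_options(cmdline):
        if opt in ("on", "off"):
            state = opt
    return state == "off"


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    with open("/proc/cmdline") as f:
        running = f.read()
    print("intel_iommu options:", intel_iommu_options(running))
    print("dmar explicitly off:", dmar_fully_disabled(running))
```

A False result only means dmar was not turned off on the command line; it
says nothing about the bios setting or the kernel's built-in default.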