On Fri, Dec 21, 2012 at 4:56 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> This thing isn't repeatable, and it can go days without happening, but
> it has happened several times now over the last several weeks, to the
> point where it is very annoying.
>
> I get:
>
> [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm] capturing error event; look for more information in
> /debug/dri/0/i915_error_state
> [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
> [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
> [drm:i915_reset] *ERROR* Failed to reset chip.
>
> and then I need to reboot, because restarting X just causes it to be
> slow and unaccelerated.
>
> I'm attaching the i915_error_state thing, although I suspect it's
> useless, since I got it after an X restart. But maybe it shows why
> even the X restart doesn't do anything.
>
> This is a Westmere setup: it's a
>
> Intel(R) Core(TM) i5 CPU 670 @ 3.47GHz
>
> and dmesg doesn't have anything interesting in it at all. Running
> up-to-date Fedora 17.

Yeah, looks like the ilk death which somehow became much easier to hit
in 3.7. Bug to track the various reporters is at

https://bugs.freedesktop.org/show_bug.cgi?id=55984

> Any ideas about anything in particular I can do to trigger it and help
> debug it? There's usually nothing special going on when this happens.
> This last one was during a kernel build, but the screen was actually
> locked (and I don't even have a fancy screensaver, it's just a blank
> black screen for me).

As far as we know it requires light gpu load (desktop, 3d compositor
seems to help to hit it) combined with some form of memory/io load.
Kernel compiles, massive svn checkouts or just filling the pagecache
with crap and cleaning it up again seem to be good at triggering it.

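To approximate the "fill the pagecache and clean it up again" load described above, here is a minimal Python sketch. It is only an illustration, not a known reproducer: the function name churn_pagecache, the 1 MiB chunk size and the 64 MiB run size are all made up for the example. It also assumes the chosen directory sits on a disk-backed filesystem, since a tmpfs /tmp may not behave the same way.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per write/read (arbitrary choice)


def churn_pagecache(total_mib, directory=None):
    """Write total_mib MiB of junk to a temp file, read it back, delete it.

    Hypothetical helper: returns the number of bytes read back.
    """
    junk = os.urandom(CHUNK)
    bytes_read = 0
    # directory=None uses the default temp dir; pass a path on a real disk
    # if the default is tmpfs.
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(total_mib):
            f.write(junk)
        f.flush()
        f.seek(0)
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            bytes_read += len(data)
    # Leaving the with-block closes and deletes the file, so the kernel is
    # free to discard the pages it cached for it.
    return bytes_read


if __name__ == "__main__":
    # Made-up size; something closer to the machine's free RAM would put
    # more pressure on the shrinker.
    print(churn_pagecache(64))
```

Running this in a loop next to a light gpu load (for example a composited desktop) would roughly match the conditions described; whether it actually provokes the hang on a given machine is untested.
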
Progress in debugging has been extremely slow, especially after we've
disabled rc6 on ilk (3.7 attempted to enable it by default): without
rc6, none of the local machines we devs have here can reproduce the bug
any more, and the rc6 hangs have a slightly different hang signature,
so there's a decent chance it's a different bug. We suspect it's an old
one, just somehow made much easier to hit by the changes in 3.7.

Chris has some good evidence already that the hw has stricter alignment
requirements than what we implement, but patches are only just now
gathering testing feedback.

And after a few weeks of searching for the loose piece of duct tape so
that we can reattach it, I've finally found a slight change in our
shrinker behaviour which might mitigate things. The patch has been in
testing for a few days and hasn't blown up yet, which is a record thus
far. So I'm still hopeful that I can forge this quick hack into a real
patch to reapply the lost duct tape.

> Other times, it's just normal desktop. Quite often it is during a
> kernel compile, with loads in the 30+ range, so maybe it's triggered
> by high loads resulting in some program not being hugely responsive
> (maybe losing the drm state?) but quite frankly, I do a *lot* of
> kernel compiles especially during the merge window, so the "it
> happened during a kernel compile" is not necessarily indicative of any
> deeper causation - it's just that compiling kernels is what I do ;)
>
> I've gotten hangcheck timers over the years, but it really seems to
> have been getting worse. Please help. If the reset worked and it would
> clear up after I just logged out and back in again, that would already
> be a big thing.

GPU reset state transitions should be much less racy in 3.8 than in
3.7, but that doesn't help if the gpu hangs too quickly.
You can try to resurrect it in that case with

# echo 0 > /sys/kernel/debug/dri/0/i915_wedged

We can't really disable that "hanging too fast" check, since it helps
tremendously in preventing gpu hangs from spiralling into certain death
of the entire system, and so is important for debugging.

Other workarounds to mitigate things are:
- Don't use a 3d compositor; the ddx should transparently fall back to
  sw rendering when the gpu is permanently declared dead.
- Using SNA acceleration instead of UXA in the ddx seems to mitigate
  the hangs a lot (Option "AccelMethod" "SNA" in an xorg.conf snippet).
  We suspect it's due to the different layout of the indirect state
  objects: UXA has an entire tree of those, while SNA packs them into
  one single gem object, which seems to tremendously reduce the chances
  that the gpu/kernel barfs on them and so hangs the gpu.

Oh, and in case you get the urge to flame your drm/i915 maintainer to a
crisp over this disaster (it's one imo), I'll be offline the next 2
weeks, snowboarding in the Swiss Alps ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel