On Thu, Oct 17, 2013 at 09:41:09AM +0200, Takashi Iwai wrote:
> At Wed, 16 Oct 2013 18:27:33 +0100,
> Chris Wilson wrote:
> >
> > On Wed, Oct 16, 2013 at 10:06:27AM -0700, Ben Widawsky wrote:
> > > On Wed, Oct 16, 2013 at 05:58:31PM +0100, Chris Wilson wrote:
> > > > So clearing the valid bit should result in the GPU reporting errors for
> > > > delayed accesses, but none were reported?
> > >
> > > So I can't actually reproduce the problem for some reason. Paulo will
> > > need to answer. One theory is the fault information is lost on suspend.
> > >
> > > The original patch put faults both in suspend, and resume. After this, I
> > > asked Paulo to wedge the GPU, and there I saw faults.
> >
> > If we can capture the error, and it should be very possible to do so, we
> > should be able to pinpoint the cause quite quickly. If it is just deferred
> > writes, it should also be a problem across module unload - which should
> > be easier for getting debug information out.
>
> The bug is only about S4, thus it's not so easy to capture anything in
> the resume kernel, as all lost after transition to the restored
> kernel.
>
> BTW, I also suspect that the similar problem might still happen in
> other cases, e.g. via kexec even with this patch.

How are devices idled (or suspended) prior to hibernate resume or kexec?
From my reading, i915_drm_freeze() should be called before the resume
image is executed.

What we can do is to make the first action of i915_driver_unload() be
i915_drm_freeze(), then clear the PTE valid bits and wait a second or two
for a GPU fault before proceeding with an unload. By doing that we can
debug our suspend paths - all that remains is the possibility of rogue
hardware state. And that should show up by breaking module load.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
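
To make the ordering of that unload-time check concrete, here is a rough userspace model of it. This is only a sketch and not driver code: struct mock_gpu, PTE_VALID as bit 0, and every mock_* function are hypothetical stand-ins invented for illustration, and the "stray access" field merely simulates a deferred GPU write that the freeze path failed to quiesce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_VALID 0x1u /* assumption: bit 0 marks a PTE as valid */

/* Hypothetical stand-in for device state; not the real i915 structures. */
struct mock_gpu {
    uint32_t *ptes;      /* mock page table entries */
    size_t num_ptes;
    bool frozen;         /* set once the freeze step has run */
    bool fault_pending;  /* mock of a GPU fault status bit */
    bool has_stray;      /* simulate a deferred access after freeze? */
    size_t stray_idx;    /* PTE index the deferred access goes through */
};

/* Stand-in for the freeze step (i915_drm_freeze() in the proposal). */
void mock_freeze(struct mock_gpu *gpu)
{
    gpu->frozen = true;
}

/* Clear the valid bit of every PTE, leaving the other bits untouched. */
void mock_clear_pte_valid(struct mock_gpu *gpu)
{
    for (size_t i = 0; i < gpu->num_ptes; i++)
        gpu->ptes[i] &= ~PTE_VALID;
}

/* Model a GPU access through PTE idx: it faults if that PTE is invalid. */
void mock_gpu_access(struct mock_gpu *gpu, size_t idx)
{
    if (idx < gpu->num_ptes && !(gpu->ptes[idx] & PTE_VALID))
        gpu->fault_pending = true;
}

/* Poll the mock fault bit up to max_polls times. A real driver would
 * sleep between polls to cover the "second or two" window. */
bool mock_wait_for_fault(const struct mock_gpu *gpu, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++)
        if (gpu->fault_pending)
            return true;
    return false;
}

/* Proposed order: freeze first, then clear the valid bits, then watch for
 * a fault. Returns true if a fault was seen, i.e. something was still
 * accessing memory after the freeze. */
bool mock_unload_debug(struct mock_gpu *gpu, unsigned max_polls)
{
    assert(gpu != NULL);
    mock_freeze(gpu);
    mock_clear_pte_valid(gpu);
    if (gpu->has_stray) /* simulated deferred write arriving late */
        mock_gpu_access(gpu, gpu->stray_idx);
    return mock_wait_for_fault(gpu, max_polls);
}
```

In this model a true return from mock_unload_debug() corresponds to the case where the suspend path left the GPU active, which is the situation the proposed check is meant to expose. A false return says nothing about rogue hardware state; as noted above, that would be expected to show up on module load instead.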