At Thu, 17 Oct 2013 10:24:07 +0100,
Chris Wilson wrote:
>
> On Thu, Oct 17, 2013 at 09:41:09AM +0200, Takashi Iwai wrote:
> > At Wed, 16 Oct 2013 18:27:33 +0100,
> > Chris Wilson wrote:
> > >
> > > On Wed, Oct 16, 2013 at 10:06:27AM -0700, Ben Widawsky wrote:
> > > > On Wed, Oct 16, 2013 at 05:58:31PM +0100, Chris Wilson wrote:
> > > > > So clearing the valid bit should result in the GPU reporting errors for
> > > > > delayed accesses, but none were reported?
> > > >
> > > > So I can't actually reproduce the problem for some reason. Paulo will
> > > > need to answer. One theory is the fault information is lost on suspend.
> > > >
> > > > The original patch put faults both in suspend, and resume. After this, I
> > > > asked Paulo to wedge the GPU, and there I saw faults.
> > >
> > > If we can capture the error, and it should be very possible to do so, we
> > > should be able to pinpoint the cause quite quickly. If it is just deferred
> > > writes, it should also be a problem across module unload - which should
> > > be easier for getting debug information out.
> >
> > The bug is only about S4, thus it's not so easy to capture anything in
> > the resume kernel, as all lost after transition to the restored
> > kernel.
> >
> > BTW, I also suspect that the similar problem might still happen in
> > other cases, e.g. via kexec even with this patch.
>
> How are devices idled (or suspended) prior to hibernate resume or kexec?
> From my reading, i915_drm_freeze() should be called before the resume
> image is executed.

I also didn't follow the complete (and complex) flow, but from my
understanding:

S4 case: hibernation_restore() in kernel/power/hibernate.c calls
dpm_suspend_start(PMSG_QUIESCE), which invokes pm->freeze in the end.
Since there is no pm->freeze_noirq, dpm_suspend_end(PMSG_QUIESCE) in
resume_target_kernel() shouldn't matter.

kexec case: it's usually the shutdown ops, called from
kernel_restart_prepare() -> device_shutdown().  So it's the same as the
normal shutdown.
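
To make the two paths above a bit more concrete, here is a rough
userspace sketch in C.  It is only a model under my assumptions: the
model_* names and the EV_* enum are made up for illustration and are
not kernel symbols.  It shows just two things: the freeze and quiesce
events both resolve to the freeze callback, and a device whose driver
has no shutdown op is left untouched by the shutdown path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: names echo the kernel's, but this is not kernel code. */
enum model_event { EV_SUSPEND, EV_FREEZE, EV_QUIESCE };

struct model_device;
typedef void (*model_cb)(struct model_device *);

struct model_device {
    bool active;        /* true while the hardware may still touch memory */
    model_cb suspend;
    model_cb freeze;
    model_cb shutdown;  /* NULL models a driver without shutdown ops */
};

/* Stand-in for a driver callback that idles the hardware. */
void model_quiesce(struct model_device *dev)
{
    dev->active = false;
}

/* Pick the callback for an event: FREEZE and QUIESCE share ->freeze. */
model_cb model_pm_op(const struct model_device *dev, enum model_event ev)
{
    switch (ev) {
    case EV_SUSPEND:
        return dev->suspend;
    case EV_FREEZE:
    case EV_QUIESCE:
        return dev->freeze;
    }
    return NULL;
}

/* Models the suspend-start step: run the chosen callback, if any. */
void model_dpm_suspend_start(struct model_device *dev, enum model_event ev)
{
    model_cb cb;

    assert(dev != NULL);
    cb = model_pm_op(dev, ev);
    if (cb)
        cb(dev);
}

/* Models the shutdown path: a device without a shutdown op is skipped. */
void model_device_shutdown(struct model_device *dev)
{
    assert(dev != NULL);
    if (dev->shutdown)
        dev->shutdown(dev);
}
```

With a device that only sets .freeze to model_quiesce,
model_device_shutdown() leaves it active, while
model_dpm_suspend_start() with EV_QUIESCE idles it.
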
When the KEXEC_PRESERVE_CONTEXT flag is set (where it works like
suspend/resume), dpm_suspend_start(PMSG_FREEZE) will be called, which
again invokes pm->freeze.

The i915 driver has no shutdown ops, which is good (we'd like to see
the messages), but this means the device is still active in a normal
kexec until the very last stage, I'm afraid.

> What we can do is to make the first action of
> i915_driver_unload() be i915_drm_freeze(), then clear the PTE valid
> bits and wait a second or two for a GPU fault before proceeding with an
> unload. By doing that we can debug our suspend paths - all that remains
> is the possibility of rogue hardware state. And that should show up by
> breaking module load.

Well, I somehow think the problem happens at the transition to the
restored image, where we have completely different memory maps from
the boot kernel, and this leads to memory corruption in the /proc
dcache or such.  In the module unload / reload case, the rest of memory
is preserved, so it's a fairly different situation.

Of course, I'm not against testing this at all.  Just trying to
understand what's going on...


thanks,

Takashi
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx