At the moment we can end up in the GPU reset deferred work in D3 state
if the last runtime PM reference is dropped between detecting a hang /
scheduling the work and executing the work. At least one such case I
could trigger is the simulated reset via the i915_wedged debugfs entry.
Fix this by holding a runtime PM reference from before scheduling the
work until the end of the work.

v2:
- Instead of getting/putting the RPM reference in the reset work itself,
  get it already before scheduling the work. By this we also prevent
  going to D3 before the work gets to run, in addition to making sure
  that we run the work itself in D0. (Ville, Daniel)

Signed-off-by: Imre Deak <imre.deak@xxxxxxxxx>
---
 drivers/gpu/drm/i915/i915_dma.c |  8 +++++++-
 drivers/gpu/drm/i915/i915_irq.c | 21 ++++++++++++++++++++-
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 0b38f88..6398280 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1823,7 +1823,13 @@ int i915_driver_unload(struct drm_device *dev)
 
 	/* Free error state after interrupts are fully disabled. */
 	del_timer_sync(&dev_priv->gpu_error.hangcheck_timer);
-	cancel_work_sync(&dev_priv->gpu_error.work);
+	if (!cancel_work_sync(&dev_priv->gpu_error.work))
+		/*
+		 * The following won't make any difference in the PM state,
+		 * since RPM is disabled already, but do it still for
+		 * consistency.
+		 */
+		intel_runtime_pm_put(dev_priv);
 	i915_destroy_error_state(dev);
 
 	if (dev->pdev->msi_enabled)
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a651d0d..5e079d8 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2210,6 +2210,9 @@ static void i915_error_work_func(struct work_struct *work)
 		 */
 		i915_error_wake_up(dev_priv, true);
 	}
+
+	/* Drop the ref we took when scheduling this work. */
+	intel_runtime_pm_put(dev_priv);
 }
 
 static void i915_report_and_clear_eir(struct drm_device *dev)
@@ -2353,8 +2356,24 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
 	 * state of outstanding pagelips). Hence it must not be run on our own
 	 * dev-priv->wq work queue for otherwise the flush_work in the pageflip
 	 * code will deadlock.
+	 *
+	 * It's guaranteed that here we are in D0 state, since we can only get
+	 * here via one of the following paths:
+	 * - From an IRQ handler's error detection. -> The driver must make
+	 *   sure that IRQs are unmasked only while holding an RPM ref.
+	 * - From hang-check due to a blocked request. -> The request holds an
+	 *   RPM ref, that's only released in i915_gpu_idle() which in turn
+	 *   won't be called until the request is finished.
+	 * - From hang-check due to a flip hang. -> We have an RPM ref because
+	 *   of the active modeset.
+	 * - From debugfs i915_wedged_set(). -> The caller takes an explicit
+	 *   RPM ref.
+	 * Take here an atomic RPM ref still to make sure that we don't
+	 * re-enter D3 until the error work gets to run and completes the
+	 * reset.
 	 */
-	schedule_work(&dev_priv->gpu_error.work);
+	if (schedule_work(&dev_priv->gpu_error.work))
+		intel_runtime_pm_get_noresume(dev_priv);
 }
 
 static void __always_unused i915_pageflip_stall_check(struct drm_device *dev, int pipe)
-- 
1.8.4

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx