On 11 February 2018 at 09:38, Lukas Wunner <lukas@xxxxxxxxx> wrote: > Fix a deadlock on hybrid graphics laptops that's been present since 2013: > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > the poll worker invokes the DRM drivers' ->detect callbacks, which call > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > with the intention of runtime resuming the device afterwards. The result > is a circular wait between poll worker and autosuspend worker. > > I'm seeing this deadlock so often it's not funny anymore. I've also found > 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and > [4/5].) > > One theoretical solution would be to add a flag to the ->detect callback > to tell it that it's running in the poll worker's context. In that case > it doesn't need to call pm_runtime_get_sync() because the poll worker is > only enabled while runtime active and we know that ->runtime_suspend > waits for it to finish. But this approach would require touching every > single DRM driver's ->detect hook. Moreover the ->detect hook is called > from numerous other places, both in the DRM library as well as in each > driver, making it necessary to trace every possible call chain and check > if it's coming from the poll worker or not. I've tried to do that for > nouveau (see the call sites listed in the commit message of patch [3/5]) > and concluded that this approach is a nightmare to implement. > > Another reason for the infeasibility of this approach is that ->detect > is called from within the DRM library without driver involvement, e.g. > via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would > have to be called by the DRM library, but the DRM library is supposed to > stay generic, wheras runtime PM is driver-specific. > > So instead, I've come up with this much simpler solution which gleans > from the current task's flags if it's a workqueue worker, retrieves the > work_struct and checks if it's the poll worker. All that's needed is > one small helper in the workqueue code and another small helper in the > DRM library. This solution lends itself nicely to stable-backporting. > > The patches for radeon and amdgpu are compile-tested only, I only have a > MacBook Pro with an Nvidia GK107 to test. To test the patches, add an > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. > This ensures that the poll worker runs after ->runtime_suspend has begun. > Wait 12 sec after the GPU has begun runtime suspend, then check > /sys/bus/pci/devices/0000:01:00.0/power/runtime_status. Without this > series, the status will be stuck at "suspending" and you'll get hung task > errors in dmesg after a few minutes. > > i915, malidp and msm "solved" this issue by not stopping the poll worker > on runtime suspend. But this results in the GPU bouncing back and forth > between D0 and D3 continuously. That's a total no-go for GPUs which > runtime suspend to D3cold since every suspend/resume cycle costs a > significant amount of time and energy. (i915 and malidp do not seem > to acquire a runtime PM ref in the ->detect callbacks, which seems > questionable. msm however does and would also deadlock if it disabled > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) > > Please review. Thanks, > > Lukas > > Lukas Wunner (5): > workqueue: Allow retrieval of current task's work struct > drm: Allow determining if current task is output poll worker > drm/nouveau: Fix deadlock on runtime suspend > drm/radeon: Fix deadlock on runtime suspend > drm/amdgpu: Fix deadlock on runtime suspend > > drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +++++++++++++------- > drivers/gpu/drm/drm_probe_helper.c | 14 +++++ > drivers/gpu/drm/nouveau/nouveau_connector.c | 18 +++++-- > drivers/gpu/drm/radeon/radeon_connectors.c | 74 +++++++++++++++++--------- > include/drm/drm_crtc_helper.h | 1 + > include/linux/workqueue.h | 1 + > kernel/workqueue.c | 16 ++++++ > 7 files changed, 132 insertions(+), 50 deletions(-) > > -- > 2.15.1 > > _______________________________________________ > dri-devel mailing list > dri-devel@xxxxxxxxxxxxxxxxxxxxx > https://lists.freedesktop.org/mailman/listinfo/dri-devel I wasn't quite sure where to add that msleep. I've tested the patches as is on top of agd5f's wip branch without ill effects I've had a radeon and now a amdgpu PRIME setup and don't believe I've ever seen this issue If you could pop a patch together for the msleep I'll give it a test on amdgpu Cheers Mike _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel