Recently, I debugged a few device crashes which occurred during recovery
after a hangcheck timeout. It looks like there are a few things we can do
to improve our chance of a successful gpu recovery.

The first is to ensure that the CX GDSC collapses, which clears the
internal states in the gpu's CX domain. The first 5 patches try to handle
this.

The rest of the patches ensure that a few internal blocks like CP, GMU and
GBIF are halted properly before proceeding to a snapshot followed by
recovery. They also handle the 'prepare slumber' hfi failure correctly.
These are A6x specific improvements.

This series is rebased on top of [1], which is based on Linus's master
branch.

[1] https://patchwork.freedesktop.org/series/106860/

Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
  https://patchwork.freedesktop.org/series/106860/
- Use single pm refcount for all active submits

Changes in v2:
- Rebased on msm-next tip

Akhil P Oommen (8):
  drm/msm: Remove unnecessary pm_runtime_get/put
  drm/msm: Take single rpm refcount on behalf of all submits
  drm/msm: Correct pm_runtime votes in recover worker
  drm/msm: Fix cx collapse issue during recovery
  drm/msm/a6xx: Ensure CX collapse during gpu recovery
  drm/msm/adreno: Remove a WARN() during runtime_suspend
  drm/msm/a6xx: Improve gpu recovery sequence
  drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

 drivers/gpu/drm/msm/adreno/a6xx.xml.h      |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c      | 83 +++++++++++++++++++-----------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      | 35 +++++++++++--
 drivers/gpu/drm/msm/adreno/adreno_device.c |  7 ---
 drivers/gpu/drm/msm/msm_gpu.c              | 21 +++++---
 drivers/gpu/drm/msm/msm_gpu.h              |  4 ++
 drivers/gpu/drm/msm/msm_ringbuffer.c       |  4 --
 7 files changed, 106 insertions(+), 52 deletions(-)

--
2.7.4