It is useful to maintain error context when debugging SW/FW issues. We introduce amdgpu_device_halt() for this purpose. It will bring hardware to a kind of halt state, so that no one can touch it any more. Compare to a simple hang, the system will keep stable at least for SSH access. Then it should be trivial to inspect the hardware state and see what's going on. Suggested-by: Christian Koenig <christian.koenig@xxxxxxx> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++ 2 files changed, 41 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index c5cfe2926ca1..3f5f8f62aa5c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev, void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev, struct amdgpu_ring *ring); +void amdgpu_device_halt(struct amdgpu_device *adev); + /* atpx handler */ #if defined(CONFIG_VGA_SWITCHEROO) void amdgpu_register_atpx_handler(void); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index a1c14466f23d..62216627cc83 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev, amdgpu_asic_invalidate_hdp(adev, ring); } + +/** + * amdgpu_device_halt() - bring hardware to some kind of halt state + * + * @adev: amdgpu_device pointer + * + * Bring hardware to some kind of halt state so that no one can touch it + * any more. It will help to maintain error context when error occurred. + * Compare to a simple hang, the system will keep stable at least for SSH + * access. Then it should be trivial to inspect the hardware state and + * see what's going on. Implemented as following: + * + * 1. drm_dev_unplug() makes device inaccessible to user space(IOCTLs, etc), + * clears all CPU mappings to device, disallows remappings through page faults + * 2. amdgpu_irq_disable_all() disables all interrupts + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings + * 5. pci_disable_device() and pci_wait_for_pending_transaction() + * flush any in flight DMA operations + * 6. set adev->no_hw_access to true + */ +void amdgpu_device_halt(struct amdgpu_device *adev) +{ + struct pci_dev *pdev = adev->pdev; + struct drm_device *ddev = &adev->ddev; + + drm_dev_unplug(ddev); + + amdgpu_irq_disable_all(adev); + + amdgpu_fence_driver_hw_fini(adev); + + amdgpu_device_unmap_mmio(adev); + + pci_disable_device(pdev); + pci_wait_for_pending_transaction(pdev); + + adev->no_hw_access = true; +} -- 2.25.1