Running the mode1 reset test together with compute applications triggers
many different failures, such as machine reboots and kernel crashes with
general protection faults, NULL pointer accesses or CPU page faults, all
from random call backtraces.

With a kernel that has KASAN and slub_debug enabled, we capture a slub
left-redzone overwritten warning, but no KASAN warning before the crash.
This confirms that system memory is being overwritten by the GPU, not by
the CPU.

If the hang_hws test is changed to evict the user queues first and then do
the mode1 reset test, the crash no longer occurs. This confirms that the
system memory is overwritten by user queues.

Because the user queues keep using the GPU while the KFD cleanup worker
frees the outstanding BOs, the freed system memory is reallocated and
reused to create jobs and resources. Those data structures are then
corrupted by the user queues, which causes the crash.

The fix is in the KFD cleanup worker: after evicting all user queues, flush
reset_domain->wq to ensure that any ongoing mode1 reset is done or the user
queues are evicted, and only then free the outstanding BOs.

Philip Yang (5):
  drm/amdkfd: Remove kfd_process_hw_exception worker
  drm/amdkfd: KFD release_work possible circular locking
  drm/amdkfd: Fix mode1 reset crash issue
  drm/amdkfd: Fix pqm_destroy_queue race with GPU reset
  drm/amdkfd: debugfs hang_hws skip GPU with MES

 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  5 +++
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 11 +------
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 33 ++++++++++++++-----
 .../amd/amdkfd/kfd_process_queue_manager.c    |  2 +-
 5 files changed, 32 insertions(+), 20 deletions(-)

-- 
2.47.1
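
For illustration only, the ordering described for the KFD cleanup worker
(evict user queues, flush the reset workqueue, then free the BOs) can be
modelled in plain userspace C. This is a simplified sketch, not driver
code: every name below is hypothetical, and the model assumes that an
eviction issued while a mode1 reset is still pending does not stop the
queues until the reset work has completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names exist in the amdkfd driver. */
struct model_process {
	bool queues_active;    /* user queues can still write memory via the GPU */
	bool reset_pending;    /* a mode1 reset work item is queued or running */
	bool bos_freed;        /* outstanding BOs have been released */
	bool freed_while_busy; /* BOs were freed while the GPU could still write */
};

/* Model assumption: eviction only takes effect if no reset is in flight. */
static void model_evict_queues(struct model_process *p)
{
	if (!p->reset_pending)
		p->queues_active = false;
}

/* Stands in for flushing the reset workqueue: wait for the reset to finish. */
static void model_flush_reset_wq(struct model_process *p)
{
	if (p->reset_pending) {
		p->reset_pending = false;
		p->queues_active = false; /* queues no longer run after the reset */
	}
}

/* Freeing while queues are active is the use-after-free window. */
static void model_free_bos(struct model_process *p)
{
	if (p->queues_active)
		p->freed_while_busy = true;
	p->bos_freed = true;
}

/* Returns true if the BOs were freed only after the GPU stopped using them. */
static bool model_cleanup(bool reset_pending, bool flush_before_free)
{
	struct model_process p = {
		.queues_active = true,
		.reset_pending = reset_pending,
	};

	model_evict_queues(&p);
	if (flush_before_free)
		model_flush_reset_wq(&p);
	model_free_bos(&p);

	return p.bos_freed && !p.freed_while_busy;
}
```

In this model, model_cleanup(true, false) reports an unsafe free, while
model_cleanup(true, true) does not. That is the only point of the sketch:
the flush has to sit between the eviction and the free.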