Re: [PATCH 0/5] Fix mode1 reset test failures

The series is

Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>


On 2025-02-26 12:14, Philip Yang wrote:
Running the mode1 reset test alongside compute applications triggers many
different failures, such as machine reboots, kernel crashes with general
protection faults, NULL pointer accesses, or CPU page faults, each with a
seemingly random call backtrace.

With a kernel that has KASAN and slub_debug enabled, we capture a slub
left-redzone overwritten warning, but no KASAN warning before the crash. This
confirms that system memory is being overwritten by the GPU, not by the CPU.

If the hang_hws test is changed to evict user queues first and then do the
mode1 reset, the crash no longer occurs. This confirms that the system memory
is overwritten by user queues. Because the user queues keep using the GPU while
the KFD cleanup worker frees the outstanding BOs, the freed system memory is
reallocated and reused to create jobs and resources. The user queues then
corrupt those data structures and cause the crash.

The fix is in the KFD cleanup worker: after evicting all user queues, flush
reset_domain->wq to ensure that any ongoing mode1 reset is done or the user
queues are evicted, then free the outstanding BOs.
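
The ordering described above can be illustrated with a small user-space model.
This is only a sketch, not the driver code from the series: struct work_queue,
struct process_model and every function below are hypothetical stand-ins, and
the model assumes that an eviction request cannot take effect while a reset is
still pending on the work queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical stand-in for a reset work queue such as reset_domain->wq. */
struct work_queue {
	void (*fn[MAX_WORK])(void *);
	void *arg[MAX_WORK];
	size_t count;
};

/* Hypothetical model of one process and its buffer objects (BOs). */
struct process_model {
	bool queues_active;      /* user queues may still write to memory */
	bool reset_pending;      /* a reset is queued but has not run yet */
	bool bos_freed;
	bool freed_while_in_use; /* set if BOs were freed too early */
};

static bool queue_work(struct work_queue *wq, void (*fn)(void *), void *arg)
{
	if (wq->count == MAX_WORK)
		return false;
	wq->fn[wq->count] = fn;
	wq->arg[wq->count] = arg;
	wq->count++;
	return true;
}

/* Run every pending work item to completion, then empty the queue. */
static void flush_work_queue(struct work_queue *wq)
{
	for (size_t i = 0; i < wq->count; i++)
		wq->fn[i](wq->arg[i]);
	wq->count = 0;
}

/* Model assumption: once the reset has run, the queues are stopped. */
static void reset_work(void *arg)
{
	struct process_model *p = arg;

	p->queues_active = false;
	p->reset_pending = false;
}

/* Model assumption: eviction has no effect while a reset is pending. */
static void evict_queues(struct process_model *p)
{
	if (!p->reset_pending)
		p->queues_active = false;
}

static void free_bos(struct process_model *p)
{
	if (p->queues_active || p->reset_pending)
		p->freed_while_in_use = true;
	p->bos_freed = true;
}

/* Cleanup in the order described above: evict, flush, then free. */
static void cleanup(struct process_model *p, struct work_queue *wq,
		    bool flush_first)
{
	evict_queues(p);
	if (flush_first)
		flush_work_queue(wq);
	free_bos(p);
}

/* Returns true if the BOs were freed while the queues could still write. */
static bool cleanup_frees_in_use_memory(bool flush_first)
{
	struct work_queue wq = { .count = 0 };
	struct process_model p = {
		.queues_active = true,
		.reset_pending = true,
	};

	queue_work(&wq, reset_work, &p);
	cleanup(&p, &wq, flush_first);
	return p.freed_while_in_use;
}
```

In this model, cleanup_frees_in_use_memory(false) reports the early free, while
cleanup_frees_in_use_memory(true) does not, because the flush lets the pending
reset finish before the BOs are released.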

Philip Yang (5):
   drm/amdkfd: Remove kfd_process_hw_exception worker
   drm/amdkfd: KFD release_work possible circular locking
   drm/amdkfd: Fix mode1 reset crash issue
   drm/amdkfd: Fix pqm_destroy_queue race with GPU reset
   drm/amdkfd: debugfs hang_hws skip GPU with MES

  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  5 +++
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 11 +------
  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  1 -
  drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 33 ++++++++++++++-----
  .../amd/amdkfd/kfd_process_queue_manager.c    |  2 +-
  5 files changed, 32 insertions(+), 20 deletions(-)



