Re: [PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device

Felix Kuehling <felix.kuehling@xxxxxxx> · Tue, 26 Mar 2024 11:01:32 -0400

On 2024-03-26 10:53, Philip Yang wrote:


On 2024-03-25 14:45, Felix Kuehling wrote:
On 2024-03-22 15:57, Zhigang Luo wrote:
It will cause a page fault after the device has recovered if there is a 
process running.

Signed-off-by: Zhigang Luo <Zhigang.Luo@xxxxxxx>
Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 70261eb9b0bb..2867e9186e44 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4974,6 +4974,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
  retry:
      amdgpu_amdkfd_pre_reset(adev);
+    amdgpu_amdkfd_wait_no_process_running(adev);
+

This waits for the processes to be terminated. What would cause the 
processes to be terminated? Why do the processes need to be 
terminated? Isn't it enough if the processes are removed from the 
runlist in pre-reset, so they can no longer execute on the GPU?

Mode 1 reset on SRIOV is much faster than on bare metal (BM). 
kgd2kfd_pre_reset sends the GPU reset event to user space but doesn't 
remove queues from the runlist. After the mode 1 reset is done, there is 
a queue still running, and it generates a VM fault because the GPU page 
table is gone.

I think seeing a page fault during the reset is not a problem. Seeing a 
page fault after the reset would be a bug. The process should not be on 
the runlist after the reset is done.

Waiting for the process to terminate first looks like a workaround, when 
the real bug is maybe that we're not updating the process state 
correctly in pre-reset. All currently running processes should be put 
into evicted state, so they are not put back on the runlist after the reset.
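
To make the intended behaviour concrete, here is a minimal toy model in C. 
It is only a sketch and not the KFD code: struct toy_process, struct 
toy_runlist, toy_pre_reset() and toy_post_reset() are hypothetical names 
invented for this illustration. The assumption is that the post-reset path 
rebuilds the runlist only from processes that are not in evicted state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical types and helpers, not the KFD code. */
#define MAX_PROCS 8

struct toy_process {
    int  pasid;    /* hypothetical process identifier */
    bool evicted;  /* true: process must stay off the runlist */
};

struct toy_runlist {
    int    pasids[MAX_PROCS];
    size_t count;
};

/* Pre-reset: mark every process evicted and empty the runlist. */
static void toy_pre_reset(struct toy_process *procs, size_t n,
                          struct toy_runlist *rl)
{
    for (size_t i = 0; i < n; i++)
        procs[i].evicted = true;
    rl->count = 0;
}

/* Post-reset: rebuild the runlist from non-evicted processes only. */
static void toy_post_reset(const struct toy_process *procs, size_t n,
                           struct toy_runlist *rl)
{
    assert(n <= MAX_PROCS);
    rl->count = 0;
    for (size_t i = 0; i < n; i++)
        if (!procs[i].evicted)
            rl->pasids[rl->count++] = procs[i].pasid;
}
```

In this model, if toy_pre_reset() is skipped, toy_post_reset() puts every 
process back on the runlist, which corresponds to the page fault after the 
reset. If toy_pre_reset() has run, the runlist stays empty after the reset 
and there is no need to wait for the processes to terminate.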

Regards,
  Felix


Regards,

Philip


Regards,
  Felix


amdgpu_device_stop_pending_resets(adev);
        if (from_hypervisor)