RE: [PATCH v2] drm/amdgpu: fix system hang issue during GPU reset

[AMD Official Use Only - Internal Distribution Only]

Hi, Paul,
      I used our internal tool to hang the GPU and run a stress test. In the kernel, when the GPU hangs, the driver has multiple paths into amdgpu_device_gpu_recover, so the atomic adev->in_gpu_reset is used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe for other threads to access the GPU, because that may cause the reset to fail. Therefore a new rw_semaphore, adev->reset_sem, is introduced to protect the GPU from being accessed by external threads while recovery is in progress.

Best Regards
Dennis Li
-----Original Message-----
From: Paul Menzel <pmenzel+amd-gfx@xxxxxxxxxxxxx> 
Sent: Wednesday, July 8, 2020 7:42 PM
To: Li, Dennis <Dennis.Li@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Alex Deucher <alexdeucher@xxxxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>
Subject: Re: [PATCH v2] drm/amdgpu: fix system hang issue during GPU reset

Dear Dennis,


Thank you for your patch.

On 2020-07-08 09:48, Dennis Li wrote:
> During GPU reset, driver should hold on all external access to
> GPU, otherwise psp will randomly fail to do post, and then cause
> system hang.

Maybe update the commit message summary to read:

> Avoid external GPU access on GPU reset to fix system hang

As I am also experiencing system hangs, it would be great to have more
details. What systems are affected? What PSP firmware version? Will the
PSP firmware be fixed, or is the Linux driver violating the API?

How can the hang be reproduced?

Lastly, please explain your changes. Why does `atomic_read()` help, for
example?

> v2:
> 1. add rwlock for some ioctls, debugfs and file-close function.
> 2. change to use dqm->is_resetting and dqm_lock for protection in kfd
> driver.
> 3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
> re-enter GPU recovery for the same GPU hang.
> 
> Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>
> Change-Id: I7f77a72795462587ed7d5f51fe53a594a0f1f708

[…]


Kind regards,

Paul
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx