Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2022-05-19 03:58, Christian König wrote:


Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:


On 2022-05-18 02:07, Christian König wrote:
Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
Problem:
During hive reset caused by command timing out on a ring
extra resets are generated by triggered by KFD which is
unable to accesses registers on the resetting ASIC.

Fix: Rework GPU reset to actively stop any pending reset
works while another in progress.

v2: Switch from generic list as was in v1[1] to eplicit
stopping of each reset request from each reset source
per each request submitter.

Looks mostly good to me.

Apart from the naming nit pick on patch #1 the only thing I couldn't of hand figure out is why you are using a delayed work everywhere instead of a just a work item.

That needs a bit further explanation what's happening here.

Christian.


Check APIs for cancelling work vs. delayed work -

For work_struct the only public API is this - https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 - blocking cancel.

For delayed_work we have both blocking and non blocking public APIs -

https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295

https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295

I prefer not to go now into convincing core kernel people of exposing another interface for our own sake - from my past experience API changes in core code has slim chances and a lot of time spent on back and forth arguments.

"If the mountain will not come to Muhammad, then Muhammad must go to the mountain" ;)


Ah, good point. The cancel_work() function was removed a few years ago:

commit 6417250d3f894e66a68ba1cd93676143f2376a6f
Author: Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx>
Date:   Tue Mar 6 19:34:42 2018 -0800

    workqueue: remove unused cancel_work()
    
    Found this by accident.
    There are no usages of bare cancel_work() in current kernel source.
    
    Signed-off-by: Stephen Hemminger <stephen@xxxxxxxxxxxxxxxxxx>
    Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>


Maybe just revert that patch, export the function and use it. I think there is plenty of justification for this.

Thanks,
Christian.


Ok - i will send them a patch - let's see what they say.

Andrey



Andrey



[1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@xxxxxxx/

Andrey Grodzovsky (7):
   drm/amdgpu: Cache result of last reset at reset domain level.
   drm/amdgpu: Switch to delayed work from work_struct.
   drm/admgpu: Serialize RAS recovery work directly into reset domain
     queue.
   drm/amdgpu: Add delayed work for GPU reset from debugfs
   drm/amdgpu: Add delayed work for GPU reset from kfd.
   drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
     amdgpu_device_gpu_recover
   drm/amdgpu: Stop any pending reset if another in progress.

  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
  14 files changed, 87 insertions(+), 54 deletions(-)




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux