Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

On 17.12.21 at 23:27, Andrey Grodzovsky wrote:
This patchset is based on earlier work by Boris[1] that made it possible to have an
ordered workqueue at the driver level, which the different schedulers use to queue
their timeout work. On top of that, I also serialized every GPU reset we trigger
from within amdgpu code through the same ordered wq. This somewhat simplifies our
GPU reset code, because we no longer need to protect against concurrency between
multiple GPU reset triggers, such as TDR on one hand and a sysfs or RAS trigger on
the other.
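A minimal userspace sketch of the idea, assuming nothing about amdgpu internals: every reset trigger queues its work on one ordered queue, and a single worker thread runs the items one at a time in FIFO order, so a TDR reset and a sysfs or RAS reset can never run concurrently. In the kernel this role would be played by a workqueue created with `alloc_ordered_workqueue()`, which executes at most one work item at a time in queueing order. The names `ordered_wq`, `wq_queue` and `fake_gpu_reset` below are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORK 64

typedef void (*work_fn)(void);

/* Hypothetical ordered queue: one worker thread runs items in FIFO order. */
struct ordered_wq {
	work_fn items[MAX_WORK];
	size_t head, tail;		/* monotonically increasing FIFO indices */
	bool stopping;
	pthread_mutex_t lock;
	pthread_cond_t cond;
	pthread_t worker;
};

static void *wq_worker(void *arg)
{
	struct ordered_wq *wq = arg;

	pthread_mutex_lock(&wq->lock);
	for (;;) {
		while (wq->head == wq->tail && !wq->stopping)
			pthread_cond_wait(&wq->cond, &wq->lock);
		if (wq->head == wq->tail)
			break;		/* stopping and queue fully drained */
		work_fn fn = wq->items[wq->head++ % MAX_WORK];
		pthread_mutex_unlock(&wq->lock);
		fn();			/* only this thread ever runs work */
		pthread_mutex_lock(&wq->lock);
	}
	pthread_mutex_unlock(&wq->lock);
	return NULL;
}

static void wq_init(struct ordered_wq *wq)
{
	wq->head = wq->tail = 0;
	wq->stopping = false;
	pthread_mutex_init(&wq->lock, NULL);
	pthread_cond_init(&wq->cond, NULL);
	pthread_create(&wq->worker, NULL, wq_worker, wq);
}

/* Called by any trigger (TDR, sysfs, RAS) from any thread. */
static void wq_queue(struct ordered_wq *wq, work_fn fn)
{
	pthread_mutex_lock(&wq->lock);
	assert(wq->tail - wq->head < MAX_WORK);	/* sketch: no overflow handling */
	wq->items[wq->tail++ % MAX_WORK] = fn;
	pthread_cond_signal(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
}

/* Let the worker finish all queued work, then join it. */
static void wq_destroy(struct ordered_wq *wq)
{
	pthread_mutex_lock(&wq->lock);
	wq->stopping = true;
	pthread_cond_signal(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
	pthread_join(wq->worker, NULL);
	pthread_mutex_destroy(&wq->lock);
	pthread_cond_destroy(&wq->cond);
}

static atomic_int resets_running;	/* resets executing right now */
static atomic_int resets_done;		/* resets completed so far */
static atomic_bool overlapped;		/* set if two resets ever ran at once */

static void fake_gpu_reset(void)
{
	if (atomic_fetch_add(&resets_running, 1) > 0)
		atomic_store(&overlapped, true);
	/* ... the actual hardware reset would go here ... */
	atomic_fetch_sub(&resets_running, 1);
	atomic_fetch_add(&resets_done, 1);
}
```

Because the worker is the only thread that ever calls a reset handler, the handlers need no extra locking against each other, which is the simplification described above.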

As advised by Christian and Daniel, I defined a reset_domain struct so that
all the entities that go through reset together are serialized against one
another.
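The reset_domain idea can be sketched as a shared, reference-counted object: devices that go through reset together (for example, all GPUs in one XGMI hive) hold a pointer to the same domain, while a standalone GPU gets its own. This is a hypothetical illustration, not the struct from the patchset; all field and function names are made up, and for brevity it serializes with a mutex held across the reset rather than with the ordered workqueue the series actually uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical reset domain shared by every device that resets together. */
struct reset_domain {
	int refcount;			/* number of devices using this domain */
	pthread_mutex_t reset_lock;	/* held for the whole duration of a reset */
	int resets_done;
};

struct gpu_device {
	const char *name;
	struct reset_domain *domain;
};

static struct reset_domain *reset_domain_create(void)
{
	struct reset_domain *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->refcount = 1;
	pthread_mutex_init(&d->reset_lock, NULL);
	return d;
}

static struct reset_domain *reset_domain_get(struct reset_domain *d)
{
	d->refcount++;	/* sketch: not atomic, assumes single-threaded setup */
	return d;
}

static void reset_domain_put(struct reset_domain *d)
{
	if (--d->refcount == 0) {
		pthread_mutex_destroy(&d->reset_lock);
		free(d);
	}
}

/* A reset on any device in a domain takes that domain's single lock,
 * so resets within one domain never overlap. */
static void device_reset(struct gpu_device *dev)
{
	pthread_mutex_lock(&dev->domain->reset_lock);
	dev->domain->resets_done++;
	/* ... reset every device in the domain here ... */
	pthread_mutex_unlock(&dev->domain->reset_lock);
}
```

Devices in different domains take different locks, so a reset of a standalone GPU does not wait for a reset of an unrelated hive.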

If multiple entities within the same domain trigger TDR for the same reason, only
one reset is performed: the first such reset cancels all the pending ones. This
applies only to TDR timers, not to resets triggered from RAS or sysfs; those
still happen after the in-flight resets finish.
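The cancellation rule above can be modelled with a per-domain queue of pending reset requests. This is a simplified, hypothetical sketch: when a TDR-triggered reset runs, it drops every other pending TDR request in the same domain, while RAS and sysfs requests stay queued and run after it. The enum and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

enum reset_source { RESET_TDR, RESET_RAS, RESET_SYSFS };

#define MAX_PENDING 16

/* Hypothetical pending-reset queue for one reset domain, oldest first. */
struct reset_queue {
	enum reset_source pending[MAX_PENDING];
	size_t count;
};

static void request_reset(struct reset_queue *q, enum reset_source src)
{
	assert(q->count < MAX_PENDING);
	q->pending[q->count++] = src;
}

/* Run the oldest pending reset. If it came from a TDR timer, cancel every
 * other pending TDR request, since this reset already recovers the hang the
 * other timers fired for. RAS and sysfs requests are kept in order.
 * Returns the source of the reset that ran, or -1 if nothing was pending. */
static int run_next_reset(struct reset_queue *q)
{
	if (q->count == 0)
		return -1;

	enum reset_source ran = q->pending[0];
	size_t kept = 0;

	for (size_t i = 1; i < q->count; i++) {
		if (ran == RESET_TDR && q->pending[i] == RESET_TDR)
			continue;	/* cancelled by the reset that just ran */
		q->pending[kept++] = q->pending[i];
	}
	q->count = kept;
	return (int)ran;
}
```

For example, the requests TDR, TDR, RAS, TDR, sysfs result in exactly three resets: one TDR reset, then RAS, then sysfs. In kernel code, dropping a pending TDR would presumably mean cancelling the other schedulers' timeout (delayed) work, for instance with `cancel_delayed_work()`; the sketch only models the resulting order.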

[1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@xxxxxxxxxxxxx/

P.S. This goes through drm-misc-next rather than amd-staging-drm-next, as Boris's work hasn't landed there yet.

Patches #1, #5 and #6 are Reviewed-by: Christian König <christian.koenig@xxxxxxx>

I have some minor comments on the rest, but in general this absolutely looks like the way we want to go.

Regards,
Christian.


Andrey Grodzovsky (6):
   drm/amdgpu: Init GPU reset single threaded wq
   drm/amdgpu: Move scheduler init to after XGMI is ready
   drm/amdgpu: Fix crash on modprobe
   drm/amdgpu: Serialize non TDR gpu recovery with TDRs
   drm/amdgpu: Drop hive->in_reset
   drm/amdgpu: Drop concurrent GPU reset protection for device

  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   9 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++++++++++----------
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
  7 files changed, 132 insertions(+), 136 deletions(-)
