Just another ping. With Shyun's help I was able to do some smoke testing
on an XGMI SRIOV system (booting and triggering a hive reset),
and for now it looks good.
Andrey
On 2022-01-28 14:36, Andrey Grodzovsky wrote:
Just a gentle ping in case people have more comments on this patch set,
especially the last 5 patches, as the first 7 are exactly the same as in V2
and we have already mostly gone over them.
Andrey
On 2022-01-25 17:37, Andrey Grodzovsky wrote:
This patchset is based on earlier work by Boris[1] that allowed an
ordered workqueue at the driver level to be used by the different
schedulers to queue their timeout work. On top of that I also serialized
any GPU reset we trigger from within amdgpu code to go through the same
ordered wq. This somewhat simplifies our GPU reset code, since we no longer
need to protect against concurrency between multiple GPU reset triggers,
such as TDR on one hand and a sysfs trigger or RAS trigger on the other.
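To illustrate the serialization idea, here is a minimal userspace sketch in C. It is not the amdgpu code or the kernel workqueue API; struct ordered_queue, queue_reset(), do_reset() and drain() are hypothetical names made up for this example. It only shows why funnelling every reset trigger through one ordered queue removes the need for extra locking between triggers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the amdgpu or kernel workqueue API. */
enum reset_source { RESET_TDR, RESET_SYSFS, RESET_RAS };

#define MAX_PENDING 8

struct ordered_queue {
    enum reset_source pending[MAX_PENDING]; /* FIFO ring of reset requests */
    size_t head, count;
    int in_reset;     /* nonzero while a reset handler is running */
    int resets_done;  /* number of resets executed so far */
};

/* Any trigger (TDR, sysfs, RAS) only enqueues; it never resets directly. */
static int queue_reset(struct ordered_queue *q, enum reset_source src)
{
    if (q->count == MAX_PENDING)
        return -1;
    q->pending[(q->head + q->count) % MAX_PENDING] = src;
    q->count++;
    return 0;
}

/* The reset itself. No lock is taken here. */
static void do_reset(struct ordered_queue *q, enum reset_source src)
{
    (void)src;
    /* Holds because drain() runs exactly one item at a time. */
    assert(!q->in_reset);
    q->in_reset = 1;
    q->resets_done++;
    q->in_reset = 0;
}

/* Stand-in for the single worker of an ordered queue. */
static void drain(struct ordered_queue *q)
{
    while (q->count > 0) {
        enum reset_source src = q->pending[q->head];

        q->head = (q->head + 1) % MAX_PENDING;
        q->count--;
        do_reset(q, src);
    }
}
```

In this model a TDR and a sysfs trigger queued back to back are both handled, one after the other, and do_reset() never observes another reset in progress.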
As advised by Christian and Daniel, I defined a reset_domain struct such
that all the entities that go through reset together are serialized
against one another.
TDRs triggered by multiple entities within the same domain for the same
reason will not all run, as the first such reset cancels all the pending
resets. This applies only to TDR timers and not to triggered resets coming
from RAS or sysfs; those will still happen after the in-flight reset
finishes.
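The cancellation rule above can be sketched with a small userspace model in C. This is an assumption-laden illustration, not the patch code: struct reset_domain_model, domain_queue(), cancel_pending_tdr() and domain_run() are hypothetical names, and the model assumes that every reset that runs drops the TDR requests still pending in its domain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the amdgpu API. */
enum reset_source { RESET_TDR, RESET_SYSFS, RESET_RAS };

#define MAX_PENDING 8

struct reset_domain_model {
    enum reset_source pending[MAX_PENDING]; /* oldest request first */
    size_t count;
    int resets_done;
};

static int domain_queue(struct reset_domain_model *d, enum reset_source src)
{
    if (d->count == MAX_PENDING)
        return -1;
    d->pending[d->count++] = src;
    return 0;
}

/* Drop pending TDR requests; keep sysfs and RAS requests in order. */
static void cancel_pending_tdr(struct reset_domain_model *d)
{
    size_t kept = 0;

    for (size_t i = 0; i < d->count; i++) {
        if (d->pending[i] != RESET_TDR)
            d->pending[kept++] = d->pending[i];
    }
    d->count = kept;
}

/* Run queued resets one at a time, oldest first. */
static void domain_run(struct reset_domain_model *d)
{
    while (d->count > 0) {
        /* Pop the oldest request. */
        for (size_t i = 1; i < d->count; i++)
            d->pending[i - 1] = d->pending[i];
        d->count--;

        d->resets_done++;       /* the reset itself */
        cancel_pending_tdr(d);  /* assumed: a reset obsoletes queued TDRs */
    }
}
```

With TDR, TDR, sysfs, TDR queued in that order, this model performs two resets: the first TDR, which cancels the other two TDRs, and then the sysfs-triggered one.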
v2:
Add handling for the SRIOV configuration: the reset notification comes from
the host and the driver already triggers a work queue to handle the reset,
so drop this intermediate wq and send directly to the timeout wq. (Shaoyun)
v3:
Lijo suggested putting 'adev->in_gpu_reset' in the amdgpu_reset_domain
struct. I followed his advice and also moved adev->reset_sem into the same
place. This in turn required some follow-up refactoring of the original
patches, where I decoupled the amdgpu_reset_domain life cycle from the XGMI
hive, because the hive is destroyed and reconstructed when resetting the
devices in the XGMI hive during probe for SRIOV (see [2]), while we need the
reset sem and gpu_reset flag to always be present. This was attained by
adding a refcount to amdgpu_reset_domain so each device can safely point to
it as long as it needs.
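As a rough illustration of the refcounting idea, here is a userspace sketch in C. It is not the code from the patches: struct reset_domain_model, struct device_model and the domain_*/device_* helpers are hypothetical names, and a plain int stands in for what would be an atomic reference count in the kernel.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; not the amdgpu structures. */
struct reset_domain_model {
    int refcount;      /* plain int here; not thread-safe */
    int in_gpu_reset;  /* state that must outlive the hive */
};

static int domains_released; /* counts frees, for demonstration only */

static struct reset_domain_model *domain_create(void)
{
    struct reset_domain_model *d = calloc(1, sizeof(*d));

    if (d)
        d->refcount = 1; /* reference held by the creator */
    return d;
}

static void domain_get(struct reset_domain_model *d)
{
    assert(d->refcount > 0);
    d->refcount++;
}

static void domain_put(struct reset_domain_model *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        domains_released++;
        free(d);
    }
}

struct device_model {
    struct reset_domain_model *domain;
};

/* Each device takes its own reference, independent of the creator. */
static void device_attach(struct device_model *dev,
                          struct reset_domain_model *d)
{
    domain_get(d);
    dev->domain = d;
}

static void device_detach(struct device_model *dev)
{
    domain_put(dev->domain);
    dev->domain = NULL;
}
```

In this model the creator (standing in for the hive) can drop its reference while devices are still attached, and the domain is freed only when the last device detaches.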
[1]
https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@xxxxxxxxxxxxx/
[2] https://www.spinics.net/lists/amd-gfx/msg58836.html
P.S. Going through drm-misc-next and not amd-staging-drm-next, as Boris's
work hasn't landed there yet.
P.P.S. Patches 8-12 are the refactor on top of the original V2 patchset.
P.P.P.S. I wasn't yet able to test the reworked code on an XGMI SRIOV
system because drm-misc-next fails to load there.
I would appreciate it if jingwech could try it on his system, like he
tested V2.
Andrey Grodzovsky (12):
drm/amdgpu: Introduce reset domain
drm/amdgpu: Move scheduler init to after XGMI is ready
drm/amdgpu: Fix crash on modprobe
drm/amdgpu: Serialize non TDR gpu recovery with TDRs
drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
drm/amdgpu: Drop hive->in_reset
drm/amdgpu: Drop concurrent GPU reset protection for device
drm/amdgpu: Rework reset domain to be refcounted.
drm/amdgpu: Move reset sem into reset_domain
drm/amdgpu: Move in_gpu_reset into reset_domain
drm/amdgpu: Rework amdgpu_device_lock_adev
Revert 'drm/amdgpu: annotate a false positive recursive locking'
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 15 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 10 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 275 ++++++++++--------
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 43 +--
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 18 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 39 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 12 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 24 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 3 +-
drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +-
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 14 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 19 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 19 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 11 +-
16 files changed, 313 insertions(+), 199 deletions(-)