Re: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type

André Almeida <andrealmeid@xxxxxxxxxx> · Mon, 24 Apr 2023 10:26:33 -0300

Hi Christian, thank you for your comments.

On 24/04/2023 04:03, Christian König wrote:
> On 24.04.23 at 03:43, André Almeida wrote:
>> When a DRM job times out, the GPU has probably hung, and amdgpu has
>> several ways to deal with that, ranging from soft recovery to a full
>> device reset. In either case, when userspace asks the kernel for the
>> state of the context (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports
>> that the device was reset, regardless of whether a full reset happened.
>>
>> However, amdgpu only marks a context guilty in the ASIC reset path. This
>> makes the report to userspace incomplete, since on the soft recovery
>> path the guilty context is never told that it is the guilty one.
>>
>> Fix this by marking the context guilty for every type of reset when a
>> job times out.

> The guilty handling is pretty much broken by design and only works
> because we go through multiple hops of validating the entity after the
> job has already been pushed to the hw.

I see, thanks.

> I think we should probably just remove that completely and use an
> approach where we check the in-flight submissions in the query state
> IOCTL.

Like the DRM_IOCTL_I915_GET_RESET_STATS approach?

> See my other patch on the mailing list regarding that.

Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence 
errors" series?

> In addition to that, I currently don't consider soft-recovered
> submissions fatal and continue accepting submissions from that
> context, but I already wanted to talk with Marek about that behavior.

Interesting. I will try to test and validate this approach to see if the 
contexts keep working as expected on soft-resets.
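
To make the behaviour under discussion concrete, here is a toy C model of
the difference between marking a context guilty only on a full reset and
marking it on any recovery type. It is only a sketch: every name in it
(toy_ctx, toy_timeout_old, toy_timeout_patched, toy_query_state and the
enums) is hypothetical and none of it is amdgpu code. It also assumes that
a context which saw a recovery but was not marked guilty is reported as
innocent.

```c
#include <stdbool.h>

/* Hypothetical recovery types; illustrative names, not amdgpu's. */
enum toy_recovery { TOY_RECOVERY_SOFT, TOY_RECOVERY_FULL_RESET };

/* Hypothetical result of a "query state" call. */
enum toy_reset_status { TOY_NO_RESET, TOY_GUILTY_RESET, TOY_INNOCENT_RESET };

/* Hypothetical per-context bookkeeping. */
struct toy_ctx {
	bool guilty;            /* one of this context's jobs timed out */
	unsigned int recoveries; /* recoveries this context has observed */
};

/* Behaviour described as the problem: guilty only on a full reset. */
void toy_timeout_old(struct toy_ctx *ctx, enum toy_recovery type)
{
	ctx->recoveries++;
	if (type == TOY_RECOVERY_FULL_RESET)
		ctx->guilty = true;
}

/* Behaviour the patch aims for: guilty for every recovery type. */
void toy_timeout_patched(struct toy_ctx *ctx, enum toy_recovery type)
{
	(void)type; /* the recovery type no longer matters */
	ctx->recoveries++;
	ctx->guilty = true;
}

/* Report a reset whenever any recovery happened, soft or full. */
enum toy_reset_status toy_query_state(const struct toy_ctx *ctx)
{
	if (ctx->recoveries == 0)
		return TOY_NO_RESET;
	return ctx->guilty ? TOY_GUILTY_RESET : TOY_INNOCENT_RESET;
}
```

In this model, a soft recovery followed by toy_query_state() reports a
reset without guilt under the old behaviour, and a guilty reset with the
patched one. Whether the context should keep accepting submissions after a
soft recovery is a separate question that the model does not cover.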

> Regards,
> Christian.