Hi Christian, thank you for your comments.
On 24/04/2023 04:03, Christian König wrote:
On 24.04.23 at 03:43, André Almeida wrote:
When a DRM job times out, the GPU is probably hung and amdgpu has some
ways to deal with that, ranging from soft recoveries to a full device
reset. Anyway, when userspace asks the kernel for the state of the context
(via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
reset, regardless of whether a full reset happened or not.
However, amdgpu only marks a context guilty in the ASIC reset path. This
makes the userspace report incomplete, given that on the soft recovery path
the guilty context is not told that it's the guilty one.
Fix this by marking the context guilty for every type of reset when a
job times out.
The guilty handling is pretty much broken by design and only works
because we go through multiple hops of validating the entity after the
job has already been pushed to the hw.
I see, thanks.
I think we should probably just remove that completely and use an
approach where we check the in flight submissions in the query state
IOCTL.
Like the DRM_IOCTL_I915_GET_RESET_STATS approach?
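To check my understanding of the idea, here is a minimal standalone sketch
of deriving the state at query time from the in-flight submissions. This is
not amdgpu code: every name in it (toy_ctx, toy_submission,
toy_query_state) is hypothetical, and treating any non-zero error as
"guilty" is my assumption, not something the driver does today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not amdgpu code: one entry per in-flight submission. */
struct toy_submission {
	int error;	/* 0 while healthy, non-zero once the job has failed */
};

/* Hypothetical context: just a view over its in-flight submissions. */
struct toy_ctx {
	const struct toy_submission *inflight;
	size_t count;
};

enum toy_reset_state {
	TOY_NO_RESET,
	TOY_GUILTY_RESET,
};

/*
 * Derive the state when userspace asks for it, by looking at the
 * submissions themselves, instead of relying on a "guilty" flag that
 * only some recovery paths remember to set.
 */
static enum toy_reset_state toy_query_state(const struct toy_ctx *ctx)
{
	size_t i;

	assert(ctx != NULL);

	for (i = 0; i < ctx->count; i++) {
		/* Assumption: any recorded error marks the context guilty. */
		if (ctx->inflight[i].error != 0)
			return TOY_GUILTY_RESET;
	}
	return TOY_NO_RESET;
}
```

With this shape the answer does not depend on which recovery path ran, only
on whether an error was recorded on one of the context's submissions.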
> See my other patch on the mailing list regarding that.
Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence
errors" series?
In addition to that, I currently don't consider soft-recovered
submissions as fatal and continue accepting submissions from that
context, but I already wanted to talk with Marek about that behavior.
Interesting. I will try to test and validate this approach to see if the
contexts keep working as expected on soft-resets.
Regards,
Christian.