On Mon, Feb 24, 2025 at 03:43:49PM +0100, Danilo Krummrich wrote:
> On Mon, Feb 24, 2025 at 10:29:26AM -0300, Maíra Canal wrote:
> > On 20/02/25 12:28, Philipp Stanner wrote:
> > > On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> > > > Would it be possible to add a comment that `run_job()` must check if
> > > > `s_fence->finished.error` is different than 0? If you increase the
> > > > karma of a job and don't check for `s_fence->finished.error`, you
> > > > might run a cancelled job.
> > >
> > > s_fence->finished is only signaled and its error set once the hardware
> > > fence got signaled; or when the entity is killed.
> >
> > If you have a timeout, increase the karma of that job with
> > `drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
> > latter will flag an error in the dma fence. If you don't check for it in
> > `run_job()`, you will run the guilty job again.
>
> Considering that drm_sched_resubmit_jobs() is deprecated I don't think we need
> to add this hint to the documentation; the drivers that are still using the API
> hopefully got it right.
>
> > I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
> > currently fixing an issue in V3D with the GPU reset and we still use
> > `drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
> > `timeout_job()` and the information I commented here (which was crucial
> > to fix the bug) wasn't available there.
>
> Well, hopefully... :-)
>
> >
> > `drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
> > new use in 2023
>
> Yeah, that's a bit odd, since Xe relies on a firmware scheduler and uses a 1:1
> scheduler - entity setup. I'm a bit surprised Xe does use this function.
>

To clarify Xe's usage: we use this function to resubmit jobs after a device
reset for queues which had nothing to do with the device reset. In practice, a
device reset should never occur, as we have per-queue resets in our hardware.
If a per-queue reset occurs, we ban the queue rather than doing a resubmit.

Matt

> > for example. The commit that deprecated it just
> > mentions AMD's case, but do we know if the function works as expected
> > for the other users?
>
> I read the comment [1] you're referring to differently. It says that
> "Re-submitting jobs was a concept AMD came up as cheap way to implement recovery
> after a job timeout".
>
> It further explains that "there are many problem with the dma_fence
> implementation and requirements. Either the implementation is risking deadlocks
> with core memory management or violating documented implementation details of
> the dma_fence object", which doesn't give any hint to me that the conceptual
> issues are limited to amdgpu.
>
> > For V3D, it does. Also, we need to make it clear which
> > are the dma fence requirements that the function violates.
>
> This I fully agree with, unfortunately the comment does not explain what's the
> issue at all.
>
> While I do think I have a vague idea of what's the potential issue with this
> approach, I think it would be way better to get Christian, as the expert for DMA
> fence rules to comment on this.
>
> @Christian: Can you please shed some light on this?
>
> >
> > If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
> > provide a common interface for job resubmission?
>
> I wonder why this question did not come up when drm_sched_resubmit_jobs() was
> deprecated two years ago, did it?
>
> Anyway, let's shed some light on the difficulties with drm_sched_resubmit_jobs()
> and then we can figure out how we can do better.
>
> I think it would also be interesting to know how amdgpu handles jobs from
> unrelated entities being discarded by not re-submitting them when a job from
> another entity hangs the HW ring.
>
> [1] https://patchwork.freedesktop.org/patch/msgid/20221109095010.141189-5-christian.koenig@xxxxxxx
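
For anyone following along, the `run_job()` point quoted above (check `s_fence->finished.error` before touching the hardware) can be sketched roughly as below. This is only a userspace model under stated assumptions, not driver code: the structs are cut-down stand-ins for the kernel types with just the fields mentioned in this thread, and `example_run_job()`, `example_resubmit()`, `example_hw_submissions` and `example_hw_fence` are hypothetical names made up for illustration. The resubmit helper models only the "flag an error, then call run_job() anyway" behaviour described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields discussed here. */
struct dma_fence { int error; };
struct drm_sched_fence { struct dma_fence finished; };
struct drm_sched_job { struct drm_sched_fence *s_fence; };

/* Hypothetical: counts jobs actually handed to the "hardware". */
int example_hw_submissions;
/* Hypothetical: stands in for the hardware fence a real driver would return. */
struct dma_fence example_hw_fence;

/* Hypothetical run_job() callback for an imaginary driver. */
struct dma_fence *example_run_job(struct drm_sched_job *job)
{
	assert(job && job->s_fence);

	/* The resubmit path flagged this job (e.g. -ECANCELED): do not run it. */
	if (job->s_fence->finished.error)
		return NULL;

	example_hw_submissions++;
	return &example_hw_fence;
}

/*
 * Simplified model of the resubmit behaviour described in this thread:
 * a guilty job gets an error set on its finished fence, and run_job()
 * is still called for it afterwards.
 */
struct dma_fence *example_resubmit(struct drm_sched_job *job, int guilty)
{
	if (guilty)
		job->s_fence->finished.error = -ECANCELED;

	return example_run_job(job);
}
```

With this model, an innocent job passed through `example_resubmit()` reaches the hardware counter, while a guilty one returns NULL without being submitted. Dropping the error check in `example_run_job()` would submit the guilty job again, which is the situation Maíra describes.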