Am 10.03.25 um 13:11 schrieb Philipp Stanner: > On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote: >> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab. > OK, your arguments with fence ordering are strong. Please update the > commit message according to our discussion: > >> Sorry for the delayed response, I only stumbled over this now while >> going >> over old mails and then re-thinking my reviewed by for this change. > Your RB hadn't even been applied (I merged before you gave it), so you > can remove this first paragraph from the commit message > >> The function drm_sched_job_arm() is indeed the point of no return. >> The >> background is that it is nearly impossible for the driver to >> correctly >> retract the fence and signal it in the order enforced by the >> dma_fence >> framework. >> >> The code in drm_sched_job_cleanup() is for the purpose to cleanup >> after >> the job was armed through drm_sched_job_arm() *and* processed by the >> scheduler. >> >> The correct approach for error handling in this situation is to set >> the >> error on the fences and then push to the entity anyway. We can >> certainly >> improve the documentation, but removing the warning is clearly not a >> good >> idea. > This last paragraph, as per our discussion, seems invalid. We shouldn't > have that in the commit log, so that it won't give later hackers > browsing it wrong ideas and we end up with someone actually mengling > with those fences. Sure, going to make those updates. I just wanted to give people a possible direction to look into when they really run into a situation where they need to abort some submission very late. Should I also clarify the comment in drm_sched_job_cleanup()? Regards, Christian. > > Thx > P. > >> Signed-off-by: Christian König <christian.koenig@xxxxxxx> >> --- >> drivers/gpu/drm/scheduler/sched_main.c | 12 +++++------- >> 1 file changed, 5 insertions(+), 7 deletions(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >> b/drivers/gpu/drm/scheduler/sched_main.c >> index 53e6aec37b46..4d4219fbe49d 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency); >> * Cleans up the resources allocated with drm_sched_job_init(). >> * >> * Drivers should call this from their error unwind code if @job is >> aborted >> - * before it was submitted to an entity with >> drm_sched_entity_push_job(). >> + * before drm_sched_job_arm() is called. >> * >> - * Since calling drm_sched_job_arm() causes the job's fences to be >> initialized, >> - * it is up to the driver to ensure that fences that were exposed to >> external >> - * parties get signaled. drm_sched_job_cleanup() does not ensure >> this. >> - * >> - * This function must also be called in &struct >> drm_sched_backend_ops.free_job >> + * After that point of no return @job is committed to be executed by >> the >> + * scheduler, and this function should be called from the >> + * &drm_sched_backend_ops.free_job callback. >> */ >> void drm_sched_job_cleanup(struct drm_sched_job *job) >> { >> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job >> *job) >> /* drm_sched_job_arm() has been called */ >> dma_fence_put(&job->s_fence->finished); >> } else { >> - /* aborted job before arming */ >> + /* aborted job before committing to run it */ >> drm_sched_fence_free(job->s_fence); >> } >>