On 10.03.25 at 13:27, Tvrtko Ursulin wrote:
>
> On 10/03/2025 12:11, Philipp Stanner wrote:
>> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
>>
>> OK, your arguments with fence ordering are strong. Please update the
>> commit message according to our discussion:
>
> Could that argument please be explained in more concrete terms?
>
> Are we talking here about skipping one seqno having the potential to
> cause a problem, or is there more to it?
>
> Because if it is just skipping, I don't immediately see how that
> breaks the monotonic/unique seqno ordering.
>
> Only if we are worried about some code somewhere making the assumption
> "if N got completed, that means N-1 got completed too". That generally
> isn't anything new and can happen with GPU resets, albeit in the
> latter case the fence error is, I think, always set.

Exactly that is highly problematic.

In the case of a reset with all pending work canceled, it doesn't
matter whether fences are signaled A, B, C or C, B, A.

But when you can make fence C signal while A is still running, we might
start to clean up the VM, free memory, etc. while the shaders from A
are still able to access those resources. That's a security hole you
can push an elephant through.

Virtual memory on GPUs mitigates that quite a bit on modern hardware,
but we still have a bunch of use cases which rely on getting this
right.
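The robust pattern is the one the commit message below describes: once
drm_sched_job_arm() has been called, set the error on the finished
fence and push the job to the entity anyway, instead of trying to
retract it. Roughly like this; just an untested sketch, where struct
my_job, my_prepare() and my_hw_submit() are made-up placeholders for
driver-private code, and the drm_sched_job_init() signature assumed
here is the recent one taking a credits parameter:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

struct my_job {
	struct drm_sched_job base;
	/* driver-private state */
};

/* Made-up driver-private helpers, declarations only. */
static int my_prepare(struct my_job *job);
static int my_hw_submit(struct my_job *job);

static int my_submit(struct my_job *job, struct drm_sched_entity *entity)
{
	int ret;

	ret = drm_sched_job_init(&job->base, entity, 1, job);
	if (ret)
		return ret;

	ret = my_prepare(job);
	if (ret) {
		/* Not armed yet, so plain unwinding is still allowed. */
		drm_sched_job_cleanup(&job->base);
		return ret;
	}

	drm_sched_job_arm(&job->base);

	/*
	 * Point of no return: the job must not be retracted from here
	 * on. On error, flag the finished fence instead ...
	 */
	if (my_hw_submit(job))
		dma_fence_set_error(&job->base.s_fence->finished,
				    -ECANCELED);

	/*
	 * ... and push the job anyway, so the scheduler signals all
	 * fences in the order the dma_fence framework enforces.
	 */
	drm_sched_entity_push_job(&job->base);
	return 0;
}

That way a canceled C can never signal ahead of a still-running A;
consumers only ever see the error on its fence.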
Regards,
Christian.

> Regards,
>
> Tvrtko
>
>>> Sorry for the delayed response, I only stumbled over this now while
>>> going over old mails and then re-thinking my reviewed-by for this
>>> change.
>>
>> Your RB hadn't even been applied (I merged before you gave it), so
>> you can remove this first paragraph from the commit message.
>>
>>>
>>> The function drm_sched_job_arm() is indeed the point of no return.
>>> The background is that it is nearly impossible for the driver to
>>> correctly retract the fence and signal it in the order enforced by
>>> the dma_fence framework.
>>>
>>> The code in drm_sched_job_cleanup() is there to clean up after the
>>> job was armed through drm_sched_job_arm() *and* processed by the
>>> scheduler.
>>>
>>> The correct approach for error handling in this situation is to set
>>> the error on the fences and then push the job to the entity anyway.
>>> We can certainly improve the documentation, but removing the warning
>>> is clearly not a good idea.
>>
>> This last paragraph, as per our discussion, seems invalid. We
>> shouldn't have that in the commit log, so that it won't give later
>> hackers browsing it wrong ideas and we don't end up with someone
>> actually meddling with those fences.
>>
>> Thx
>> P.
>>
>>>
>>> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>>  1 file changed, 5 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 53e6aec37b46..4d4219fbe49d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>   * Cleans up the resources allocated with drm_sched_job_init().
>>>   *
>>>   * Drivers should call this from their error unwind code if @job is aborted
>>> - * before it was submitted to an entity with drm_sched_entity_push_job().
>>> + * before drm_sched_job_arm() is called.
>>>   *
>>> - * Since calling drm_sched_job_arm() causes the job's fences to be initialized,
>>> - * it is up to the driver to ensure that fences that were exposed to external
>>> - * parties get signaled. drm_sched_job_cleanup() does not ensure this.
>>> - *
>>> - * This function must also be called in &struct drm_sched_backend_ops.free_job
>>> + * After that point of no return @job is committed to be executed by the
>>> + * scheduler, and this function should be called from the
>>> + * &drm_sched_backend_ops.free_job callback.
>>>   */
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>  {
>>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>  	if (kref_read(&job->s_fence->finished.refcount)) {
>>>  		/* drm_sched_job_arm() has been called */
>>>  		dma_fence_put(&job->s_fence->finished);
>>>  	} else {
>>> -		/* aborted job before arming */
>>> +		/* aborted job before committing to run it */
>>>  		drm_sched_fence_free(job->s_fence);
>>>  	}
>>
>
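P.S.: For completeness, the free_job side that the restored
documentation points at; again only a sketch, reusing the made-up
struct my_job from above:

#include <linux/slab.h>

/*
 * Hooked up as &drm_sched_backend_ops.free_job; runs after the
 * scheduler is done with the armed job, which is the only point where
 * drm_sched_job_cleanup() is legal for an armed job.
 */
static void my_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	drm_sched_job_cleanup(sched_job);
	kfree(job);
}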