On 10.03.25 at 13:27, Tvrtko Ursulin wrote:
>
> On 10/03/2025 12:11, Philipp Stanner wrote:
>> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
>>
>> OK, your arguments with fence ordering are strong. Please update the
>> commit message according to our discussion:
>
> Could that argument please be explained in more concrete terms?
>
> Are we talking here about skipping one seqno having the potential to
> cause a problem, or is there more to it?
>
> Because if it is just skipping, I don't immediately see how that
> breaks the monotonic/unique seqno ordering.
>
> Only if we are worried about some code somewhere making the assumption
> "if N got completed, that means N-1 got completed too". That generally
> isn't anything new and can happen with GPU resets, albeit in the
> latter case the fence error is, I think, always set.

Exactly that is highly problematic.

In the case of a reset with all pending work canceled, it doesn't
matter whether fences are signaled A, B, C or C, B, A.

But when you can make fence C signal while A is still running, we might
start to clean up the VM, free memory, etc. while the shaders from A
are still able to access those resources. That's a security hole you
can push an elephant through.

Virtual memory on GPUs mitigates that quite a bit on modern hardware,
but we still have a bunch of use cases which rely on getting this
right.
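The robust pattern is the one the commit message below describes: once
drm_sched_job_arm() has been called, set the error on the finished
fence and push the job to the entity anyway, instead of trying to
retract it. Roughly like this; just an untested sketch, where struct
my_job, my_prepare() and my_hw_submit() are made-up placeholders for
driver-private code, and the drm_sched_job_init() signature assumed
here is the recent one taking a credits parameter:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

struct my_job {
	struct drm_sched_job base;
	/* driver-private state */
};

/* Made-up driver-private helpers, declarations only. */
static int my_prepare(struct my_job *job);
static int my_hw_submit(struct my_job *job);

static int my_submit(struct my_job *job, struct drm_sched_entity *entity)
{
	int ret;

	ret = drm_sched_job_init(&job->base, entity, 1, job);
	if (ret)
		return ret;

	ret = my_prepare(job);
	if (ret) {
		/* Not armed yet, so plain unwinding is still allowed. */
		drm_sched_job_cleanup(&job->base);
		return ret;
	}

	drm_sched_job_arm(&job->base);

	/*
	 * Point of no return: the job must not be retracted from here
	 * on. On error, flag the finished fence instead ...
	 */
	if (my_hw_submit(job))
		dma_fence_set_error(&job->base.s_fence->finished,
				    -ECANCELED);

	/*
	 * ... and push the job anyway, so the scheduler signals all
	 * fences in the order the dma_fence framework enforces.
	 */
	drm_sched_entity_push_job(&job->base);
	return 0;
}

That way a canceled C can never signal ahead of a still-running A;
consumers only ever see the error on its fence.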
Regards,
Christian.

> Regards,
>
> Tvrtko
>
>>> Sorry for the delayed response, I only stumbled over this now while
>>> going over old mails and then re-thinking my reviewed-by for this
>>> change.
>>
>> Your RB hadn't even been applied (I merged before you gave it), so
>> you can remove this first paragraph from the commit message.
>>
>>>
>>> The function drm_sched_job_arm() is indeed the point of no return.
>>> The background is that it is nearly impossible for the driver to
>>> correctly retract the fence and signal it in the order enforced by
>>> the dma_fence framework.
>>>
>>> The code in drm_sched_job_cleanup() is there to clean up after the
>>> job was armed through drm_sched_job_arm() *and* processed by the
>>> scheduler.
>>>
>>> The correct approach for error handling in this situation is to set
>>> the error on the fences and then push the job to the entity anyway.
>>> We can certainly improve the documentation, but removing the warning
>>> is clearly not a good idea.
>>
>> This last paragraph, as per our discussion, seems invalid. We
>> shouldn't have that in the commit log, so that it won't give later
>> hackers browsing it wrong ideas and we don't end up with someone
>> actually meddling with those fences.
>>
>> Thx
>> P.
>>
>>>
>>> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>>  1 file changed, 5 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 53e6aec37b46..4d4219fbe49d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>   * Cleans up the resources allocated with drm_sched_job_init().
>>>   *
>>>   * Drivers should call this from their error unwind code if @job is aborted
>>> - * before it was submitted to an entity with drm_sched_entity_push_job().
>>> + * before drm_sched_job_arm() is called.
>>>   *
>>> - * Since calling drm_sched_job_arm() causes the job's fences to be initialized,
>>> - * it is up to the driver to ensure that fences that were exposed to external
>>> - * parties get signaled. drm_sched_job_cleanup() does not ensure this.
>>> - *
>>> - * This function must also be called in &struct drm_sched_backend_ops.free_job
>>> + * After that point of no return @job is committed to be executed by the
>>> + * scheduler, and this function should be called from the
>>> + * &drm_sched_backend_ops.free_job callback.
>>>   */
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>  {
>>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>  	if (kref_read(&job->s_fence->finished.refcount)) {
>>>  		/* drm_sched_job_arm() has been called */
>>>  		dma_fence_put(&job->s_fence->finished);
>>>  	} else {
>>> -		/* aborted job before arming */
>>> +		/* aborted job before committing to run it */
>>>  		drm_sched_fence_free(job->s_fence);
>>>  	}
>>
>
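P.S.: For completeness, the free_job side that the restored
documentation points at; again only a sketch, reusing the made-up
struct my_job from above:

#include <linux/slab.h>

/*
 * Hooked up as &drm_sched_backend_ops.free_job; runs after the
 * scheduler is done with the armed job, which is the only point where
 * drm_sched_job_cleanup() is legal for an armed job.
 */
static void my_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	drm_sched_job_cleanup(sched_job);
	kfree(job);
}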