RE: [PATCH] drm/sched: fix the bug of time out calculation(v3)

"Liu, Monk" <Monk.Liu@xxxxxxx> · Fri, 27 Aug 2021 07:46:54 +0000

[AMD Official Use Only]

Yeah, that "kthread_should_park" is also irrelevant looks to me as well and it delays the signaled job's cleanup/free

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> 
Sent: Friday, August 27, 2021 2:12 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)

I don't think that this will be necessary nor desired.

See the job should be cleaned up as soon as possible after it is finished or otherwise we won't cancel the timeout quick enough either.

Christian.

Am 26.08.21 um 22:14 schrieb Andrey Grodzovsky:
> Attached quick patch for per job TTL calculation to make more precises 
> next timer expiration. It's on top of the patch in this thread. Let me 
> know if this makes sense.
>
> Andrey
>
> On 2021-08-26 10:03 a.m., Andrey Grodzovsky wrote:
>>
>> On 2021-08-26 12:55 a.m., Monk Liu wrote:
>>> issue:
>>> in cleanup_job the cancle_delayed_work will cancel a TO timer even 
>>> the its corresponding job is still running.
>>>
>>> fix:
>>> do not cancel the timer in cleanup_job, instead do the cancelling 
>>> only when the heading job is signaled, and if there is a "next" job 
>>> we start_timeout again.
>>>
>>> v2:
>>> further cleanup the logic, and do the TDR timer cancelling if the 
>>> signaled job is the last one in its scheduler.
>>>
>>> v3:
>>> change the issue description
>>> remove the cancel_delayed_work in the begining of the cleanup_job 
>>> recover the implement of drm_sched_job_begin.
>>>
>>> TODO:
>>> 1)introduce pause/resume scheduler in job_timeout to serial the 
>>> handling of scheduler and job_timeout.
>>> 2)drop the bad job's del and insert in scheduler due to above 
>>> serialization (no race issue anymore with the serialization)
>>>
>>> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 25 
>>> ++++++++++---------------
>>>   1 file changed, 10 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index a2a9536..ecf8140 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct 
>>> drm_gpu_scheduler *sched)
>>>   {
>>>       struct drm_sched_job *job, *next;
>>>   -    /*
>>> -     * Don't destroy jobs while the timeout worker is running OR 
>>> thread
>>> -     * is being parked and hence assumed to not touch pending_list
>>> -     */
>>> -    if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>> -        !cancel_delayed_work(&sched->work_tdr)) ||
>>> -        kthread_should_park())
>>> +    if (kthread_should_park())
>>>           return NULL;
>>
>>
>> I actually don't see why we need to keep the above, on the other side 
>> (in drm_sched_stop) we won't touch the pending list anyway until 
>> sched thread came to full stop (kthread_park). If you do see a reason 
>> why this needed then a comment should be here i think.
>>
>> Andrey
>>
>>
>>> spin_lock(&sched->job_list_lock);
>>> @@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct 
>>> drm_gpu_scheduler *sched)
>>>       if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>>           /* remove job from pending_list */
>>>           list_del_init(&job->list);
>>> +
>>> +        /* cancel this job's TO timer */
>>> +        cancel_delayed_work(&sched->work_tdr);
>>>           /* make the scheduled timestamp more accurate */
>>>           next = list_first_entry_or_null(&sched->pending_list,
>>>                           typeof(*next), list);
>>> -        if (next)
>>> +
>>> +        if (next) {
>>>               next->s_fence->scheduled.timestamp =
>>>                   job->s_fence->finished.timestamp;
>>> -
>>> +            /* start TO timer for next job */
>>> +            drm_sched_start_timeout(sched);
>>> +        }
>>>       } else {
>>>           job = NULL;
>>> -        /* queue timeout for next job */
>>> -        drm_sched_start_timeout(sched);
>>>       }
>>>         spin_unlock(&sched->job_list_lock);
>>> @@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
>>>                         (entity = drm_sched_select_entity(sched))) 
>>> ||
>>>                        kthread_should_stop());
>>>   -        if (cleanup_job) {
>>> +        if (cleanup_job)
>>>               sched->ops->free_job(cleanup_job);
>>> -            /* queue timeout for next job */
>>> -            drm_sched_start_timeout(sched);
>>> -        }
>>>             if (!entity)
>>>               continue;