RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

"Liu, Monk" <Monk.Liu@xxxxxxx> · Thu, 9 May 2019 12:49:46 +0000

Christian

I believe even yourself would agree that keep reporting TMO for the same IB is ugly (need put a "gpu_recovery=0" as option ), I can also argue with you that this is a bad design ...
Besides you didn't on technique persuade me to believe there will be  bad things upcoming with my patch, you even didn't give me a sequence to prove my patch is wrong.

Since you are the maintainer I would have no choice but keep my patch the customer branch given the case you won't review it, it's only used for AMDGPU virtualization customer  anyway

For the design of that feature, your feedback is welcomed:

Backgrounds:
Customer want sw and hw related information (pretty much infos we can collect by UMR) automatically dumped upon Any IB hang.

Design:
- there would be an APP running in background as a daemon utility
- in the beginning of this daemon it would write "register" to a debugfs file like : /sys/kernel/debug/dri/0/amdgpu_dump_register
- kernel driver would remember the PID/task of this daemon upon it write "register" to that debugfs file
- since GIM driver would also detect GPU hang (which lead to a VF FLR), kernel driver should interacted with GIM to block GIM's FLR until the DUMP finished by that daemon.
- upon an IB hang, job_timeout() would kick off and inside it KMD would send an signal "USR1" to the registered task,
- upon "USR1" signal received by the daemon app, it would call "UMR" to dump whatever interested for the offline debugging 

With current design if I pass "gpu_recover=0" to KMD there will be infinite job_timeout() triggered for one IB, thus the auto DUMP would also be crazy about its duplication 

/Monk

-----Original Message-----
From: Koenig, Christian 
Sent: Thursday, May 9, 2019 7:45 PM
To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

Well putting the other drivers aside for moment, I would also say that it is bad design to not restart the timeout. So even then I would not review that patch.

Question is rather what are you actually trying to do and why don't you want to change your design?

Designs should be discussed on the public mailing list and not done internally.

Regards,
Christian.

Am 09.05.19 um 13:40 schrieb Liu, Monk:
>>> And not only our driver is relying on that but also the ARM drivers. See the history of that change.
> That 's the issue I didn't see earlier, if ARM drivers doesn't restart 
> timer in their job_timeout() it is a problem
>
> But I don't want to change my plan on that feature, so can you give more explain on the aspect except this "ARM" driver ?
> Because I cannot even implement it my way in the customer branch if  
> except ARM reason there are other bad side effects
>
> /Monk
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> Sent: Thursday, May 9, 2019 7:09 PM
> To: Liu, Monk <Monk.Liu@xxxxxxx>; Koenig, Christian 
> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for one 
> IB
>
> [CAUTION: External Email]
>
> Hi Monk,
>
> the timeout handler might only kill waves until we see some progress again and then continue.
>
> E.g. we don't necessary kill the whole job, but maybe just one drawing/computing operations. Or the same job is submitted another time.
> etc...
>
> As long as there is a job running on the hardware there should also be a timer running.
>
> And not only our driver is relying on that but also the ARM drivers. See the history of that change.
>
> When we don't want the timeout to fire again for this job we should probably remove the job from the ring mirror list and destroy it.
>
> Christian.
>
> Am 09.05.19 um 12:56 schrieb Liu, Monk:
>> Christian,
>> Why this is intentional that the same job keep reporting timeout ?
>> And why we cannot change that ?
>>
>> That's looks just a bug
>>
>> /Monk
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of 
>> Koenig, Christian
>> Sent: Thursday, May 9, 2019 6:54 PM
>> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for 
>> one IB
>>
>> [CAUTION: External Email]
>>
>> Hi Monk,
>>
>> ok you don't seem to understand: It is intentional that the same job times out multiple times! So we can't really change anything here.
>>
>> What we can do is instead of sending a signal (which is not a good idea from the timeout handler anyway) we can start a background script to do the dumping.
>>
>> Now when the next timeout happens we double check if the background script is still running and if yes, simply ignore the timeout and wait for the next one.
>>
>> Christian.
>>
>> Am 09.05.19 um 12:49 schrieb Liu, Monk:
>>> Hah ... I see, but my requirement cannot be satisfied with current design:
>>>
>>> What I need to do is put a signal arming in job_timeout() to notify a USER SPACE daemon app , which finally leverage "UMR" to DUMP/retrieve sw/hw information related with the TMO/hang as much as possible .  To make it straight forward I would signal USR1 to the registered app (the daemon) every time a TMO happens,  and this would introduce lot of unnecessary DUMP if this "bug" cannot resolved.
>>>
>>> I think keep reporting TMO message for one IB is a "bug" to me even it is not for my above feature...
>>>
>>> To address your concern, what about this one:
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 1397942..31d99e9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -28,7 +28,7 @@
>>>     #include "amdgpu.h"
>>>     #include "amdgpu_trace.h"
>>>
>>> -static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>> +static int amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>     {
>>>            struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>            struct amdgpu_job *job = to_amdgpu_job(s_job); @@ -39,7
>>> +39,7 @@ static void amdgpu_job_timedout(struct drm_sched_job 
>>> +*s_job)
>>>            if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>>>                    DRM_ERROR("ring %s timeout, but soft recovered\n",
>>>                              s_job->sched->name);
>>> -               return;
>>> +               return -ENODEV;
>>>            }
>>>
>>>            amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); @@
>>> -51,6 +51,8 @@ static void amdgpu_job_timedout(struct drm_sched_job
>>> *s_job)
>>>
>>>            if (amdgpu_device_should_recover_gpu(ring->adev))
>>>                    amdgpu_device_gpu_recover(ring->adev, job);
>>> +
>>> +       return 0;
>>>     }
>>>
>>>     int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned 
>>> num_ibs, diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index c1aaf85..b93deb1 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -308,18 +308,23 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>     {
>>>            struct drm_gpu_scheduler *sched;
>>>            struct drm_sched_job *job;
>>> -       unsigned long flags;
>>> +       int ret;
>>>
>>>            sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>            job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>                                           struct drm_sched_job, 
>>> node);
>>>
>>> -       if (job)
>>> -               job->sched->ops->timedout_job(job);
>>> +       if (job) {
>>> +               ret = job->sched->ops->timedout_job(job);
>>> +               if (ret) {
>>> +                       unsigned long flags;
>>>
>>> -       spin_lock_irqsave(&sched->job_list_lock, flags);
>>> -       drm_sched_start_timeout(sched);
>>> -       spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>> +                       /* means recovered, restart timer now */
>>> +                       spin_lock_irqsave(&sched->job_list_lock, flags);
>>> +                       drm_sched_start_timeout(sched);
>>> +                       spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>> +               }
>>> +       }
>>>     }
>>>
>>>      /**
>>> diff --git a/include/drm/gpu_scheduler.h 
>>> b/include/drm/gpu_scheduler.h index 9c2a957..c3884c3 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> @@ -229,7 +229,7 @@ struct drm_sched_backend_ops {
>>>              * @timedout_job: Called when a job has taken too long to execute,
>>>              * to trigger GPU recovery.
>>>             */
>>> -       void (*timedout_job)(struct drm_sched_job *sched_job);
>>> +       int (*timedout_job)(struct drm_sched_job *sched_job);
>>>
>>>            /**
>>>              * @free_job: Called once the job's finished fence has 
>>> been signaled
>>> (END)
>>>
>>>
>>> Thanks
>>> /Monk
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>> Sent: Thursday, May 9, 2019 6:30 PM
>>> To: Liu, Monk <Monk.Liu@xxxxxxx>; Koenig, Christian 
>>> <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for 
>>> one IB
>>>
>>> [CAUTION: External Email]
>>>
>>> drm_sched_start() is not necessary called from the timeout handler.
>>>
>>> If a soft recovery is sufficient, we just continue without a complete reset.
>>>
>>> Christian.
>>>
>>> Am 09.05.19 um 12:25 schrieb Liu, Monk:
>>>> Christian
>>>>
>>>> Check "drm_sched_start" which is invoked from gpu_recover() , there 
>>>> is a "drm_sched_start_timeout()" in the tail
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>>> Sent: Thursday, May 9, 2019 3:18 PM
>>>> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for 
>>>> one IB
>>>>
>>>> [CAUTION: External Email]
>>>>
>>>> Am 09.05.19 um 06:31 schrieb Monk Liu:
>>>>> we don't need duplicated IB's timeout error message reported 
>>>>> endlessly, just one report per timedout IB is enough
>>>> Well, NAK. We don't need multiple timeout reports, but we really need to restart the timeout counter after handling it.
>>>>
>>>> Otherwise we will never run into a timeout again after handling one and it isn't unlikely that multiple IBs in a row doesn't work correctly.
>>>>
>>>> Christian.
>>>>
>>>>> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
>>>>> ---
>>>>>       drivers/gpu/drm/scheduler/sched_main.c | 5 -----
>>>>>       1 file changed, 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index c1aaf85..d6c17f1 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -308,7 +308,6 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>       {
>>>>>           struct drm_gpu_scheduler *sched;
>>>>>           struct drm_sched_job *job;
>>>>> -     unsigned long flags;
>>>>>
>>>>>           sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>           job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>> @@ -316,10 +315,6 @@ static void drm_sched_job_timedout(struct 
>>>>> work_struct *work)
>>>>>
>>>>>           if (job)
>>>>>                   job->sched->ops->timedout_job(job);
>>>>> -
>>>>> -     spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>> -     drm_sched_start_timeout(sched);
>>>>> -     spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>>       }
>>>>>
>>>>>        /**
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx