RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

"Liu, Monk" <Monk.Liu@xxxxxxx> · Fri, 10 May 2019 04:08:28 +0000

Hi Christian 

>> Well exactly that's what I disagree on. A second timeout on the same job is perfectly possible and desired, we just don't need to necessarily report it once more
It is not a second timeout on the same job...if you disable gpu_recovery it is endless report on the same job ... you sure you need that ?

>> But the general problem here is that you are quite often coming up with individual patches ripped out of the context without explaining the general intention.
I don't like explain the work style here but looks it is needed as you complained that:
1) if I open the discussion in public and explain the general intention I'm afraid the schedule will not be satisfied,  a customer feature/requirement is not like our daily maintain that you can  discuss before you action, usually customer only want a prototype (although it may not perfect) first, and there are more time left for us to improve it later.

2) this endless reporting on the same job looks a general flaw to me (well I won't call it bug since from gpu-scheduler perspective it is not required that all vendor's callback of job_timout() will rearm the TIMER, so I agree that restart the TIMER in scheduler is the only option *if* don't change other vendor's code), the better design should something like optionally do the rearm in scheduler part according to the return value from job_timeout, ( 0 means need scheduler do rearm, other means not need)

>>> That is a rather bad idea because it doesn't handle killed daemon processes correctly. A better approach is let the daemon open up the virtual file and keep the connection open until the process exits.

UMD team has an existence daemon, their prototype is already made on signal mechanism, I prefer compatible with their way first  and if I cannot fix the issue you mentioned then I can switch to another way later. But I don't see it’s a problem for a prototype rightnow.

>>> But that's not because of the scheduler but rather because your design is missing another important piece: How does the daemon process signals the kernel driver that it is done with dumping the state? E.g. How do you signal the GIM that it can do a FLR again somehow?

When the daemon finishes its dumping it can simply write something to a debugfs and before that kernel driver will keep blocking the recover routine, as well as the FLR on GIM side...
Once receive the debugfs from that daemon, kernel driver can release GIM to do FLR, and continue its gpu recovery routine 

With a second thought, I think I don't need to fix that endless report actually, I can insert above logic in the gpu_recover() routine, before pre_asic_reset(), thus it only dump once per hang.
And if "gpu_recover=0" no dump will do so that endless reporting won't bother me ...

BTW:
We can check "sched_job.id" to judge if the upcoming timeout reporting is a repeated one and skip the waring string if it is repeated, 
I hat my DMESG all flushed by those identical warning all my screen ....

/Monk

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> 
Sent: Thursday, May 9, 2019 9:26 PM
To: Liu, Monk <Monk.Liu@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

[CAUTION: External Email]

Am 09.05.19 um 14:49 schrieb Liu, Monk:
> Christian
>
> I believe even yourself would agree that keep reporting TMO for the same IB is ugly (need put a "gpu_recovery=0" as option ), I can also argue with you that this is a bad design ...

Well exactly that's what I disagree on. A second timeout on the same job is perfectly possible and desired, we just don't need to necessarily report it once more.

> Besides you didn't on technique persuade me to believe there will be  bad things upcoming with my patch, you even didn't give me a sequence to prove my patch is wrong.
>
> Since you are the maintainer I would have no choice but keep my patch 
> the customer branch given the case you won't review it, it's only used 
> for AMDGPU virtualization customer  anyway

Well, I didn't say your patch is wrong. A NAK doesn't mean general rejection, but that you should start from the beginning and reiterate what you are trying to archive here. I mean what does a computer do when it receives the NAK? It retries the transmission.

But the general problem here is that you are quite often coming up with individual patches ripped out of the context without explaining the general intention.

> For the design of that feature, your feedback is welcomed:
>
> Backgrounds:
> Customer want sw and hw related information (pretty much infos we can collect by UMR) automatically dumped upon Any IB hang.

Yeah, in general that is a really good idea and discussed multiple times now.

> Design:
> - there would be an APP running in background as a daemon utility
> - in the beginning of this daemon it would write "register" to a 
> debugfs file like : /sys/kernel/debug/dri/0/amdgpu_dump_register
> - kernel driver would remember the PID/task of this daemon upon it 
> write "register" to that debugfs file

That is a rather bad idea because it doesn't handle killed daemon processes correctly. A better approach is let the daemon open up the virtual file and keep the connection open until the process exits.

> - since GIM driver would also detect GPU hang (which lead to a VF FLR), kernel driver should interacted with GIM to block GIM's FLR until the DUMP finished by that daemon.
> - upon an IB hang, job_timeout() would kick off and inside it KMD 
> would send an signal "USR1" to the registered task,
> - upon "USR1" signal received by the daemon app, it would call "UMR" 
> to dump whatever interested for the offline debugging

Signals are a bad idea for kernel->userspace communication, better use a pipe where data becomes available on a GPU reset.

> With current design if I pass "gpu_recover=0" to KMD there will be 
> infinite job_timeout() triggered for one IB, thus the auto DUMP would 
> also be crazy about its duplication

But that's not because of the scheduler but rather because your design is missing another important piece: How does the daemon process signals the kernel driver that it is done with dumping the state? E.g. How do you signal the GIM that it can do a FLR again somehow?

What you can do is to either block the timeout handler until the userspace process signals that it is done with dumping, or if that's not possible because of locks (etc) you could use the
drm_sched_suspend_timeout() and drm_sched_resume_timeout() helpers which allow us to disable the scheduler timeout for a moment.

Regards,
Christian.

>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Thursday, May 9, 2019 7:45 PM
> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/sched: fix the duplicated TMO message for one 
> IB
>
> Well putting the other drivers aside for moment, I would also say that it is bad design to not restart the timeout. So even then I would not review that patch.
>
> Question is rather what are you actually trying to do and why don't you want to change your design?
>
> Designs should be discussed on the public mailing list and not done internally.
>
> Regards,
> Christian.
>

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx