> You still avoid my question: what's the theoretical backing for why
> you think checking once instead of twice (or even more) is good
> **before** hw_job_reset()?

Because stopping all the scheduler threads takes a moment and it is
entirely possible that the job finishes within that time.

> 1) if you check the fence and find it not signaled, then you will
> call hw_job_reset(), but there is still a chance that between your
> check and the hw_job_reset() the sched fence could signal, isn't it?
> You still cannot avoid such a race condition.

Crap, you're right. We would indeed need to check twice and that
wouldn't be consistent anyway. Once after stopping the schedulers and
before hw_job_reset(), because then we can just start the schedulers
again and continue as if nothing has happened. And another time after
calling hw_job_reset() if we want to set the error code.

> Don't forget your approach still has a chance to hit the race
> condition, and to me the race condition doesn't matter, which is why
> I don't even consider it.

Yeah, you convinced me. Please go ahead with the current approach, but
at least add a comment that we might want to improve that.

Regards,
Christian.

Am 10.05.2017 um 13:02 schrieb Liu, Monk:
>
> > Checking a second time is pointless since it can't signal any more
> > after calling amd_sched_hw_job_reset().
>
> [ML] You seem not to respond to me... I of course know the fence
> cannot signal after hw_job_reset()...
>
> My question is: before you call hw_job_reset(), why do you want to
> check the fence? Why not check twice?
>
> You still avoid my question: what's the theoretical backing for why
> you think checking once instead of twice (or even more) is good
> **before** hw_job_reset()?
>
> > No, the timeout is pretty meaningless. It's just the trigger that
> > we need to do something.
>
> 2) And even if it signaled after entering gpu_reset(), it will
> automatically be done like normal cases, and that's good.
> Why remove those callbacks instead?
>
> > No, that's absolutely not good. We don't know if it's the hardware
> > which results in the job being signaled or our reset code.
>
> [ML] You are wrong. For the SR-IOV case the timeout is all that
> matters, because one VF can only have its time slice within the
> timeout, and I'm doing the TDR on SR-IOV, so don't always look at
> things with a bare-metal mind.
>
> > Otherwise we have a race condition here where we can't determine if
> > the reset finished the job or if it did just on its own while we
> > stopped the scheduler.
>
> [ML] You are wrong:
>
> 1) If you check the fence and find it not signaled, then you will
> call hw_job_reset(), but there is still a chance that between your
> check and the hw_job_reset() the sched fence could signal, isn't it?
> You still cannot avoid such a race condition.
>
> 2) I don't care if the fence is signaled due to its own finish or
> because we force the hw fence signaled; for the SR-IOV case, as long
> as the job exceeds the timeout, we consider it hung.
>
> 3) Even for the bare-metal case, you still cannot be sure if the
> fence signaled on its own or by hw_fence force signal; the reason is
> in #1.
>
> > > No, you insist on vague rules, not strict ones. Like I said, what
> > > is the theory backing your approach of checking the job in
> > > question only once? Why not check again if it is not signaled?
>
> > Because we have removed the connection between the job and the
> > hardware fence, and because of this the job can never signal.
>
> [ML] Like I said, before you call hw_job_reset() you only check the
> job once; why not check again and again? I don't see that you have a
> reason to check only once, and if you don't have such a reason I
> think you should not check at all. If you do have such a reason,
> prove to me that checking only once is good and enough.
> Don't forget your approach still has a chance to hit the race
> condition, and to me the race condition doesn't matter, which is why
> I don't even consider it.
>
> BR Monk
>
> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
> Behalf Of *Christian König
> *Sent:* Wednesday, May 10, 2017 6:26 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> Am 10.05.2017 um 12:05 schrieb Liu, Monk:
>
> > [ML] Yes, but we cannot guarantee the job is 100% really hung when
> > entering gpu_reset(); we can only trust our amdgpu_job_timeout as a
> > deadline for each job.
> >
> > Your approach of checking the fence before charging it as
> > guilty/hung looks incorrect to me, because why do you not check it
> > twice, three times, or even more?
>
> Because the job can't signal any more after calling
> amd_sched_hw_job_reset().
>
> > [ML] No... that's where I think your approach is vague:
> >
> > 1) You check after the scheduler is stopped to see if the job
> > signaled. My question is: if the job is not signaled (the most
> > usual case), why do you not check it again and again? Maybe the
> > second time you will find it signaled...
>
> Checking a second time is pointless since it can't signal any more
> after calling amd_sched_hw_job_reset().
>
> > My point is that the checking here is meaningless; we already have
> > the timeout as the guard.
>
> No, the timeout is pretty meaningless. It's just the trigger that we
> need to do something.
>
> But to determine what to do we first need to stop the scheduler,
> remove the hardware fence and THEN check the current status.
>
> Otherwise we have a race condition here where we can't determine if
> the reset finished the job or if it did just on its own while we
> stopped the scheduler.
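The ordering argument above can be modeled in a few lines: once the job
is disconnected from its hardware fence, its completion state is
frozen, so a single check after the disconnect is deterministic, while
a check before it races with the hardware. This is only an illustrative
sketch; `struct job_model` and `hw_job_reset_model()` are made-up
stand-ins, not the actual amdgpu code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a scheduler job; not the real amdgpu struct. */
struct job_model {
	atomic_bool hw_fence_signaled; /* flipped asynchronously by the "hardware" */
	bool connected;                /* hw-fence callback still installed? */
	bool completed;                /* completion state latched at disconnect */
};

/* Rough model of amd_sched_hw_job_reset(): drop the callback and latch
 * whatever state the hw fence had at that instant.  After this point
 * the job can never signal, so checking `completed` once is race-free;
 * checking *before* this point is not, because the hardware may signal
 * in the window between the check and the reset. */
void hw_job_reset_model(struct job_model *j)
{
	j->connected = false;
	j->completed = atomic_load(&j->hw_fence_signaled);
}
```

The same reasoning is why Christian concedes above that a check done
before hw_job_reset() would have to be repeated after it anyway.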
> > 2) And even if it signaled after entering gpu_reset(), it will
> > automatically be done like normal cases, and that's good. Why
> > remove those callbacks instead?
>
> No, that's absolutely not good. We don't know if it's the hardware
> which results in the job being signaled or our reset code.
>
> > So I refuse to check if @job just signaled in gpu_reset(), because
> > this action is vague (and no one can guarantee the job won't signal
> > during gpu_reset(); we should not argue about this event...). I
> > prefer clean and strict rules.
>
> Yeah, completely agree that we need to have strict rules for that.
> That's why I insist on doing this :)
>
> > No, you insist on vague rules, not strict ones. Like I said, what
> > is the theory backing your approach of checking the job in question
> > only once? Why not check again if it is not signaled?
>
> Because we have removed the connection between the job and the
> hardware fence, and because of this the job can never signal.
>
> Regards,
> Christian.
>
> > I don't agree that this approach is clean and strict. You are
> > abusing the timeout parameter.
>
> See, I just want to avoid problems for the case where the job signals
> while we stop the scheduler (because stopping the scheduler actually
> can take a moment).
>
> Because when this happens the scheduler could already have pushed the
> next job to the hardware, and then we abort it with the GPU reset and
> might create more problems than we solve.
>
> > [ML] I don't see how my approach has a chance to signal the fence
> > twice... On the contrary, I think my approach is clearer: no matter
> > whether the job in question finally signaled or not, I just kick it
> > out of the mirror list.
> >
> > Without removing the callback from the hw fence, even if it really
> > signaled during the gpu_reset() period the logic is still perfect
> > and its sched fence will act as usual...
>
> We want to set an error code on the job before signaling it, don't
> we? So we need to be sure how and when the job is signaled as
> finished.
>
> I mean letting it signal when we force the hardware fence to complete
> will work as well, but I still think that this isn't as clean as
> signaling it manually.
>
> Please also see the helper function the Intel guys introduced,
> drm_fence_set_error(); we will run into a BUG_ON if we can't
> guarantee the order of execution here.
>
> Regards,
> Christian.
>
> Am 10.05.2017 um 06:00 schrieb Liu, Monk:
>
> > Christian,
> >
> > Looks like we need to discuss it more...
> >
> > Here is your approach:
> >
> > 1. Stop the scheduler from feeding more jobs to the hardware when a
> > job completes. // this is where I agree with you
> >
> > 2. Then call hw_job_reset to remove the connection between job and
> > hardware fence.
> >
> > 3. Test if the job is now completed. It is entirely possible that
> > the job completed while we started the work to reset the hardware.
> >
> > Removing the connection between the hardware fence and the job is
> > the deadline; if the job completes after that, it is lost.
> >
> > 4. Check the karma and kick out the job if we found it guilty.
> >
> > 5. Get the whole stuff working again, e.g. reset the hardware,
> > restart the scheduler etc...
> >
> > */[ML]: One thing I agree to change in your way: in gpu_reset() we
> > should first stop the scheduler of the ring in question (not all of
> > them) before kicking out the guilty job./*
>
> Indeed, but I still think that this is a bad approach because we then
> reset the hardware without a good reason.
>
> > [ML] Yes, but we cannot guarantee the job is 100% really hung when
> > entering gpu_reset(); we can only trust our amdgpu_job_timeout as a
> > deadline for each job.
> >
> > Your approach of checking the fence before charging it as
> > guilty/hung looks incorrect to me, because why do you not check it
> > twice, three times, or even more?
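The five-step order quoted above boils down to a single decision once
the job state has been latched. A minimal sketch, with hypothetical
names standing in for the real driver entry points:

```c
#include <stdbool.h>

enum verdict { JOB_FINISHED, JOB_KICKED_OUT, JOB_RESUBMITTED };

/* signaled: had the hw fence already signaled when step 2 latched the
 * state?  guilty: did the karma counter exceed the hang limit (step 4)? */
enum verdict recovery_outcome(bool signaled, bool guilty)
{
	/* step 1: park the scheduler thread(s) -- elided here */
	/* step 2: hw_job_reset() disconnects the job from its hw fence */
	if (signaled)
		return JOB_FINISHED;    /* step 3: completed during the stop */
	if (guilty)
		return JOB_KICKED_OUT;  /* step 4: drop it from the mirror list */
	return JOB_RESUBMITTED;         /* step 5: reset hw, recover, restart */
}
```

Only the middle branch throws work away, which is why the thread argues
so much about where exactly the completion check sits.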
> > You check it one time and you find it just signaled; that's great
> > and lucky (really lucky...). But what if it didn't signal (the most
> > usual case)? Why not check it again and again? Do you have a theory
> > to support how many times you need to check before finally
> > considering it hung? No, I don't think you have one, so please just
> > cut this unnecessary checking; we already use amdgpu_job_timeout to
> > give the deadline of each job.
> >
> > So I refuse to check if @job just signaled in gpu_reset(), because
> > this action is vague (and no one can guarantee the job won't signal
> > during gpu_reset(); we should not argue about this event...). I
> > prefer clean and strict rules.
> >
> > > Force completion is not so much the issue, but rather in which
> > > order you do things.
> > >
> > > See, the original code first stops the scheduler and removes the
> > > connection between hardware fence and job in an atomic manner,
> > > and THEN forces the hardware fence to complete.
> > >
> > > This way we could be sure that nothing happened in parallel, e.g.
> > > that we don't try to signal the fence twice or something like
> > > that.
> >
> > [ML] I don't see how my approach has a chance to signal the fence
> > twice... On the contrary, I think my approach is clearer: no matter
> > whether the job in question finally signaled or not, I just kick it
> > out of the mirror list.
> >
> > Without removing the callback from the hw fence, even if it really
> > signaled during the gpu_reset() period the logic is still perfect
> > and its sched fence will act as usual...
> >
> > Please point out where or how my approach will go wrong, like "e.g.
> > that we don't try to signal the fence twice or something like
> > that"; otherwise I cannot be persuaded to fix my way...
> >
> > At last, I ran the TDR test and it ended up with a hypervisor-side
> > error; the guest side is all perfect. Here is what I ran:
> >
> > The vk_example and Vulkan CTS tests under the MCBP case. We get all
> > kinds of hangs on the compute ring; without TDR this test won't
> > survive for 5 seconds, but with TDR, although MCBP is buggy now, we
> > can finish this test (of course the test results mismatch due to
> > the MCBP issue), and there are tons of job_timed_out messages in
> > dmesg, but the guest driver didn't report any error. I was also
> > surprised how stable this behaves...
> >
> > The second test used "watch" to trigger a GPU hang (bad command
> > stream) every 2 seconds, with amdgpu_job_timeout also set to 2
> > seconds. I can run it until the hypervisor side hits a VF FLR
> > error, and the guest side still runs perfectly; nothing wrong
> > happened. From the hw fence sequence number I can tell that 3000
> > loops of TDR finished before the hypervisor hit the error.
> >
> > BR Monk
> >
> > *From:* Koenig, Christian
> > *Sent:* Tuesday, May 09, 2017 8:52 PM
> > *To:* Liu, Monk <Monk.Liu at amd.com>; Christian König
> > <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> > *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job
> > TDR (V2)
> >
> > > [ML] If the job completes, the job's sched fence callback will
> > > take this spin_lock and remove itself from the mirror_list, so we
> > > are still safe to call amd_sched_job_kickout(), and it will do
> > > nothing in that case.
> >
> > Indeed, but I still think that this is a bad approach because we
> > then reset the hardware without a good reason.
> >
> > > Besides, the original logic also force-completes the hw fence,
> > > and it works well...
> >
> > Force completion is not so much the issue, but rather in which
> > order you do things.
> >
> > See, the original code first stops the scheduler and removes the
> > connection between hardware fence and job in an atomic manner, and
> > THEN forces the hardware fence to complete.
> >
> > This way we could be sure that nothing happened in parallel, e.g.
> > that we don't try to signal the fence twice or something like that.
> >
> > > A statement like "You are missing that it is entirely possible
> > > that the job will complete while we are trying to kick it out."
> >
> > Sorry, I should have been more clear.
> >
> > > is not a good reason to reject my approach, because that is okay
> > > if the job just completed...
> >
> > We usually try to take a defensive approach, so stopping
> > everything, removing the hardware fence connection and then
> > explicitly kicking out the job in question sounds much better than
> > doing it implicitly with the hardware fence completion.
> >
> > Even when this works (which I'm still not sure of), it is a really
> > awkward and hard-to-understand approach.
> >
> > Regards,
> > Christian.
> >
> > Am 09.05.2017 um 13:58 schrieb Liu, Monk:
> >
> > > > You are missing that it is entirely possible that the job will
> > > > complete while we are trying to kick it out.
> > >
> > > [ML] If the job completes, the job's sched fence callback will
> > > take this spin_lock and remove itself from the mirror_list, so we
> > > are still safe to call amd_sched_job_kickout(), and it will do
> > > nothing in that case.
> > >
> > > Please go through the whole steps again.
> > >
> > > Besides, the original logic also force-completes the hw fence,
> > > and it works well...
> > >
> > > I don't see a solid reason why you insist on your approach;
> > > please go through the steps again and give me details about where
> > > it is incorrect so I can fix it.
> > >
> > > A statement like "You are missing that it is entirely possible
> > > that the job will complete while we are trying to kick it out."
> > > is not a good reason to reject my approach, because that is okay
> > > if the job just completed...
> > >
> > > BR Monk
> > >
> > > *From:* Koenig, Christian
> > > *Sent:* Tuesday, May 09, 2017 3:49 PM
> > > *To:* Liu, Monk <Monk.Liu at amd.com>; Christian König
> > > <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> > > *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job
> > > TDR (V2)
> > >
> > > > [ML] Really not necessary; we have a spin_lock to protect the
> > > > mirror list, nothing will be messed up ...
> > >
> > > You are missing that it is entirely possible that the job will
> > > complete while we are trying to kick it out.
> > >
> > > > [ML] Why not touch the hardware fence at all? The
> > > > original/bare-metal GPU reset also signals all rings' hardware
> > > > fences first; I just follow the original logic. The scheduler
> > > > fence will be auto-signaled after the hw fence signals. Any
> > > > problem with that? What's the concern?
> > >
> > > The hardware runs asynchronously to the CPU which tries to reset
> > > it, so we need to be careful in which order things are done.
> > >
> > > > [ML] No, I don't think so. The kickout must come before
> > > > hw_job_reset, otherwise the scheduler fence callback will be
> > > > removed and you need to manually install it later, which is not
> > > > correct: for the guilty job, we just kick it out before the job
> > > > reset. In job_reset we only reset the other, innocent jobs (and
> > > > unbind the scheduler fence callbacks for them). After the hw
> > > > fence is forcibly set to drv_seq, all hw fences are signaled
> > > > (this is the way of the original logic; I didn't change that).
> > > > When we get to sched_recovery(), it will recover all innocent
> > > > jobs and hook the scheduler fences to new hw fences. That way
> > > > only the guilty job is dropped forever.
> > >
> > > Again, same problem here.
> > >
> > > To be absolutely sure that everything works as expected we need
> > > to do it in the following order:
> > >
> > > 1. Stop the scheduler from feeding more jobs to the hardware when
> > > a job completes.
> > >
> > > 2. Then call hw_job_reset to remove the connection between job
> > > and hardware fence.
> > >
> > > 3. Test if the job is now completed. It is entirely possible that
> > > the job completed while we started the work to reset the
> > > hardware.
> > >
> > > Removing the connection between the hardware fence and the job is
> > > the deadline; if the job completes after that, it is lost.
> > >
> > > 4. Check the karma and kick out the job if we found it guilty.
> > >
> > > 5. Get the whole stuff working again, e.g. reset the hardware,
> > > restart the scheduler etc...
> > >
> > > Regards,
> > > Christian.
> > >
> > > Am 09.05.2017 um 04:45 schrieb Liu, Monk:
> > >
> > > > > -	/* block scheduler */
> > > > > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > -		ring = adev->rings[i];
> > > > > +	/* we start from the ring trigger GPU hang */
> > > > > +	j = job ? job->ring->idx : 0;
> > > > > +
> > > > > +	if (job)
> > > > > +		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
> > > > > +			amd_sched_job_kickout(&job->base);
> > > >
> > > > Well that looks like the wrong order to me. We should probably
> > > > stop the scheduler before trying to mess with anything in the
> > > > job.
> > > >
> > > > [ML] Really not necessary; we have a spin_lock to protect the
> > > > mirror list, nothing will be messed up ...
> > > >
> > > > > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
> > > > > +{
> > > > > +	if (ring)
> > > > > +		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> > > > > +}
> > > > > +
> > > >
> > > > The coding style is completely off.
> > > >
> > > > [ML] I don't know why it looks like wrong coding style on the
> > > > email side, but I'm sure it is correct on my side. Check here:
> > > >
> > > > void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
> > > > {
> > > > 	if (ring)
> > > > 		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> > > > }
> > > >
> > > > Additionally, I don't think that this is a good idea. We should
> > > > probably rather just signal all scheduler fences instead and
> > > > not touch the hardware fence at all.
> > > >
> > > > [ML] Why not touch the hardware fence at all? The
> > > > original/bare-metal GPU reset also signals all rings' hardware
> > > > fences first; I just follow the original logic. The scheduler
> > > > fence will be auto-signaled after the hw fence signals. Any
> > > > problem with that? What's the concern?
> > > >
> > > > > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > -		struct amdgpu_ring *ring = adev->rings[i];
> > > > > +	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > > > > +		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> > > > > 		if (!ring || !ring->sched.thread)
> > > > > 			continue;
> > > > >
> > > > > +		if (job && j != i) {
> > > > > +			kthread_unpark(ring->sched.thread);
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > >
> > > > Please split up that patch a bit further. E.g. first the
> > > > handling to only hw_job_reset the ring in question, then the
> > > > kickout handling.
> > > >
> > > > [ML] No, I don't think so. The kickout must come before
> > > > hw_job_reset, otherwise the scheduler fence callback will be
> > > > removed and you need to manually install it later, which is not
> > > > correct: for the guilty job, we just kick it out before the job
> > > > reset. In job_reset we only reset the other, innocent jobs (and
> > > > unbind the scheduler fence callbacks for them). After the hw
> > > > fence is forcibly set to drv_seq, all hw fences are signaled
> > > > (this is the way of the original logic; I didn't change that).
> > > > When we get to sched_recovery(), it will recover all innocent
> > > > jobs and hook the scheduler fences to new hw fences. That way
> > > > only the guilty job is dropped forever.
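Monk's claim that the kickout is safe even if the fence callback
already removed the job rests on a property of the kernel's list
primitives: list_del_init() leaves the node pointing at itself, so a
second removal is a harmless no-op. A userspace re-implementation
(illustrative only, not the kernel's list.h) shows why:

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's doubly linked list_head. */
struct list_node { struct list_node *prev, *next; };

void list_init(struct list_node *n) { n->prev = n->next = n; }

/* Insert n right after head. */
void list_add(struct list_node *head, struct list_node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n and re-point it at itself, like the kernel's
 * list_del_init(); deleting an already-deleted node is then a no-op,
 * which is what makes amd_sched_job_kickout() safe to call even if
 * the fence callback already removed the job from the mirror list
 * (both run under sched->job_list_lock). */
void list_del_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}
```

The spin_lock around both removal paths serializes them; the
self-linking is what makes whichever path runs second harmless.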
> > > > -----Original Message-----
> > > > From: Christian König [mailto:deathsimple at vodafone.de]
> > > > Sent: Monday, May 08, 2017 9:12 PM
> > > > To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> > > > Cc: Koenig, Christian <Christian.Koenig at amd.com>
> > > > Subject: Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
> > > >
> > > > Am 08.05.2017 um 09:01 schrieb Liu, Monk:
> > > > > @Christian
> > > > >
> > > > > This one is changed to the guilty-job scheme in accordance
> > > > > with your response.
> > > > >
> > > > > BR Monk
> > > > >
> > > > > -----Original Message-----
> > > > > From: Monk Liu [mailto:Monk.Liu at amd.com]
> > > > > Sent: Monday, May 08, 2017 3:00 PM
> > > > > To: amd-gfx at lists.freedesktop.org
> > > > > Cc: Liu, Monk <Monk.Liu at amd.com>
> > > > > Subject: [PATCH] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
> > > > >
> > > > > 1. TDR will kick out the guilty job if its hang count exceeds
> > > > > the threshold given by the kernel parameter "job_hang_limit";
> > > > > that way a bad command stream will not cause GPU hangs
> > > > > indefinitely.
> > > > >
> > > > > By default this threshold is 1, so a job will be kicked out
> > > > > after it hangs.
> > > > >
> > > > > 2. If a job times out, the TDR routine will not reset all
> > > > > scheds/rings; instead it will only reset the given one,
> > > > > indicated by @job of amdgpu_sriov_gpu_reset. That way we
> > > > > don't need to reset and recover each sched/ring if we already
> > > > > know which job caused the GPU hang.
> > > > >
> > > > > 3. Unblock sriov_gpu_reset for the AI family.
> > > > >
> > > > > 4. Don't init an entity for the KIQ.
> > > > >
> > > > > TODO:
> > > > > When a job is considered guilty, we should mark some flag in
> > > > > its fence status flags, and let the UMD side be aware that
> > > > > this fence signaling is not due to job completion but to a
> > > > > job hang.
> > > > > Change-Id: I7b89c19a3de93249db570d0a80522176b1525a09
> > > > > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  4 +++
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 36 ++++++++++++++++++++-------
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  4 +++
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 +++++
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
> > > > >  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 11 +++++++-
> > > > >  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  7 ++++++
> > > > >  8 files changed, 60 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > index 90a69bf..93bcea2 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > > @@ -111,6 +111,7 @@ extern int amdgpu_prim_buf_per_se;
> > > > >  extern int amdgpu_pos_buf_per_se;
> > > > >  extern int amdgpu_cntl_sb_buf_per_se;
> > > > >  extern int amdgpu_param_buf_per_se;
> > > > > +extern int amdgpu_job_hang_limit;
> > > > >
> > > > >  #define AMDGPU_DEFAULT_GTT_SIZE_MB	3072ULL /* 3GB by default */
> > > > >  #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS	3000
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > > > > index b4bbbb3..23afc58 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > > > > @@ -52,6 +52,10 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> > > > >  		struct amd_sched_rq *rq;
> > > > >
> > > > >  		rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL];
> > > > > +
> > > > > +		if (ring == &adev->gfx.kiq.ring)
> > > > > +			continue;
> > > > > +
> > > >
> > > > That looks like a bug fix and should probably go into a
> > > > separate patch.
> > > >
> > > > >  		r = amd_sched_entity_init(&ring->sched, &ctx->rings[i].entity,
> > > > >  					  rq, amdgpu_sched_jobs);
> > > > >  		if (r)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > index 0e5f314..f3990fe 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > @@ -2537,7 +2537,7 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
> > > > >   */
> > > > >  int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> > > > >  {
> > > > > -	int i, r = 0;
> > > > > +	int i, j, r = 0;
> > > > >  	int resched;
> > > > >  	struct amdgpu_bo *bo, *tmp;
> > > > >  	struct amdgpu_ring *ring;
> > > > > @@ -2550,19 +2550,30 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> > > > >  	/* block TTM */
> > > > >  	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> > > > >
> > > > > -	/* block scheduler */
> > > > > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > -		ring = adev->rings[i];
> > > > > +	/* we start from the ring trigger GPU hang */
> > > > > +	j = job ? job->ring->idx : 0;
> > > > > +
> > > > > +	if (job)
> > > > > +		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
> > > > > +			amd_sched_job_kickout(&job->base);
> > > >
> > > > Well that looks like the wrong order to me. We should probably
> > > > stop the scheduler before trying to mess with anything in the
> > > > job.
> > > >
> > > > > +	/* block scheduler */
> > > > > +	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > > > > +		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> > > > >  		if (!ring || !ring->sched.thread)
> > > > >  			continue;
> > > > >
> > > > >  		kthread_park(ring->sched.thread);
> > > > > +
> > > > > +		if (job && j != i)
> > > > > +			continue;
> > > > > +
> > > > > +		/* only do job_reset on the hang ring if @job not NULL */
> > > > >  		amd_sched_hw_job_reset(&ring->sched);
> > > > > -	}
> > > > >
> > > > > -	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> > > > > -	amdgpu_fence_driver_force_completion(adev);
> > > > > +		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> > > > > +		amdgpu_fence_driver_force_completion_ring(ring);
> > > > > +	}
> > > > >
> > > > >  	/* request to take full control of GPU before re-initialization */
> > > > >  	if (job)
> > > > > @@ -2615,11 +2626,16 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> > > > >  	}
> > > > >  	fence_put(fence);
> > > > >
> > > > > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > -		struct amdgpu_ring *ring = adev->rings[i];
> > > > > +	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > > > > +		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> > > > >  		if (!ring || !ring->sched.thread)
> > > > >  			continue;
> > > > >
> > > > > +		if (job && j != i) {
> > > > > +			kthread_unpark(ring->sched.thread);
> > > > > +			continue;
> > > > > +		}
> > > > > +
> > > >
> > > > Please split up that patch a bit further. E.g. first the
> > > > handling to only hw_job_reset the ring in question, then the
> > > > kickout handling.
> > > >
> > > > >  		amd_sched_job_recovery(&ring->sched);
> > > > >  		kthread_unpark(ring->sched.thread);
> > > > >  	}
> > > > > @@ -2629,6 +2645,8 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> > > > >  	if (r) {
> > > > >  		/* bad news, how to tell it to userspace ? */
> > > > >  		dev_info(adev->dev, "GPU reset failed\n");
> > > > > +	} else {
> > > > > +		dev_info(adev->dev, "GPU reset successed!\n");
> > > > >  	}
> > > > >
> > > > >  	adev->gfx.in_reset = false;
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > index 416908a..fd3691a8 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > @@ -112,6 +112,7 @@ int amdgpu_prim_buf_per_se = 0;
> > > > >  int amdgpu_pos_buf_per_se = 0;
> > > > >  int amdgpu_cntl_sb_buf_per_se = 0;
> > > > >  int amdgpu_param_buf_per_se = 0;
> > > > > +int amdgpu_job_hang_limit = 0;
> > > > >
> > > > >  MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
> > > > >  module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
> > > > > @@ -237,6 +238,9 @@ module_param_named(cntl_sb_buf_per_se, amdgpu_cntl_sb_buf_per_se, int, 0444);
> > > > >  MODULE_PARM_DESC(param_buf_per_se, "the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx)");
> > > > >  module_param_named(param_buf_per_se, amdgpu_param_buf_per_se, int, 0444);
> > > > >
> > > > > +MODULE_PARM_DESC(job_hang_limit, "how much time allow a job hang and not drop it (default 0)");
> > > > > +module_param_named(job_hang_limit, amdgpu_job_hang_limit, int ,0444);
> > > > > +
> > > > >  static const struct pci_device_id pciidlist[] = {
> > > > >  #ifdef CONFIG_DRM_AMDGPU_SI
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > > index d7523d1..8de3bd3 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > > @@ -541,6 +541,12 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
> > > > >  	}
> > > > >  }
> > > > >
> > > > > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
> > > > > +{
> > > > > +	if (ring)
> > > > > +		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> > > > > +}
> > > > > +
> > > >
> > > > The coding style is completely off.
> > > >
> > > > Additionally, I don't think that this is a good idea. We should
> > > > probably rather just signal all scheduler fences instead and
> > > > not touch the hardware fence at all.
> > > >
> > > > >  /*
> > > > >   * Common fence implementation
> > > > >   */
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > > > > index 981ef08..03e88c6 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > > > > @@ -76,6 +76,7 @@ struct amdgpu_fence_driver {
> > > > >  int amdgpu_fence_driver_init(struct amdgpu_device *adev);
> > > > >  void amdgpu_fence_driver_fini(struct amdgpu_device *adev);
> > > > >  void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev);
> > > > > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring);
> > > > >
> > > > >  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
> > > > >  				  unsigned num_hw_submission);
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > > > > index 6f4e31f..4e97e6d 100644
> > > > > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > > > > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > > > > @@ -390,9 +390,18 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
> > > > >  					       &s_job->s_fence->cb)) {
> > > > >  			fence_put(s_job->s_fence->parent);
> > > > >  			s_job->s_fence->parent = NULL;
> > > > > +			atomic_dec(&sched->hw_rq_count);
> > > > >  		}
> > > > >  	}
> > > > > -	atomic_set(&sched->hw_rq_count, 0);
> > > > > +	spin_unlock(&sched->job_list_lock);
> > > > > +}
> > > > > +
> > > > > +void amd_sched_job_kickout(struct amd_sched_job *s_job)
> > > > > +{
> > > > > +	struct amd_gpu_scheduler *sched = s_job->sched;
> > > > > +
> > > > > +	spin_lock(&sched->job_list_lock);
> > > > > +	list_del_init(&s_job->node);
> > > > >  	spin_unlock(&sched->job_list_lock);
> > > > >  }
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > > > > index 8cb41d3..59694f3 100644
> > > > > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > > > > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > > > > @@ -81,6 +81,7 @@ struct amd_sched_job {
> > > > >  	struct list_head	node;
> > > > >  	struct delayed_work	work_tdr;
> > > > >  	uint64_t		id;
> > > > > +	atomic_t		karma;
> > > > >  };
> > > > >
> > > > >  extern const struct fence_ops amd_sched_fence_ops_scheduled;
> > > > > @@ -96,6 +97,11 @@ static inline struct amd_sched_fence *to_amd_sched_fence(struct fence *f)
> > > > >  	return NULL;
> > > > >  }
> > > > >
> > > > > +static inline bool amd_sched_invalidate_job(struct amd_sched_job *s_job, int threshold)
> > > > > +{
> > > > > +	return (s_job && atomic_inc_return(&s_job->karma) > threshold);
> > > > > +}
> > > > > +
> > > >
> > > > Again, the coding style is completely off.
> > > >
> > > > Christian.
> > > >
> > > > >  /**
> > > > >   * Define the backend operations called by the scheduler,
> > > > >   * these functions should be implemented in driver side
> > > > > @@ -158,4 +164,5 @@ int amd_sched_job_init(struct amd_sched_job *job,
> > > > >  		       void *owner);
> > > > >  void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched);
> > > > >  void amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
> > > > > +void amd_sched_job_kickout(struct amd_sched_job *s_job);
> > > > >  #endif
> > > > > --
> > > > > 2.7.4
> > > > >
> > > > > _______________________________________________
> > > > > amd-gfx mailing list
> > > > > amd-gfx at lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
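The karma-based kickout added by the patch above can be modeled in
plain C11: each timeout atomically bumps the job's karma counter, and
the job is invalidated once karma exceeds the job_hang_limit
threshold. The struct and function below are stand-ins for
amd_sched_invalidate_job(), not the driver code itself:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for amd_sched_job and its karma field. */
struct sched_job_model {
	atomic_int karma;
};

/* Mirrors the patch's logic: atomically increment karma and report
 * whether the job has now hung more than `threshold` times.  A NULL
 * job is never invalidated. */
bool invalidate_job(struct sched_job_model *s_job, int threshold)
{
	return s_job && atomic_fetch_add(&s_job->karma, 1) + 1 > threshold;
}
```

With the default job_hang_limit of 0, the first timeout already kicks
the job out; raising the threshold tolerates that many hangs before the
job is dropped from the mirror list.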