[PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)

deathsimple@xxxxxxxxxxx (Christian König) · Wed, 10 May 2017 14:35:24 +0200

Am 10.05.2017 um 13:50 schrieb Liu, Monk:
>
> >Because stopping all the scheduler threads takes a moment and it is 
> entirely possible that the job finishes within that time.
>
> Sounds reasonable, but instead not, because if so why not increase 
> timed_out value to + 0.5 sec ? that should cover the time on stop all 
> schedulers,
>

That won't work, see you need to stop the scheduler to figure out if the 
job is actually hung.

> So the point is we should rely on and trust timedout parm since it is 
> the only way to let kernel side aware GPU may hang â?¦
>
> If there is a way to detect GPU hang (actually we have such method, 
> implemented in RLCV side) in kernel driver side, that will convince me
>
> To check the fence at least one time, like:
>
> Void Gpu_reset()
>
> {
>
> Hang = Check_gpu_hang_or_busy();
>
> If (!hang) {
>
> //looks like the job is still running, we can let it run â?¦
>
> Return;
>
> } else {
>
> Signaled = fence_signaled(job);
>
> If (signaled)
>
> Return; //do nothing
>
> }
>
> â?¦
>
> }
>

Well what we actually need to do is to stop the hardware from doing any 
processing, but that's easier said than done.

> I can say that we may still have chance to implement this 
> hang_detection in kernel driver side, but it is not easy so currently 
> I only trust timedout event
>

I still don't think that this is a good idea. At least on bare metal we 
need more checks than that to determine if we really should reset the 
hardware or not.

Regards,
Christian.

> BR Monk
>
> *From:*Christian KÃ¶nig [mailto:deathsimple at vodafone.de]
> *Sent:* Wednesday, May 10, 2017 7:14 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
>     You still avoid my question: whatâ??s the theoretical backend you
>     that you think check once instead of twice or even more is
>     good**before** hw_job_reset() ?
>
> Because stopping all the scheduler threads takes a moment and it is 
> entirely possible that the job finishes within that time.
>
>
>     1) if you check the fence and found it not signaled, then you will
>     call hw_job_reset(), but there is still chance that between your
>     check and the hw_job_reset() the
>
>     Sched fence could signaled , isnâ??t it ?  you still cannot avoid
>     such race condition
>
> Crap, you're right. We would indeed need to check twice and that 
> wouldn't be consistent anyway.
>
> Once after stopping the schedulers and before hw_job_reset() because 
> then we can just start the schedulers again and continue as if nothing 
> has happened.
>
> And another time after calling hw_job_reset() if we want to set the 
> error code.
>
>
>     Donâ??t forget your approach still have chance to hit the race
>     condition, and to me I donâ??t think the race condition matters
>     thatâ??s why I donâ??t even consider it
>
> Yeah, you convinced me. Please go ahead with the current approach, but 
> at least add a comment that we might want to improve that.
>
> Regards,
> Christian.
>
> Am 10.05.2017 um 13:02 schrieb Liu, Monk:
>
>     > Checking a second time is pointless since it can't signal any
>     more after calling amd_sched_hw_job_reset().
>
>     [ML] you seems not response to me â?¦ I of cause know fence cannot
>     signal after hw_job_reset() â?¦.
>
>     My question is , before you call hw_job_reset(), why you want to
>     check the fence ? why not check twice ?
>
>     You still avoid my question: whatâ??s the theoretical backend you
>     that you think check once instead of twice or even more is
>     good**before** hw_job_reset() ?
>
>     >No, the timeout is pretty meaningless. It's just the trigger that
>     we need to do something.
>
>     2) And even it signaled after entering gpu_reset(), it will
>     automatically done like normal cases, thatâ??s good. Why remove
>     those callback instead ?
>
>     > No, that's absolutely not good. We don't know if it's the
>     hardware which results in the job being signaled or our reset code.
>
>     [ML] you are wrong, for SR-IOV case, the timeout is all that
>     matters, because one VF can only have such time slice within
>     timeout, and Iâ??m doing the TDR on SR-IOV so donâ??t
>
>     Always looks things with bare-metal mind
>
>     >Otherwise we have a race condition here where we can't determine
>     if the reset finished the job or if it did just on it's own while
>     we stopped the scheduler.
>
>     [ML] you are wrong :
>     1) if you check the fence and found it not signaled, then you will
>     call hw_job_reset(), but there is still chance that between your
>     check and the hw_job_reset() the
>
>     Sched fence could signaled , isnâ??t it ? you still cannot avoid
>     such race condition
>
>     2) I donâ??t care if the fence is signaled due to its own finish or
>     because we force the hw fence signaled, for SR-IOV case, as long
>     as the job exceeds timeout, we consider
>
>     It hang.
>
>     3) Even for bare-metal case, you still cannot sure if the fence is
>     signaled due to its own or hw_fence force signal, reason is in #1)
>
>     >No, you are insist on a vague rules not strict, like I said, what is
>     the theoretic to backend your approach that only check once on the
>     in question job ? why not check again if not signaled ?
>
>     >Because we have removed the connection between the job and the
>     hardware fence and because of this the job can never signal.
>
>     [ML] like I said, before you call hw_job_reset(), you only check
>     on the job once, why not check again and again ?  I donâ??t see you
>     have a reason to only check once, and
>
>     If you donâ??t have such reason I think you should not check at all.
>
>     If you have such reason, prove me that only check once is good and
>     enough.
>
>     Donâ??t forget your approach still have chance to hit the race
>     condition, and to me I donâ??t think the race condition matters
>     thatâ??s why I donâ??t even consider it
>
>     BR Monk
>
>     *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
>     Behalf Of *Christian KÃ¶nig
>     *Sent:* Wednesday, May 10, 2017 6:26 PM
>     *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>     Koenig, Christian <Christian.Koenig at amd.com>
>     <mailto:Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>
>     *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job
>     TDR (V2)
>
>     Am 10.05.2017 um 12:05 schrieb Liu, Monk:
>
>         [ML] yes, but we cannot guarantee the job is 100% really hang
>         when entering gpu_reset(), we can only trust our
>         amdgpu_job_timeout as a deadline for each job.
>
>         You approach that check the fence first before charge it as
>         guilty/hang is incorrect looks to me because why you not check
>         it twice, triple, and even more loops ?
>
>         Because the job can't signal any more after calling
>         amd_sched_hw_job_reset().
>
>         [ML] No â?¦ thatâ??s where I think your approach is vague:
>
>         1) see that you check after scheduler stopped, see if job
>         signaled, my question is if the job is not signaled (like most
>         usual case)
>
>         Why you not check it again and again ?  maybe the second time
>         you will find it signaled â?¦
>
>
>     Checking a second time is pointless since it can't signal any more
>     after calling amd_sched_hw_job_reset().
>
>
>
>         My point is the checking here is meaningless, we already have
>         timedout for the guard.
>
>     No, the timeout is pretty meaningless. It's just the trigger that
>     we need to do something.
>
>     But to determine what to do we first need to stop the scheduler,
>     remove the hardware fence and THEN check the current status.
>
>     Otherwise we have a race condition here where we can't determine
>     if the reset finished the job or if it did just on it's own while
>     we stopped the scheduler.
>
>
>
>         2) And even it signaled after entering gpu_reset(), it will
>         automatically done like normal cases, thatâ??s good. Why remove
>         those callback instead ?
>
>     No, that's absolutely not good. We don't know if it's the hardware
>     which results in the job being signaled or our reset code.
>
>
>
>
>
>
>             So I refuse to check if @job is just signaled in
>             gpu_reset, because this action is vague (and no one can
>             guarantee the job wonâ??t signal during gpu_reset, we should
>             not argue on this event â?¦), I prefer clean and restrict rules.
>
>         Yeah, completely agree that we need to have struct rules for
>         that. That's why I insists on doing this :)
>
>         No, you are insist on a vague rules not strict, like I said,
>         what is the theoretic to backend your approach that only check
>         once on the in question job ? why not check again if not
>         signaled ?
>
>     Because we have removed the connection between the job and the
>     hardware fence and because of this the job can never signal.
>
>     Regards,
>     Christian.
>
>
>
>         I donâ??t agree this approach is clean and strict. You are abuse
>         timedout parameter.
>
>
>
>         See I just want to avoid problems for the case that the job
>         signaled while we stop the scheduler (because stopping the
>         scheduler actually can take a moment).
>
>         Because when this happened the scheduler could already have
>         pushed the next job to the hardware and then we abort it with
>         the GPU reset and might create more problems than we solve.
>
>
>
>
>
>
>             [ML] I donâ??t see my approach will have chance to fence
>             twiceâ?¦ on the contrast I think my approach is more clear:
>             no matter the in question job finally signaled or not, I
>             just kick it out from mirror-list
>
>             Without remove the callback from hw fence, that way even
>             it really signaled during the gpu_reset() period the logic
>             is still perfect and its sched fence will act like usual â?¦
>
>         We want to set an error code on the job before signaling it
>         don't we? So we need to be sure how and when the job is
>         signaled as finished.
>
>         I mean letting it signal when we force the hardware fence to
>         complete will work as well, but I still think that this isn't
>         as clean as signaling it manually.
>
>         Please also see the helper function the Intel guys introduced
>         drm_fence_set_error(), we will run into a BUG_ON if we can't
>         guarantee the order of execution here.
>
>         Regards,
>         Christian.
>
>         Am 10.05.2017 um 06:00 schrieb Liu, Monk:
>
>             Christian,
>
>             Looks like we need more discuss with itâ?¦
>
>             Here is your approach:
>
>             1. Stop the scheduler from feeding more jobs to the
>             hardware when a jobs completes.  //this is where I agree
>             with you
>
>             2. Then call hw_job_reset to remove the connection between
>             job and hardware fence.
>
>             3. Test if job is now completed. It is entirely possible
>             that the job completed while we started the work to reset
>             the hardware.
>
>             Removing the connection between the hardware fence and the
>             job is the deadline, if the job completes after that it is
>             lost.
>
>             4. Check the karma and kick out the job if we found it guilty.
>
>             5. Get the whole stuff working again, e.g. reset the
>             hardware, restart the scheduler etc...
>
>             */[ML]: One thing I agree to change with your way: in
>             gpu_reset() we should first stop the in question ringâ??s
>             scheduler (not the all) before kick out the guilty job./*
>
>             > Indeed, but I still think that this is a bad approach
>             cause we then reset the hardware without a good reason.
>
>             [ML] yes, but we cannot guarantee the job is 100% really
>             hang when entering gpu_reset(), we can only trust our
>             amdgpu_job_timeout as a deadline for each job.
>
>             You approach that check the fence first before charge it
>             as guilty/hang is incorrect looks to me because why you
>             not check it twice, triple, and even more loops ?
>
>             You check it one time and you found it just signaled
>             thatâ??s great and lucky(really luckyâ?¦), But what if it
>             didnâ??t signaled (like most usual case) , why not check it
>             again and again ? do you have a theoretic to support on
>             how much time you need to check before finally consider it
>             hang ? No I donâ??t think you have so please just cut this
>             unnecessary checking, we already use amdgpu_job_timeout to
>             give the deadline of each job.
>
>             So I refuse to check if @job is just signaled in
>             gpu_reset, because this action is vague (and no one can
>             guarantee the job wonâ??t signal during gpu_reset, we should
>             not argue on this event â?¦), I prefer clean and restrict rules.
>
>             >Force completion is not so much of the issue, but rather
>             in which order you do things.
>
>             >See the original code first stops the scheduler and
>             removes the connection between hardware fence and job in
>             an atomic manner. And THEN forces the hardware fence to
>             complete.
>
>             >This way we could be sure that nothing happened in
>             parallel, e.g. that we don't try to signal the fence twice
>             or something like that.
>
>             [ML] I donâ??t see my approach will have chance to fence
>             twiceâ?¦ on the contrast I think my approach is more clear:
>             no matter the in question job finally signaled or not, I
>             just kick it out from mirror-list
>
>             Without remove the callback from hw fence, that way even
>             it really signaled during the gpu_reset() period the logic
>             is still perfect and its sched fence will act like usual â?¦
>
>             Please point out where or how my approach will go wrong
>             like â??e.g. that we don't try to signal the fence twice or
>             something like that.â??, otherwise I cannot be persuaded and
>             fix my way â?¦
>
>             At last. I run the TDR test and it ends up with Hypervisor
>             side error, the guest side is all perfect, here is what I ran:
>
>             The vk_example and vulkan CTS test under MCBP case, we
>             have all kinds of hang on compute ring, without TDR this
>             test wonâ??t suffer for 5 seconds, and with TDR although
>             MCBP is buggy now but we can
>
>             Finish this test (of cause test result is mismatch due to
>             MCBP issue), and there are tongs of job_timed_out in
>             dmesg, but guest driver didnâ??t have any error report.  I
>             was also surprised this behave really stable â?¦
>
>             The second test the using â??watchâ?? to trigger a gpu hang
>             (bad command stream) every 2 seconds, with
>             amdgpu_job_timeout also set to 2seconds, I can run it till
>             hypervisor side hit VF FLR error, and guest side
>
>             Still runs perfectly and nothing wrong happened, with hw
>             fence seq number I can tell there are 3000 loops of TDR
>             finished before Hypervisor hit error.
>
>             BR Monk
>
>             *From:*Koenig, Christian
>             *Sent:* Tuesday, May 09, 2017 8:52 PM
>             *To:* Liu, Monk <Monk.Liu at amd.com>
>             <mailto:Monk.Liu at amd.com>; Christian KÃ¶nig
>             <deathsimple at vodafone.de>
>             <mailto:deathsimple at vodafone.de>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement
>             guilty job TDR (V2)
>
>                 [ML] if the job complete, the jobâ??s sched fence
>                 callback will take this spin_lock and remove itself
>                 from mirror_list, so we are still safe to call
>                 amd_sched_job_kickout(), and it will do nothing if so
>
>             Indeed, but I still think that this is a bad approach
>             cause we then reset the hardware without a good reason.
>
>
>
>
>
>                 Besides, original logic also force complete the hw
>                 fence, and it works well â?¦
>
>             Force completion is not so much of the issue, but rather
>             in which order you do things.
>
>             See the original code first stops the scheduler and
>             removes the connection between hardware fence and job in
>             an atomic manner. And THEN forces the hardware fence to
>             complete.
>
>             This way we could be sure that nothing happened in
>             parallel, e.g. that we don't try to signal the fence twice
>             or something like that.
>
>
>
>
>
>                 State like â??You are missing that it is entirely
>                 possible that the job will complete while we are
>                 trying to kick it out.â??
>
>             Sorry I should have been more clear.
>
>
>
>
>
>                 Is not a good reason to reject my approach, because
>                 that is okay if the job just completed â?¦
>
>             We usually try to take a defensive approach, so stopping
>             everything, removing the hardware fence connection and
>             then explicitly kicking out the job in question sounds
>             much better than doing it implicitly with the hardware
>             fence completion.
>
>             Even when this works (which I'm still not sure of) that is
>             a really awkward and hard to understand approach.
>
>             Regards,
>             Christian.
>
>             Am 09.05.2017 um 13:58 schrieb Liu, Monk:
>
>                 You are missing that it is entirely possible that the
>                 job will complete while we are trying to kick it out.
>
>                 [ML] if the job complete, the jobâ??s sched fence
>                 callback will take this spin_lock and remove itself
>                 from mirror_list, so we are still safe to call
>                 amd_sched_job_kickout(), and it will do nothing if so
>
>                 Please go through the whole steps again,
>
>                 Besides, original logic also force complete the hw
>                 fence, and it works well â?¦
>
>                 I donâ??t see the solid reason why you insist on your
>                 approach, please go through the steps again  and give
>                 me the details about where is incorrect than I can fix it
>
>                 State like â??You are missing that it is entirely
>                 possible that the job will complete while we are
>                 trying to kick it out.â?? Is not a good reason to reject
>                 my approach, because that is okay if the job just
>                 completed â?¦
>
>                 BR Monk
>
>
>
>
>
>                 *From:*Koenig, Christian
>                 *Sent:* Tuesday, May 09, 2017 3:49 PM
>                 *To:* Liu, Monk <Monk.Liu at amd.com>
>                 <mailto:Monk.Liu at amd.com>; Christian KÃ¶nig
>                 <deathsimple at vodafone.de>
>                 <mailto:deathsimple at vodafone.de>;
>                 amd-gfx at lists.freedesktop.org
>                 <mailto:amd-gfx at lists.freedesktop.org>
>                 *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement
>                 guilty job TDR (V2)
>
>                     [ML] Really not necessary, we have spin_lock to
>                     protect the mirror-list, nothing will be messed up ...
>
>                 You are missing that it is entirely possible that the
>                 job will complete while we are trying to kick it out.
>
>
>
>
>
>
>                     [ML] why don't touch hardware fence at all ?  the
>                     original/bare-metal gpu reset also signal all
>                     ring's hardware fence first, I just follow the
>                     original logic ...
>                     Scheduler fence will be auto signaled after hw
>                     fence signaled, any problem with that ? what's the
>                     concern ?
>
>                 The hardware runs async to the CPU which tries to
>                 reset it, so we need to be careful in which order
>                 things are done.
>
>
>
>
>
>
>                     [ML] No I don't think so, the kickout must be
>                     prior to the hw_job_reset, otherwise the scheduler
>                     fence callback will be removed and you need
>                     manually install it later , which is not correct:
>                     For the guity job, we just kick it out before job
>                     reset, in job_reset we only reset other innocent
>                     jobs( and unbind the scheduler fence callback for
>                     them), after hw fence
>                     Forcely set to drv_seq, all hw fence are signaled
>                     (this is the way of original logic, I didn't
>                     change that).  When go to sched_recovery(), it
>                     will recover all innocent job and hook
>                     The scheduler fence with new hw fence.  That way
>                     only the guilty job is dropped forever.
>
>                 Again same problem here.
>
>                 To be absolutely sure that everything works as
>                 expected we need to do it in the following order:
>
>                 1. Stop the scheduler from feeding more jobs to the
>                 hardware when a jobs completes.
>
>                 2. Then call hw_job_reset to remove the connection
>                 between job and hardware fence.
>
>                 3. Test if job is now completed. It is entirely
>                 possible that the job completed while we started the
>                 work to reset the hardware.
>
>                 Removing the connection between the hardware fence and
>                 the job is the deadline, if the job completes after
>                 that it is lost.
>
>                 4. Check the karma and kick out the job if we found it
>                 guilty.
>
>                 5. Get the whole stuff working again, e.g. reset the
>                 hardware, restart the scheduler etc...
>
>                 Regards,
>                 Christian.
>
>                 Am 09.05.2017 um 04:45 schrieb Liu, Monk:
>
>                     >
>                     > -     /* block scheduler */
>                     > -     for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                     > -             ring = adev->rings[i];
>                     > +     /* we start from the ring trigger GPU hang */
>                     > +     j = job ? job->ring->idx : 0;
>                     > +
>                     > +     if (job)
>                     > +             if
>                     (amd_sched_invalidate_job(&job->base,
>                     amdgpu_job_hang_limit))
>                     > + amd_sched_job_kickout(&job->base);
>
>                     Well that looks like the wrong order to me. We
>                     should probably stop the scheduler before trying
>                     to mess anything with the job.
>
>                     [ML] Really not necessary, we have spin_lock to
>                     protect the mirror-list, nothing will be messed up ...
>
>
>                     >
>                     > +void
>                     amdgpu_fence_driver_force_completion_ring(struct
>                     amdgpu_ring
>                     > +*ring) {
>                     > +     if (ring)
>                     > +             amdgpu_fence_write(ring,
>                     ring->fence_drv.sync_seq); }
>                     > +
>
>                     The coding style is completely off.
>
>                     [ML] I don't know why at email side it looks wrong
>                     coding style, but I'm sure it is correct at my
>                     side, check here:
>                     void
>                     amdgpu_fence_driver_force_completion_ring(struct
>                     amdgpu_ring *ring)
>                     {
>                             if (ring)
>                                     amdgpu_fence_write(ring,
>                     ring->fence_drv.sync_seq);
>                     }
>
>                     Additional to that I don't think that this is a
>                     good idea. We should probably rather just signal
>                     all scheduler fences instead and don't touch the
>                     hardware fence at all.
>
>                     [ML] why don't touch hardware fence at all ? the
>                     original/bare-metal gpu reset also signal all
>                     ring's hardware fence first, I just follow the
>                     original logic ...
>                     Scheduler fence will be auto signaled after hw
>                     fence signaled, any problem with that ? what's the
>                     concern ?
>
>
>                     > -     for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                     > -             struct amdgpu_ring *ring =
>                     adev->rings[i];
>                     > +     for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>                     > +             ring = adev->rings[i %
>                     AMDGPU_MAX_RINGS];
>                     >                if (!ring || !ring->sched.thread)
>                     >                        continue;
>                     >
>                     > +             if (job && j != i) {
>                     > + kthread_unpark(ring->sched.thread);
>                     > +                     continue;
>                     > +             }
>                     > +
>
>                     Please split up that patch a bit further. E.g.
>                     first the handling to only hw_job_reset the ring
>                     in question, then the kickout handling.
>
>                     [ML] No I don't think so, the kickout must be
>                     prior to the hw_job_reset, otherwise the scheduler
>                     fence callback will be removed and you need
>                     manually install it later , which is not correct:
>                     For the guity job, we just kick it out before job
>                     reset, in job_reset we only reset other innocent
>                     jobs( and unbind the scheduler fence callback for
>                     them), after hw fence
>                     Forcely set to drv_seq, all hw fence are signaled
>                     (this is the way of original logic, I didn't
>                     change that).  When go to sched_recovery(), it
>                     will recover all innocent job and hook
>                     The scheduler fence with new hw fence.  That way
>                     only the guilty job is dropped forever.
>
>                     -----Original Message-----
>                     From: Christian KÃ¶nig
>                     [mailto:deathsimple at vodafone.de]
>                     Sent: Monday, May 08, 2017 9:12 PM
>                     To: Liu, Monk <Monk.Liu at amd.com>
>                     <mailto:Monk.Liu at amd.com>;
>                     amd-gfx at lists.freedesktop.org
>                     <mailto:amd-gfx at lists.freedesktop.org>
>                     Cc: Koenig, Christian <Christian.Koenig at amd.com>
>                     <mailto:Christian.Koenig at amd.com>
>                     Subject: Re: [PATCH 4/4]
>                     drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
>                     Am 08.05.2017 um 09:01 schrieb Liu, Monk:
>                     > @Christian
>                     >
>                     > This one is changed to guilty job scheme
>                     accordingly with your
>                     > response
>                     >
>                     > BR Monk
>                     >
>                     > -----Original Message-----
>                     > From: Monk Liu [mailto:Monk.Liu at amd.com]
>                     > Sent: Monday, May 08, 2017 3:00 PM
>                     > To: amd-gfx at lists.freedesktop.org
>                     <mailto:amd-gfx at lists.freedesktop.org>
>                     > Cc: Liu, Monk <Monk.Liu at amd.com>
>                     <mailto:Monk.Liu at amd.com>
>                     > Subject: [PATCH] drm/amdgpu/SRIOV:implement
>                     guilty job TDR (V2)
>                     >
>                     > 1,TDR will kickout guilty job if it hang exceed
>                     the threshold of the given one from kernel
>                     paramter "job_hang_limit", that way a bad command
>                     stream will not infinitly cause GPU hang.
>                     >
>                     > by default this threshold is 1 so a job will be
>                     kicked out after it hang.
>                     >
>                     > 2,if a job timeout TDR routine will not reset
>                     all sched/ring, instead if will only reset on the
>                     givn one which is indicated by @job of
>                     amdgpu_sriov_gpu_reset, that way we don't need to
>                     reset and recover each sched/ring if we already
>                     know which job cause GPU hang.
>                     >
>                     > 3,unblock sriov_gpu_reset for AI family.
>                     > 4,don't init entity for KIQ
>                     >
>                     > TODO:
>                     > when a job is considered as guilty, we should
>                     mark some flag in its fence status flag, and let
>                     UMD side aware that this fence signaling is not
>                     due to job complete but job hang.
>                     >
>                     > Change-Id: I7b89c19a3de93249db570d0a80522176b1525a09
>                     > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>                     <mailto:Monk.Liu at amd.com>
>                     > ---
>                     > drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 1 +
>                     > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |
>                     4 +++
>                     > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |
>                     36 ++++++++++++++++++++-------
>                     > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |
>                     4 +++
>                     > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |
>                     6 +++++
>                     > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      | 1 +
>                     > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |
>                     11 +++++++-
>                     drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 7
>                     ++++++
>                     >   8 files changed, 60 insertions(+), 10 deletions(-)
>                     >
>                     > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>                     > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>                     > index 90a69bf..93bcea2 100644
>                     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>                     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>                     > @@ -111,6 +111,7 @@ extern int
>                     amdgpu_prim_buf_per_se;  extern int
>                     > amdgpu_pos_buf_per_se;  extern int
>                     amdgpu_cntl_sb_buf_per_se;  extern
>                     > int amdgpu_param_buf_per_se;
>                     > +extern int amdgpu_job_hang_limit;
>                     >
>                     >   #define AMDGPU_DEFAULT_GTT_SIZE_MB         
>                     3072ULL /* 3GB by default */
>                     >   #define
>                     AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS              3000
>                     > diff --git
>                     a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>                     > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>                     > index b4bbbb3..23afc58 100644
>                     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>                     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>                     > @@ -52,6 +52,10 @@ static int
>                     amdgpu_ctx_init(struct amdgpu_device *adev, struct
>                     amdgpu_ctx *ctx)
>                     >                struct amd_sched_rq *rq;
>                     >
>                     >                rq =
>                     &ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL];
>                     > +
>                     > +             if (ring == &adev->gfx.kiq.ring)
>                     > +                     continue;
>                     > +
>
>                     That looks like a bug fix and should probably go
>                     into a separate patch.
>
>                     >                r =
>                     amd_sched_entity_init(&ring->sched,
>                     &ctx->rings[i].entity,
>                     > rq, amdgpu_sched_jobs);
>                     >                if (r)
>                     > diff --git
>                     a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                     > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                     > index 0e5f314..f3990fe 100644
>                     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>                     > @@ -2537,7 +2537,7 @@ static int
>                     amdgpu_recover_vram_from_shadow(struct
>                     amdgpu_device *adev,
>                     >    */
>                     >   int amdgpu_sriov_gpu_reset(struct
>                     amdgpu_device *adev, struct amdgpu_job *job)  {
>                     > -     int i, r = 0;
>                     > +     int i, j, r = 0;
>                     >        int resched;
>                     >        struct amdgpu_bo *bo, *tmp;
>                     >        struct amdgpu_ring *ring;
>                     > @@ -2550,19 +2550,30 @@ int
>                     amdgpu_sriov_gpu_reset(struct amdgpu_device *adev,
>                     struct amdgpu_job *job)
>                     >        /* block TTM */
>                     >        resched =
>                     ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>                     >
>                     > -     /* block scheduler */
>                     > -     for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                     > -             ring = adev->rings[i];
>                     > +     /* we start from the ring trigger GPU hang */
>                     > +     j = job ? job->ring->idx : 0;
>                     > +
>                     > +     if (job)
>                     > +             if
>                     (amd_sched_invalidate_job(&job->base,
>                     amdgpu_job_hang_limit))
>                     > + amd_sched_job_kickout(&job->base);
>
>                     Well that looks like the wrong order to me. We
>                     should probably stop the scheduler before trying
>                     to mess anything with the job.
>
>                     >
>                     > +     /* block scheduler */
>                     > +     for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>                     > +             ring = adev->rings[i %
>                     AMDGPU_MAX_RINGS];
>                     >                if (!ring || !ring->sched.thread)
>                     >                        continue;
>                     >
>                     > kthread_park(ring->sched.thread);
>                     > +
>                     > +             if (job && j != i)
>                     > +                     continue;
>                     > +
>                     > +             /* only do job_reset on the hang
>                     ring if @job not NULL */
>                     > amd_sched_hw_job_reset(&ring->sched);
>                     > -     }
>                     >
>                     > -     /* after all hw jobs are reset, hw fence
>                     is meaningless, so force_completion */
>                     > - amdgpu_fence_driver_force_completion(adev);
>                     > +             /* after all hw jobs are reset, hw
>                     fence is meaningless, so force_completion */
>                     > + amdgpu_fence_driver_force_completion_ring(ring);
>                     > +     }
>                     >
>                     >        /* request to take full control of GPU
>                     before re-initialization  */
>                     >        if (job)
>                     > @@ -2615,11 +2626,16 @@ int
>                     amdgpu_sriov_gpu_reset(struct amdgpu_device *adev,
>                     struct amdgpu_job *job)
>                     >        }
>                     >        fence_put(fence);
>                     >
>                     > -     for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                     > -             struct amdgpu_ring *ring =
>                     adev->rings[i];
>                     > +     for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>                     > +             ring = adev->rings[i %
>                     AMDGPU_MAX_RINGS];
>                     >                if (!ring || !ring->sched.thread)
>                     >                        continue;
>                     >
>                     > +             if (job && j != i) {
>                     > + kthread_unpark(ring->sched.thread);
>                     > +                     continue;
>                     > +             }
>                     > +
>
>                     Please split up that patch a bit further. E.g.
>                     first the handling to only hw_job_reset the ring
>                     in question, then the kickout handling.
>
>                     > amd_sched_job_recovery(&ring->sched);
>                     > kthread_unpark(ring->sched.thread);
>                     >        }
>                     > @@ -2629,6 +2645,8 @@ int
>                     amdgpu_sriov_gpu_reset(struct amdgpu_device *adev,
>                     struct amdgpu_job *job)
>                     >        if (r) {
>                     >                /* bad news, how to tell it to
>                     userspace ? */
>                     >                dev_info(adev->dev, "GPU reset
>                     failed\n");
>                     > +     } else {
>                     > +             dev_info(adev->dev, "GPU reset
>                     successed!\n");
>                     >        }
>                     >
>                     >        adev->gfx.in_reset = false;
>                     > diff --git
>                     a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>                     > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>                     > index 416908a..fd3691a8 100644
>                     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>                     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>                     > @@ -112,6 +112,7 @@ int amdgpu_prim_buf_per_se =
>                     0;  int
>                     > amdgpu_pos_buf_per_se = 0;  int
>                     amdgpu_cntl_sb_buf_per_se = 0;  int
>                     > amdgpu_param_buf_per_se = 0;
>                     > +int amdgpu_job_hang_limit = 0;
>                     >
>                     >   MODULE_PARM_DESC(vramlimit, "Restrict VRAM for
>                     testing, in
>                     > megabytes");  module_param_named(vramlimit,
>                     amdgpu_vram_limit, int,
>                     > 0600); @@ -237,6 +238,9 @@
>                     module_param_named(cntl_sb_buf_per_se,
>                     > amdgpu_cntl_sb_buf_per_se, int, 0444);
>                     > MODULE_PARM_DESC(param_buf_per_se, "the size of
>                     Off-Chip Pramater
>                     > Cache per Shader Engine (default depending on
>                     gfx)");
>                     > module_param_named(param_buf_per_se,
>                     amdgpu_param_buf_per_se, int,
>                     > 0444);
>                     >
>                     > +MODULE_PARM_DESC(job_hang_limit, "how much time
>                     allow a job hang and
>                     > +not drop it (default 0)");
>                     module_param_named(job_hang_limit,
>                     > +amdgpu_job_hang_limit, int ,0444);
>                     > +
>                     >
>                     >   static const struct pci_device_id pciidlist[]
>                     = {  #ifdef
>                     > CONFIG_DRM_AMDGPU_SI diff --git
>                     > a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>                     > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>                     > index d7523d1..8de3bd3 100644
>                     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>                     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>                     > @@ -541,6 +541,12 @@ void
>                     amdgpu_fence_driver_force_completion(struct
>                     amdgpu_device *adev)
>                     >        }
>                     >   }
>                     >
>                     > +void
>                     amdgpu_fence_driver_force_completion_ring(struct
>                     amdgpu_ring
>                     > +*ring) {
>                     > +     if (ring)
>                     > +             amdgpu_fence_write(ring,
>                     ring->fence_drv.sync_seq); }
>                     > +
>
>                     The coding style is completely off.
>
>                     Additional to that I don't think that this is a
>                     good idea. We should probably rather just signal
>                     all scheduler fences instead and don't touch the
>                     hardware fence at all.
>
>                     >   /*
>                     >    * Common fence implementation
>                     >    */
>                     > diff --git
>                     a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>                     > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>                     > index 981ef08..03e88c6 100644
>                     > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>                     > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>                     > @@ -76,6 +76,7 @@ struct amdgpu_fence_driver {  int
>                     > amdgpu_fence_driver_init(struct amdgpu_device
>                     *adev);  void
>                     > amdgpu_fence_driver_fini(struct amdgpu_device
>                     *adev);  void
>                     > amdgpu_fence_driver_force_completion(struct
>                     amdgpu_device *adev);
>                     > +void
>                     amdgpu_fence_driver_force_completion_ring(struct
>                     amdgpu_ring
>                     > +*ring);
>                     >
>                     >   int amdgpu_fence_driver_init_ring(struct
>                     amdgpu_ring *ring,
>                     >                                  unsigned
>                     num_hw_submission);
>                     > diff --git
>                     a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>                     > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>                     > index 6f4e31f..4e97e6d 100644
>                     > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>                     > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>                     > @@ -390,9 +390,18 @@ void
>                     amd_sched_hw_job_reset(struct amd_gpu_scheduler
>                     *sched)
>                     > &s_job->s_fence->cb)) {
>                     > fence_put(s_job->s_fence->parent);
>                     > s_job->s_fence->parent = NULL;
>                     > + atomic_dec(&sched->hw_rq_count);
>                     >                }
>                     >        }
>                     > - atomic_set(&sched->hw_rq_count, 0);
>                     > + spin_unlock(&sched->job_list_lock);
>                     > +}
>                     > +
>                     > +void amd_sched_job_kickout(struct amd_sched_job
>                     *s_job) {
>                     > +     struct amd_gpu_scheduler *sched =
>                     s_job->sched;
>                     > +
>                     > + spin_lock(&sched->job_list_lock);
>                     > +     list_del_init(&s_job->node);
>                     > spin_unlock(&sched->job_list_lock);
>                     >   }
>                     >
>                     > diff --git
>                     a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>                     > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>                     > index 8cb41d3..59694f3 100644
>                     > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>                     > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>                     > @@ -81,6 +81,7 @@ struct amd_sched_job {
>                     >        struct list_head node;
>                     >        struct delayed_work work_tdr;
>                     >        uint64_t                        id;
>                     > +     atomic_t karma;
>                     >   };
>                     >
>                     >   extern const struct fence_ops
>                     amd_sched_fence_ops_scheduled; @@ -96,6 +97,11 @@
>                     static inline struct amd_sched_fence
>                     *to_amd_sched_fence(struct fence *f)
>                     >        return NULL;
>                     >   }
>                     >
>                     > +static inline bool
>                     amd_sched_invalidate_job(struct amd_sched_job
>                     > +*s_job, int threshold) {
>                     > +     return (s_job &&
>                     atomic_inc_return(&s_job->karma) > threshold); }
>                     > +
>
>                     Again coding style is completely off.
>
>                     Christian.
>
>                     >   /**
>                     >    * Define the backend operations called by the
>                     scheduler,
>                     >    * these functions should be implemented in
>                     driver side @@ -158,4 +164,5 @@ int
>                     amd_sched_job_init(struct amd_sched_job *job,
>                     >                       void *owner);
>                     >   void amd_sched_hw_job_reset(struct
>                     amd_gpu_scheduler *sched);  void
>                     > amd_sched_job_recovery(struct amd_gpu_scheduler
>                     *sched);
>                     > +void amd_sched_job_kickout(struct amd_sched_job
>                     *s_job);
>                     >   #endif
>                     > --
>                     > 2.7.4
>                     >
>                     > _______________________________________________
>                     > amd-gfx mailing list
>                     > amd-gfx at lists.freedesktop.org
>                     <mailto:amd-gfx at lists.freedesktop.org>
>                     >
>                     https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
>
>
>
>
>
>
>
>             _______________________________________________
>
>             amd-gfx mailing list
>
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>
>             https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170510/ab4ce273/attachment-0001.html>