On 10.05.2017 at 13:50, Liu, Monk wrote:

> > Because stopping all the scheduler threads takes a moment and it is entirely possible that the job finishes within that time.
>
> Sounds reasonable, but actually no, because if so, why not increase the timed_out value by 0.5 sec? That should cover the time needed to stop all the schedulers.

That won't work; see, you need to stop the scheduler to figure out if the job is actually hung.

> So the point is we should rely on and trust the timedout parameter, since it is the only way to make the kernel side aware that the GPU may hang ...
>
> If there is a way to detect a GPU hang in the kernel driver (we actually have such a method, implemented on the RLCV side), that would convince me
>
> to check the fence at least one time, like:
>
> void gpu_reset()
> {
>     hang = check_gpu_hang_or_busy();
>     if (!hang) {
>         /* looks like the job is still running, we can let it run ... */
>         return;
>     } else {
>         signaled = fence_signaled(job);
>         if (signaled)
>             return; /* do nothing */
>     }
>     ...
> }

Well, what we actually need to do is to stop the hardware from doing any processing, but that's easier said than done.

> I can say that we may still have a chance to implement this hang detection on the kernel driver side, but it is not easy, so currently I only trust the timedout event.

I still don't think that this is a good idea. At least on bare metal we need more checks than that to determine if we really should reset the hardware or not.

Regards,
Christian.

> BR Monk
>
> *From:* Christian König [mailto:deathsimple at vodafone.de]
> *Sent:* Wednesday, May 10, 2017 7:14 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> You still avoid my question: what is the theoretical backing for your view that checking once instead of twice or even more is good **before** hw_job_reset()?
>
> > Because stopping all the scheduler threads takes a moment and it is entirely possible that the job finishes within that time.
>
> 1) If you check the fence and find it not signaled, then you will call hw_job_reset(), but there is still a chance that the sched fence signals between your check and the hw_job_reset(), isn't there? You still cannot avoid such a race condition.
>
> > Crap, you're right. We would indeed need to check twice and that wouldn't be consistent anyway.
>
> > Once after stopping the schedulers and before hw_job_reset(), because then we can just start the schedulers again and continue as if nothing has happened.
>
> > And another time after calling hw_job_reset() if we want to set the error code.
>
> Don't forget your approach still has a chance to hit the race condition, and to me the race condition doesn't matter, which is why I don't even consider it.
>
> > Yeah, you convinced me. Please go ahead with the current approach, but at least add a comment that we might want to improve that.
>
> > Regards,
> > Christian.
>
> On 10.05.2017 at 13:02, Liu, Monk wrote:
>
> > Checking a second time is pointless since it can't signal any more after calling amd_sched_hw_job_reset().
>
> [ML] You don't seem to be responding to my point ... I of course know the fence cannot signal after hw_job_reset() ...
>
> My question is: before you call hw_job_reset(), why do you want to check the fence? Why not check twice?
> You still avoid my question: what is the theoretical backing for your view that checking once instead of twice or even more is good **before** hw_job_reset()?
>
> > No, the timeout is pretty meaningless. It's just the trigger that we need to do something.
>
> 2) And even if it signals after entering gpu_reset(), it will automatically be completed like in the normal case, and that's good. Why remove those callbacks instead?
>
> > No, that's absolutely not good. We don't know if it's the hardware which results in the job being signaled or our reset code.
>
> [ML] You are wrong; for the SR-IOV case the timeout is all that matters, because one VF only has its time slice within the timeout, and I'm doing the TDR for SR-IOV, so don't always look at things with a bare-metal mindset.
>
> > Otherwise we have a race condition here where we can't determine if the reset finished the job or if it did just on its own while we stopped the scheduler.
>
> [ML] You are wrong:
>
> 1) If you check the fence and find it not signaled, then you will call hw_job_reset(), but there is still a chance that the sched fence signals between your check and the hw_job_reset(), isn't there? You still cannot avoid such a race condition.
>
> 2) I don't care if the fence is signaled due to its own completion or because we force the hw fence signaled; for the SR-IOV case, as long as the job exceeds the timeout, we consider it hung.
>
> 3) Even for the bare-metal case, you still cannot be sure whether the fence signaled on its own or by the hw_fence force signal; the reason is in #1).
>
> No, you insist on a vague rule, not a strict one. Like I said, what is the theory to back your approach of checking the job in question only once? Why not check again if it is not signaled?
>
> > Because we have removed the connection between the job and the hardware fence and because of this the job can never signal.
>
> [ML] Like I said, before you call hw_job_reset() you only check the job once. Why not check again and again? I don't see that you have a reason to check only once, and if you don't have such a reason I think you should not check at all.
>
> If you have such a reason, prove to me that checking only once is good and enough.
>
> Don't forget your approach still has a chance to hit the race condition, and to me the race condition doesn't matter, which is why I don't even consider it.
>
> BR Monk
>
> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On Behalf Of* Christian König
> *Sent:* Wednesday, May 10, 2017 6:26 PM
> *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com> <mailto:Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> On 10.05.2017 at 12:05, Liu, Monk wrote:
>
> [ML] Yes, but we cannot guarantee the job is 100% really hung when entering gpu_reset(); we can only trust our amdgpu_job_timeout as a deadline for each job.
>
> Your approach of checking the fence first before charging it as guilty/hung looks incorrect to me, because why do you not check it twice, three times, or even more loops?
>
> > Because the job can't signal any more after calling amd_sched_hw_job_reset().
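
For reference, the ordering being argued about above and below — park the scheduler, check the fence once, detach the hardware fences, kick out the guilty job, then recover — could look roughly like the following sketch. It borrows the helper names that appear in the patch at the bottom of this thread (amd_sched_hw_job_reset(), amd_sched_invalidate_job(), amd_sched_job_kickout(), amd_sched_job_recovery()); the wrapper function itself and the fence_is_signaled() check on s_fence->finished are illustrative assumptions, not code from the patch.

/* Sketch only -- not the patch as posted; assumes the amdgpu/scheduler
 * headers of this tree. */
static void amdgpu_ring_reset_sketch(struct amdgpu_job *job)
{
        struct amdgpu_ring *ring = job->ring;

        /* 1. Stop the scheduler so no new jobs reach the hardware. */
        kthread_park(ring->sched.thread);

        /* 2. If the job signaled while we were parking the scheduler,
         *    nothing is hung: just restart and bail out. */
        if (fence_is_signaled(&job->base.s_fence->finished)) {
                kthread_unpark(ring->sched.thread);
                return;
        }

        /* 3. Detach the hardware fences; after this point the job can
         *    no longer signal on its own. */
        amd_sched_hw_job_reset(&ring->sched);

        /* 4. Kick the guilty job out of the mirror list once it has
         *    exceeded the karma threshold. */
        if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
                amd_sched_job_kickout(&job->base);

        /* 5. Reset the hardware (elided here), resubmit the innocent
         *    jobs and restart the scheduler. */
        amd_sched_job_recovery(&ring->sched);
        kthread_unpark(ring->sched.thread);
}
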
> [ML] No ... that's where I think your approach is vague:
>
> 1) See, you check after the scheduler is stopped whether the job signaled. My question is: if the job is not signaled (as in the most usual case), why do you not check it again and again? Maybe the second time you will find it signaled ...
>
> > Checking a second time is pointless since it can't signal any more after calling amd_sched_hw_job_reset().
>
> My point is that the checking here is meaningless; we already have the timeout as the guard.
>
> > No, the timeout is pretty meaningless. It's just the trigger that we need to do something.
>
> > But to determine what to do we first need to stop the scheduler, remove the hardware fence and THEN check the current status.
>
> > Otherwise we have a race condition here where we can't determine if the reset finished the job or if it did just on its own while we stopped the scheduler.
>
> 2) And even if it signals after entering gpu_reset(), it will automatically be completed like in the normal case, and that's good. Why remove those callbacks instead?
>
> > No, that's absolutely not good. We don't know if it's the hardware which results in the job being signaled or our reset code.
>
> So I refuse to check whether @job has just signaled in gpu_reset(), because this action is vague (and no one can guarantee the job won't signal during gpu_reset(); we should not argue about this event ...). I prefer clean and strict rules.
>
> > Yeah, completely agree that we need to have strict rules for that. That's why I insist on doing this :)
>
> No, you insist on a vague rule, not a strict one. Like I said, what is the theory to back your approach of checking the job in question only once? Why not check again if it is not signaled?
>
> > Because we have removed the connection between the job and the hardware fence and because of this the job can never signal.
>
> > Regards,
> > Christian.
>
> I don't agree that this approach is clean and strict. You are abusing the timeout parameter.
>
> > See, I just want to avoid problems for the case that the job signaled while we stop the scheduler (because stopping the scheduler can actually take a moment).
>
> > Because when this happens the scheduler could already have pushed the next job to the hardware, and then we abort it with the GPU reset and might create more problems than we solve.
>
> [ML] I don't see that my approach has a chance to signal the fence twice ... on the contrary, I think my approach is clearer: no matter whether the job in question finally signaled or not, I just kick it out from the mirror list
>
> without removing the callback from the hw fence. That way, even if it really signaled during the gpu_reset() period, the logic is still correct and its sched fence will act as usual ...
>
> > We want to set an error code on the job before signaling it, don't we? So we need to be sure how and when the job is signaled as finished.
>
> > I mean letting it signal when we force the hardware fence to complete will work as well, but I still think that this isn't as clean as signaling it manually.
>
> > Please also see the helper function the Intel guys introduced, drm_fence_set_error(); we will run into a BUG_ON if we can't guarantee the order of execution here.
>
> > Regards,
> > Christian.
>
> On 10.05.2017 at 06:00, Liu, Monk wrote:
>
> Christian,
>
> Looks like we need to discuss this some more ...
>
> Here is your approach:
>
> 1. Stop the scheduler from feeding more jobs to the hardware when a job completes.
> // this is where I agree with you
>
> 2. Then call hw_job_reset to remove the connection between job and hardware fence.
>
> 3. Test if the job is now completed. It is entirely possible that the job completed while we started the work to reset the hardware.
>
> Removing the connection between the hardware fence and the job is the deadline; if the job completes after that it is lost.
>
> 4. Check the karma and kick out the job if we found it guilty.
>
> 5. Get the whole stuff working again, e.g. reset the hardware, restart the scheduler etc...
>
> */[ML]: One thing I agree to change to your way: in gpu_reset() we should first stop the scheduler of the ring in question (not all of them) before kicking out the guilty job./*
>
> > Indeed, but I still think that this is a bad approach cause we then reset the hardware without a good reason.
>
> [ML] Yes, but we cannot guarantee the job is 100% really hung when entering gpu_reset(); we can only trust our amdgpu_job_timeout as a deadline for each job.
>
> Your approach of checking the fence first before charging it as guilty/hung looks incorrect to me, because why do you not check it twice, three times, or even more loops?
>
> You check it one time and you find it just signaled; that's great and lucky (really lucky ...). But what if it didn't signal (as in the most usual case)? Why not check it again and again? Do you have a theory to support how many times you need to check before finally considering it hung? No, I don't think you have one, so please just cut this unnecessary checking; we already use amdgpu_job_timeout to give the deadline of each job.
>
> So I refuse to check whether @job has just signaled in gpu_reset(), because this action is vague (and no one can guarantee the job won't signal during gpu_reset(); we should not argue about this event ...). I prefer clean and strict rules.
>
> > Force completion is not so much of the issue, but rather in which order you do things.
>
> > See, the original code first stops the scheduler and removes the connection between hardware fence and job in an atomic manner. And THEN forces the hardware fence to complete.
>
> > This way we could be sure that nothing happened in parallel, e.g. that we don't try to signal the fence twice or something like that.
>
> [ML] I don't see that my approach has a chance to signal the fence twice ... on the contrary, I think my approach is clearer: no matter whether the job in question finally signaled or not, I just kick it out from the mirror list
>
> without removing the callback from the hw fence. That way, even if it really signaled during the gpu_reset() period, the logic is still correct and its sched fence will act as usual ...
>
> Please point out where or how my approach will go wrong, like "e.g. that we don't try to signal the fence twice or something like that.", otherwise I cannot be persuaded to fix my way ...
>
> Finally, I ran the TDR test and it ends up with a hypervisor-side error; the guest side is all perfect. Here is what I ran:
>
> The vk_example and Vulkan CTS tests under the MCBP case. We get all kinds of hangs on the compute ring; without TDR this test won't survive for 5 seconds, and with TDR, although MCBP is buggy now, we can
>
> finish this test (of course the test results mismatch due to the MCBP issue), and there are tons of job_timed_out messages in dmesg, but the guest driver didn't report any error. I was also surprised that this behaves really stably ...
>
> The second test used "watch"
> to trigger a GPU hang (bad command stream) every 2 seconds, with amdgpu_job_timeout also set to 2 seconds. I can run it until the hypervisor side hits a VF FLR error, and the guest side
>
> still runs perfectly and nothing wrong happens; from the hw fence sequence number I can tell that 3000 loops of TDR finished before the hypervisor hit the error.
>
> BR Monk
>
> *From:* Koenig, Christian
> *Sent:* Tuesday, May 09, 2017 8:52 PM
> *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>; Christian König <deathsimple at vodafone.de> <mailto:deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> [ML] If the job completes, the job's sched fence callback will take this spin_lock and remove itself from the mirror_list, so we are still safe to call amd_sched_job_kickout(), and it will do nothing in that case.
>
> > Indeed, but I still think that this is a bad approach cause we then reset the hardware without a good reason.
>
> Besides, the original logic also force-completes the hw fence, and it works well ...
>
> > Force completion is not so much of the issue, but rather in which order you do things.
>
> > See, the original code first stops the scheduler and removes the connection between hardware fence and job in an atomic manner. And THEN forces the hardware fence to complete.
>
> > This way we could be sure that nothing happened in parallel, e.g. that we don't try to signal the fence twice or something like that.
>
> A statement like "You are missing that it is entirely possible that the job will complete while we are trying to kick it out."
>
> > Sorry, I should have been more clear.
>
> is not a good reason to reject my approach, because it is okay if the job just completed ...
>
> > We usually try to take a defensive approach, so stopping everything, removing the hardware fence connection and then explicitly kicking out the job in question sounds much better than doing it implicitly with the hardware fence completion.
>
> > Even when this works (which I'm still not sure of) that is a really awkward and hard to understand approach.
>
> > Regards,
> > Christian.
>
> On 09.05.2017 at 13:58, Liu, Monk wrote:
>
> > You are missing that it is entirely possible that the job will complete while we are trying to kick it out.
>
> [ML] If the job completes, the job's sched fence callback will take this spin_lock and remove itself from the mirror_list, so we are still safe to call amd_sched_job_kickout(), and it will do nothing in that case.
>
> Please go through the whole sequence of steps again.
>
> Besides, the original logic also force-completes the hw fence, and it works well ...
>
> I don't see a solid reason why you insist on your approach; please go through the steps again and give me the details about where it is incorrect so that I can fix it.
>
> A statement like "You are missing that it is entirely possible that the job will complete while we are trying to kick it out."
> is not a good reason to reject my approach, because it is okay if the job just completed ...
>
> BR Monk
>
> *From:* Koenig, Christian
> *Sent:* Tuesday, May 09, 2017 3:49 PM
> *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>; Christian König <deathsimple at vodafone.de> <mailto:deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> [ML] Really not necessary, we have a spin_lock to protect the mirror-list, nothing will be messed up ...
>
> > You are missing that it is entirely possible that the job will complete while we are trying to kick it out.
>
> [ML] Why not touch the hardware fence at all? The original/bare-metal GPU reset also signals all rings' hardware fences first; I just follow the original logic ...
> The scheduler fence will be auto-signaled after the hw fence signals. Any problem with that? What's the concern?
>
> > The hardware runs async to the CPU which tries to reset it, so we need to be careful in which order things are done.
>
> [ML] No, I don't think so. The kickout must be prior to the hw_job_reset, otherwise the scheduler fence callback will be removed and you need to manually install it later, which is not correct:
> For the guilty job, we just kick it out before the job reset. In job_reset we only reset the other, innocent jobs (and unbind the scheduler fence callback for them). After the hw fence
> is forcibly set to drv_seq, all hw fences are signaled (this is the way of the original logic, I didn't change that). When it gets to sched_recovery(), it will recover all innocent jobs and hook
> the scheduler fence up with a new hw fence. That way only the guilty job is dropped forever.
>
> > Again, same problem here.
>
> > To be absolutely sure that everything works as expected we need to do it in the following order:
>
> > 1. Stop the scheduler from feeding more jobs to the hardware when a job completes.
>
> > 2. Then call hw_job_reset to remove the connection between job and hardware fence.
>
> > 3. Test if the job is now completed. It is entirely possible that the job completed while we started the work to reset the hardware.
>
> > Removing the connection between the hardware fence and the job is the deadline; if the job completes after that it is lost.
>
> > 4. Check the karma and kick out the job if we found it guilty.
>
> > 5. Get the whole stuff working again, e.g. reset the hardware, restart the scheduler etc...
>
> > Regards,
> > Christian.
>
> On 09.05.2017 at 04:45, Liu, Monk wrote:
>
> > - /* block scheduler */ - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { - ring = adev->rings[i]; + /* we start from the ring trigger GPU hang */ + j = job ? job->ring->idx : 0; + + if (job) + if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit)) + amd_sched_job_kickout(&job->base);
>
> > Well, that looks like the wrong order to me. We should probably stop the scheduler before trying to mess with anything in the job.
>
> [ML] Really not necessary, we have a spin_lock to protect the mirror-list, nothing will be messed up ...
>
> > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring +*ring) { + if (ring) + amdgpu_fence_write(ring, ring->fence_drv.sync_seq); } +
>
> > The coding style is completely off.
> > [ML] I don't know why at email side it looks wrong > coding style, but I'm sure it is correct at my > side, check here: > void > amdgpu_fence_driver_force_completion_ring(struct > amdgpu_ring *ring) > { > if (ring) > amdgpu_fence_write(ring, > ring->fence_drv.sync_seq); > } > > Additional to that I don't think that this is a > good idea. We should probably rather just signal > all scheduler fences instead and don't touch the > hardware fence at all. > > [ML] why don't touch hardware fence at all ? the > original/bare-metal gpu reset also signal all > ring's hardware fence first, I just follow the > original logic ... > Scheduler fence will be auto signaled after hw > fence signaled, any problem with that ? what's the > concern ? > > > > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { > > - struct amdgpu_ring *ring = > adev->rings[i]; > > + for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) { > > + ring = adev->rings[i % > AMDGPU_MAX_RINGS]; > > if (!ring || !ring->sched.thread) > > continue; > > > > + if (job && j != i) { > > + kthread_unpark(ring->sched.thread); > > + continue; > > + } > > + > > Please split up that patch a bit further. E.g. > first the handling to only hw_job_reset the ring > in question, then the kickout handling. > > [ML] No I don't think so, the kickout must be > prior to the hw_job_reset, otherwise the scheduler > fence callback will be removed and you need > manually install it later , which is not correct: > For the guity job, we just kick it out before job > reset, in job_reset we only reset other innocent > jobs( and unbind the scheduler fence callback for > them), after hw fence > Forcely set to drv_seq, all hw fence are signaled > (this is the way of original logic, I didn't > change that). When go to sched_recovery(), it > will recover all innocent job and hook > The scheduler fence with new hw fence. That way > only the guilty job is dropped forever. > > -----Original Message----- > From: Christian König > [mailto:deathsimple at vodafone.de] > Sent: Monday, May 08, 2017 9:12 PM > To: Liu, Monk <Monk.Liu at amd.com> > <mailto:Monk.Liu at amd.com>; > amd-gfx at lists.freedesktop.org > <mailto:amd-gfx at lists.freedesktop.org> > Cc: Koenig, Christian <Christian.Koenig at amd.com> > <mailto:Christian.Koenig at amd.com> > Subject: Re: [PATCH 4/4] > drm/amdgpu/SRIOV:implement guilty job TDR (V2) > > Am 08.05.2017 um 09:01 schrieb Liu, Monk: > > @Christian > > > > This one is changed to guilty job scheme > accordingly with your > > response > > > > BR Monk > > > > -----Original Message----- > > From: Monk Liu [mailto:Monk.Liu at amd.com] > > Sent: Monday, May 08, 2017 3:00 PM > > To: amd-gfx at lists.freedesktop.org > <mailto:amd-gfx at lists.freedesktop.org> > > Cc: Liu, Monk <Monk.Liu at amd.com> > <mailto:Monk.Liu at amd.com> > > Subject: [PATCH] drm/amdgpu/SRIOV:implement > guilty job TDR (V2) > > > > 1,TDR will kickout guilty job if it hang exceed > the threshold of the given one from kernel > paramter "job_hang_limit", that way a bad command > stream will not infinitly cause GPU hang. > > > > by default this threshold is 1 so a job will be > kicked out after it hang. > > > > 2,if a job timeout TDR routine will not reset > all sched/ring, instead if will only reset on the > givn one which is indicated by @job of > amdgpu_sriov_gpu_reset, that way we don't need to > reset and recover each sched/ring if we already > know which job cause GPU hang. > > > > 3,unblock sriov_gpu_reset for AI family. 
> > 4,don't init entity for KIQ > > > > TODO: > > when a job is considered as guilty, we should > mark some flag in its fence status flag, and let > UMD side aware that this fence signaling is not > due to job complete but job hang. > > > > Change-Id: I7b89c19a3de93249db570d0a80522176b1525a09 > > Signed-off-by: Monk Liu <Monk.Liu at amd.com> > <mailto:Monk.Liu at amd.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + > > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | > 4 +++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | > 36 ++++++++++++++++++++------- > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | > 4 +++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | > 6 +++++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 + > > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | > 11 +++++++- > drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 7 > ++++++ > > 8 files changed, 60 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > index 90a69bf..93bcea2 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > > @@ -111,6 +111,7 @@ extern int > amdgpu_prim_buf_per_se; extern int > > amdgpu_pos_buf_per_se; extern int > amdgpu_cntl_sb_buf_per_se; extern > > int amdgpu_param_buf_per_se; > > +extern int amdgpu_job_hang_limit; > > > > #define AMDGPU_DEFAULT_GTT_SIZE_MB > 3072ULL /* 3GB by default */ > > #define > AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000 > > diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > index b4bbbb3..23afc58 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c > > @@ -52,6 +52,10 @@ static int > amdgpu_ctx_init(struct amdgpu_device *adev, struct > amdgpu_ctx *ctx) > > struct amd_sched_rq *rq; > > > > rq = > &ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL]; > > + > > + if (ring == &adev->gfx.kiq.ring) > > + continue; > > + > > That looks like a bug fix and should probably go > into a separate patch. > > > r = > amd_sched_entity_init(&ring->sched, > &ctx->rings[i].entity, > > rq, amdgpu_sched_jobs); > > if (r) > > diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 0e5f314..f3990fe 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -2537,7 +2537,7 @@ static int > amdgpu_recover_vram_from_shadow(struct > amdgpu_device *adev, > > */ > > int amdgpu_sriov_gpu_reset(struct > amdgpu_device *adev, struct amdgpu_job *job) { > > - int i, r = 0; > > + int i, j, r = 0; > > int resched; > > struct amdgpu_bo *bo, *tmp; > > struct amdgpu_ring *ring; > > @@ -2550,19 +2550,30 @@ int > amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, > struct amdgpu_job *job) > > /* block TTM */ > > resched = > ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); > > > > - /* block scheduler */ > > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { > > - ring = adev->rings[i]; > > + /* we start from the ring trigger GPU hang */ > > + j = job ? job->ring->idx : 0; > > + > > + if (job) > > + if > (amd_sched_invalidate_job(&job->base, > amdgpu_job_hang_limit)) > > + amd_sched_job_kickout(&job->base); > > Well that looks like the wrong order to me. We > should probably stop the scheduler before trying > to mess anything with the job. 
> > > > > + /* block scheduler */ > > + for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) { > > + ring = adev->rings[i % > AMDGPU_MAX_RINGS]; > > if (!ring || !ring->sched.thread) > > continue; > > > > kthread_park(ring->sched.thread); > > + > > + if (job && j != i) > > + continue; > > + > > + /* only do job_reset on the hang > ring if @job not NULL */ > > amd_sched_hw_job_reset(&ring->sched); > > - } > > > > - /* after all hw jobs are reset, hw fence > is meaningless, so force_completion */ > > - amdgpu_fence_driver_force_completion(adev); > > + /* after all hw jobs are reset, hw > fence is meaningless, so force_completion */ > > + amdgpu_fence_driver_force_completion_ring(ring); > > + } > > > > /* request to take full control of GPU > before re-initialization */ > > if (job) > > @@ -2615,11 +2626,16 @@ int > amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, > struct amdgpu_job *job) > > } > > fence_put(fence); > > > > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { > > - struct amdgpu_ring *ring = > adev->rings[i]; > > + for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) { > > + ring = adev->rings[i % > AMDGPU_MAX_RINGS]; > > if (!ring || !ring->sched.thread) > > continue; > > > > + if (job && j != i) { > > + kthread_unpark(ring->sched.thread); > > + continue; > > + } > > + > > Please split up that patch a bit further. E.g. > first the handling to only hw_job_reset the ring > in question, then the kickout handling. > > > amd_sched_job_recovery(&ring->sched); > > kthread_unpark(ring->sched.thread); > > } > > @@ -2629,6 +2645,8 @@ int > amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, > struct amdgpu_job *job) > > if (r) { > > /* bad news, how to tell it to > userspace ? */ > > dev_info(adev->dev, "GPU reset > failed\n"); > > + } else { > > + dev_info(adev->dev, "GPU reset > successed!\n"); > > } > > > > adev->gfx.in_reset = false; > > diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > index 416908a..fd3691a8 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > @@ -112,6 +112,7 @@ int amdgpu_prim_buf_per_se = > 0; int > > amdgpu_pos_buf_per_se = 0; int > amdgpu_cntl_sb_buf_per_se = 0; int > > amdgpu_param_buf_per_se = 0; > > +int amdgpu_job_hang_limit = 0; > > > > MODULE_PARM_DESC(vramlimit, "Restrict VRAM for > testing, in > > megabytes"); module_param_named(vramlimit, > amdgpu_vram_limit, int, > > 0600); @@ -237,6 +238,9 @@ > module_param_named(cntl_sb_buf_per_se, > > amdgpu_cntl_sb_buf_per_se, int, 0444); > > MODULE_PARM_DESC(param_buf_per_se, "the size of > Off-Chip Pramater > > Cache per Shader Engine (default depending on > gfx)"); > > module_param_named(param_buf_per_se, > amdgpu_param_buf_per_se, int, > > 0444); > > > > +MODULE_PARM_DESC(job_hang_limit, "how much time > allow a job hang and > > +not drop it (default 0)"); > module_param_named(job_hang_limit, > > +amdgpu_job_hang_limit, int ,0444); > > + > > > > static const struct pci_device_id pciidlist[] > = { #ifdef > > CONFIG_DRM_AMDGPU_SI diff --git > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > index d7523d1..8de3bd3 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > > @@ -541,6 +541,12 @@ void > amdgpu_fence_driver_force_completion(struct > amdgpu_device *adev) > > } > > } > > > > +void > amdgpu_fence_driver_force_completion_ring(struct > amdgpu_ring > > +*ring) { > > + if (ring) > > + amdgpu_fence_write(ring, > 
ring->fence_drv.sync_seq); } > > + > > The coding style is completely off. > > Additional to that I don't think that this is a > good idea. We should probably rather just signal > all scheduler fences instead and don't touch the > hardware fence at all. > > > /* > > * Common fence implementation > > */ > > diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > index 981ef08..03e88c6 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > @@ -76,6 +76,7 @@ struct amdgpu_fence_driver { int > > amdgpu_fence_driver_init(struct amdgpu_device > *adev); void > > amdgpu_fence_driver_fini(struct amdgpu_device > *adev); void > > amdgpu_fence_driver_force_completion(struct > amdgpu_device *adev); > > +void > amdgpu_fence_driver_force_completion_ring(struct > amdgpu_ring > > +*ring); > > > > int amdgpu_fence_driver_init_ring(struct > amdgpu_ring *ring, > > unsigned > num_hw_submission); > > diff --git > a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > > index 6f4e31f..4e97e6d 100644 > > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c > > @@ -390,9 +390,18 @@ void > amd_sched_hw_job_reset(struct amd_gpu_scheduler > *sched) > > &s_job->s_fence->cb)) { > > fence_put(s_job->s_fence->parent); > > s_job->s_fence->parent = NULL; > > + atomic_dec(&sched->hw_rq_count); > > } > > } > > - atomic_set(&sched->hw_rq_count, 0); > > + spin_unlock(&sched->job_list_lock); > > +} > > + > > +void amd_sched_job_kickout(struct amd_sched_job > *s_job) { > > + struct amd_gpu_scheduler *sched = > s_job->sched; > > + > > + spin_lock(&sched->job_list_lock); > > + list_del_init(&s_job->node); > > spin_unlock(&sched->job_list_lock); > > } > > > > diff --git > a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h > > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h > > index 8cb41d3..59694f3 100644 > > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h > > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h > > @@ -81,6 +81,7 @@ struct amd_sched_job { > > struct list_head node; > > struct delayed_work work_tdr; > > uint64_t id; > > + atomic_t karma; > > }; > > > > extern const struct fence_ops > amd_sched_fence_ops_scheduled; @@ -96,6 +97,11 @@ > static inline struct amd_sched_fence > *to_amd_sched_fence(struct fence *f) > > return NULL; > > } > > > > +static inline bool > amd_sched_invalidate_job(struct amd_sched_job > > +*s_job, int threshold) { > > + return (s_job && > atomic_inc_return(&s_job->karma) > threshold); } > > + > > Again coding style is completely off. > > Christian. 
> > > /** > > * Define the backend operations called by the > scheduler, > > * these functions should be implemented in > driver side @@ -158,4 +164,5 @@ int > amd_sched_job_init(struct amd_sched_job *job, > > void *owner); > > void amd_sched_hw_job_reset(struct > amd_gpu_scheduler *sched); void > > amd_sched_job_recovery(struct amd_gpu_scheduler > *sched); > > +void amd_sched_job_kickout(struct amd_sched_job > *s_job); > > #endif > > -- > > 2.7.4
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
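
On the TODO in the commit message (marking the guilty job's fence so user space can tell a forced completion from a normal one) and Christian's drm_fence_set_error() remark: with the mainline helper, now spelled dma_fence_set_error(), the error has to be attached before the fence is signaled. A hedged sketch, assuming mainline dma_fence naming rather than the older fence_* naming this tree still uses, with -ECANCELED as an illustrative error code:

/* Sketch only: flag the guilty job's finished fence before anything
 * signals it, otherwise the helper's sanity check on an already
 * signaled fence fires. Helper names amd_sched_invalidate_job() and
 * amd_sched_job_kickout() are taken from the patch above. */
if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit)) {
        dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
        amd_sched_job_kickout(&job->base);
}
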