[AMD Official Use Only - Internal Distribution Only] Hi, Andrey, >>how u handle non guilty singnaled jobs in drm_sched_stop, currently looks like you don't call put for them and just explicitly free them as before Good point, I missed that place. Will cover that in my next patch. >>Also sched->free_guilty seems useless with the new approach. Yes, I agree. >>Do we even need the cleanup mechanism at drm_sched_get_cleanup_job with this approach... I am not quite sure about that for now, let me think about this topic today. Hi, Christian, should I add a fence and get/put to that fence rather than using an explicit refcount? And another concerns? Thanks, Jack -----Original Message----- From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> Sent: Friday, March 26, 2021 12:32 AM To: Zhang, Jack (Jian) <Jack.Zhang1@xxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig, Christian <Christian.Koenig@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>; Rob Herring <robh@xxxxxxxxxx>; Tomeu Vizoso <tomeu.vizoso@xxxxxxxxxxxxx>; Steven Price <steven.price@xxxxxxx> Subject: Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid memleak There are a few issues here like - how u handle non guilty singnaled jobs in drm_sched_stop, currently looks like you don't call put for them and just explicitly free them as before. Also sched->free_guilty seems useless with the new approach. Do we even need the cleanup mechanism at drm_sched_get_cleanup_job with this approach... But first - We need Christian to express his opinion on this since I think he opposed refcounting jobs and that we should concentrate on fences instead. Christian - can you chime in here ? Andrey On 2021-03-25 5:51 a.m., Zhang, Jack (Jian) wrote: > [AMD Official Use Only - Internal Distribution Only] > > > Hi, Andrey > > Thank you for your good opinions. > > I literally agree with you that the refcount could solve the > get_clean_up_up cocurrent job gracefully, and no need to re-insert the > > job back anymore. > > I quickly made a draft for this idea as follows: > > How do you like it? I will start implement to it after I got your > acknowledge. > > Thanks, > > Jack > > +void drm_job_get(struct drm_sched_job *s_job) > > +{ > > + kref_get(&s_job->refcount); > > +} > > + > > +void drm_job_do_release(struct kref *ref) > > +{ > > + struct drm_sched_job *s_job; > > + struct drm_gpu_scheduler *sched; > > + > > + s_job = container_of(ref, struct drm_sched_job, refcount); > > + sched = s_job->sched; > > + sched->ops->free_job(s_job); > > +} > > + > > +void drm_job_put(struct drm_sched_job *s_job) > > +{ > > + kref_put(&s_job->refcount, drm_job_do_release); > > +} > > + > > static void drm_sched_job_begin(struct drm_sched_job *s_job) > > { > > struct drm_gpu_scheduler *sched = s_job->sched; > > + kref_init(&s_job->refcount); > > + drm_job_get(s_job); > > spin_lock(&sched->job_list_lock); > > list_add_tail(&s_job->node, &sched->ring_mirror_list); > > drm_sched_start_timeout(sched); > > @@ -294,17 +316,16 @@ static void drm_sched_job_timedout(struct > work_struct *work) > > * drm_sched_cleanup_jobs. It will be reinserted back > after sched->thread > > * is parked at which point it's safe. 
> > */ > > - list_del_init(&job->node); > > + drm_job_get(job); > > spin_unlock(&sched->job_list_lock); > > job->sched->ops->timedout_job(job); > > - > > + drm_job_put(job); > > /* > > * Guilty job did complete and hence needs to be > manually removed > > * See drm_sched_stop doc. > > */ > > if (sched->free_guilty) { > > - job->sched->ops->free_job(job); > > sched->free_guilty = false; > > } > > } else { > > @@ -355,20 +376,6 @@ void drm_sched_stop(struct drm_gpu_scheduler > *sched, struct drm_sched_job *bad) > > - /* > > - * Reinsert back the bad job here - now it's safe as > > - * drm_sched_get_cleanup_job cannot race against us and > release the > > - * bad job at this point - we parked (waited for) any in > progress > > - * (earlier) cleanups and drm_sched_get_cleanup_job will not > be called > > - * now until the scheduler thread is unparked. > > - */ > > - if (bad && bad->sched == sched) > > - /* > > - * Add at the head of the queue to reflect it was the > earliest > > - * job extracted. > > - */ > > - list_add(&bad->node, &sched->ring_mirror_list); > > - > > /* > > * Iterate the job list from later to earlier one and either > deactive > > * their HW callbacks or remove them from mirror list if they > already > > @@ -774,7 +781,7 @@ static int drm_sched_main(void *param) > > kthread_should_stop()); > > if (cleanup_job) { > > - sched->ops->free_job(cleanup_job); > > + drm_job_put(cleanup_job); > > /* queue timeout for next job */ > > drm_sched_start_timeout(sched); > > } > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > index 5a1f068af1c2..b80513eec90f 100644 > > --- a/include/drm/gpu_scheduler.h > > +++ b/include/drm/gpu_scheduler.h > > @@ -188,6 +188,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct > dma_fence *f); > > * to schedule the job. > > */ > > struct drm_sched_job { > > + struct kref refcount; > > struct spsc_node queue_node; > > struct drm_gpu_scheduler *sched; > > struct drm_sched_fence *s_fence; > > @@ -198,6 +199,7 @@ struct drm_sched_job { > > enum drm_sched_priority s_priority; > > struct drm_sched_entity *entity; > > struct dma_fence_cb cb; > > + > > }; > > *From:* Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> > *Sent:* Friday, March 19, 2021 12:17 AM > *To:* Zhang, Jack (Jian) <Jack.Zhang1@xxxxxxx>; Christian König > <ckoenig.leichtzumerken@xxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig, Christian > <Christian.Koenig@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Deng, Emily > <Emily.Deng@xxxxxxx>; Rob Herring <robh@xxxxxxxxxx>; Tomeu Vizoso > <tomeu.vizoso@xxxxxxxxxxxxx>; Steven Price <steven.price@xxxxxxx> > *Subject:* Re: [PATCH v3] drm/scheduler re-insert Bailing job to avoid > memleak > > On 2021-03-18 6:41 a.m., Zhang, Jack (Jian) wrote: > > [AMD Official Use Only - Internal Distribution Only] > > Hi, Andrey > > Let me summarize the background of this patch: > > In TDR resubmit step “amdgpu_device_recheck_guilty_jobs, > > It will submit first jobs of each ring and do guilty job re-check. > > At that point, We had to make sure each job is in the mirror list(or > re-inserted back already). > > But we found the current code never re-insert the job to mirror list > in the 2^nd , 3^rd job_timeout thread(Bailing TDR thread). > > This not only will cause memleak of the bailing jobs. What’s more > important, the 1^st tdr thread can never iterate the bailing job and > set its guilty status to a correct status. > > Therefore, we had to re-insert the job(or even not delete node) for > bailing job. 
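[Editor's sketch] Purely as an illustration of the lifetime semantics the drm_job_get()/drm_job_put() draft above is aiming for (this is not the kernel patch itself), here is a minimal self-contained userspace model. The mock_job names are invented; the real draft uses struct kref and the scheduler's free_job callback. The point it demonstrates: if the timeout handler takes a reference before dropping the job list lock, a concurrent cleanup dropping the list reference can no longer free the job out from under it.

/* Illustrative userspace model of the proposed job refcount lifetime.
 * All identifiers (mock_job, mock_job_get, ...) are hypothetical. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct mock_job {
	atomic_int refcount;	/* models the struct kref in the draft */
	int id;			/* stand-in for the rest of the job */
};

static struct mock_job *mock_job_create(int id)
{
	struct mock_job *job = malloc(sizeof(*job));

	atomic_init(&job->refcount, 1);	/* mirror-list reference, like kref_init() */
	job->id = id;
	return job;
}

static void mock_job_get(struct mock_job *job)
{
	atomic_fetch_add(&job->refcount, 1);	/* like kref_get() */
}

static void mock_job_put(struct mock_job *job)
{
	/* like kref_put(): only the last reference dropped frees the job */
	if (atomic_fetch_sub(&job->refcount, 1) == 1) {
		printf("job %d freed\n", job->id);
		free(job);
	}
}

int main(void)
{
	struct mock_job *job = mock_job_create(42);

	/* Timeout handler: take a reference before dropping the list lock. */
	mock_job_get(job);

	/* Concurrent cleanup drops the list reference; the job survives
	 * because the timeout handler still holds one. */
	mock_job_put(job);
	printf("job %d still safe to dereference in the timeout handler\n",
	       job->id);

	/* Timeout handler finishes and drops its reference -> now freed. */
	mock_job_put(job);
	return 0;
}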
> > For the above V3 patch, the racing condition in my mind is: > > we cannot make sure all bailing jobs are finished before we do > amdgpu_device_recheck_guilty_jobs. > > Yes,that race i missed - so you say that for 2nd, baling thread who > extracted the job, even if he reinsert it right away back after driver > callback return DRM_GPU_SCHED_STAT_BAILING, there is small time slot > where the job is not in mirror list and so the 1st TDR might miss it > and not find that 2nd job is the actual guilty job, right ? But, > still this job will get back into mirror list, and since it's really > the bad job, it will never signal completion and so on the next > timeout cycle it will be caught (of course there is a starvation > scenario here if more TDRs kick in and it bails out again but this is really unlikely). > > Based on this insight, I think we have two options to solve this issue: > > 1. Skip delete node in tdr thread2, thread3, 4 … (using mutex or > atomic variable) > 2. Re-insert back bailing job, and meanwhile use semaphore in each > tdr thread to keep the sequence as expected and ensure each job > is in the mirror list when do resubmit step. > > For Option1, logic is simpler and we need only one global atomic > variable: > > What do you think about this plan? > > Option1 should look like the following logic: > > +static atomic_t in_reset; //a global atomic var for > synchronization > > static void drm_sched_process_job(struct dma_fence *f, struct > dma_fence_cb *cb); > > /** > > @@ -295,6 +296,12 @@ static void drm_sched_job_timedout(struct > work_struct *work) > > * drm_sched_cleanup_jobs. It will be reinserted > back after sched->thread > > * is parked at which point it's safe. > > */ > > + if (atomic_cmpxchg(&in_reset, 0, 1) != 0) { //skip > delete node if it’s thead1,2,3,…. > > + spin_unlock(&sched->job_list_lock); > > + drm_sched_start_timeout(sched); > > + return; > > + } > > + > > list_del_init(&job->node); > > spin_unlock(&sched->job_list_lock); > > @@ -320,6 +327,7 @@ static void drm_sched_job_timedout(struct > work_struct *work) > > spin_lock(&sched->job_list_lock); > > drm_sched_start_timeout(sched); > > spin_unlock(&sched->job_list_lock); > > + atomic_set(&in_reset, 0); //reset in_reset when the first > thread finished tdr > > } > > Technically looks like it should work as you don't access the job > pointer any longer and so no risk that if signaled it will be freed by > drm_sched_get_cleanup_job but,you can't just use one global variable > an by this bailing from TDR when different drivers run their TDR > threads in parallel, and even for amdgpu, if devices in different XGMI > hives or 2 independent devices in non XGMI setup. There should be > defined some kind of GPU reset group structure on drm_scheduler level > for which this variable would be used. > > P.S I wonder why we can't just ref-count the job so that even if > drm_sched_get_cleanup_job would delete it before we had a chance to > stop the scheduler thread, we wouldn't crash. This would avoid all the > dance with deletion and reinsertion. 
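[Editor's sketch] To make the "reset group structure on drm_scheduler level" suggestion above concrete, here is a rough kernel-style sketch of a per-domain guard in place of the single global atomic. Everything in it is hypothetical: the structure name, where it would be embedded (per device, per XGMI hive), and the sched->reset_domain pointer are for illustration only and do not exist in the code being discussed.

#include <linux/atomic.h>

/* Hypothetical: one guard object shared by every scheduler that gets
 * reset together (all rings of one device, or one XGMI hive). */
struct drm_sched_reset_domain {
	atomic_t in_reset;	/* 0 = idle, 1 = a TDR thread owns the reset */
};

/* The first timed-out job handler to get here wins and performs the
 * reset; later (bailing) handlers see the guard taken and back off. */
static inline bool
drm_sched_reset_domain_try_enter(struct drm_sched_reset_domain *d)
{
	return atomic_cmpxchg(&d->in_reset, 0, 1) == 0;
}

static inline void
drm_sched_reset_domain_leave(struct drm_sched_reset_domain *d)
{
	atomic_set(&d->in_reset, 0);
}

/*
 * In drm_sched_job_timedout() the check from the Option1 draft above
 * would then become (assuming a hypothetical sched->reset_domain):
 *
 *	if (!drm_sched_reset_domain_try_enter(sched->reset_domain)) {
 *		spin_unlock(&sched->job_list_lock);
 *		drm_sched_start_timeout(sched);
 *		return;
 *	}
 */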
> > Andrey > > Thanks, > > Jack > > *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> > <mailto:amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> *On Behalf Of *Zhang, > Jack (Jian) > *Sent:* Wednesday, March 17, 2021 11:11 PM > *To:* Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > <mailto:ckoenig.leichtzumerken@xxxxxxxxx>; > dri-devel@xxxxxxxxxxxxxxxxxxxxx > <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Koenig, Christian > <Christian.Koenig@xxxxxxx> <mailto:Christian.Koenig@xxxxxxx>; Liu, > Monk <Monk.Liu@xxxxxxx> <mailto:Monk.Liu@xxxxxxx>; Deng, Emily > <Emily.Deng@xxxxxxx> <mailto:Emily.Deng@xxxxxxx>; Rob Herring > <robh@xxxxxxxxxx> <mailto:robh@xxxxxxxxxx>; Tomeu Vizoso > <tomeu.vizoso@xxxxxxxxxxxxx> <mailto:tomeu.vizoso@xxxxxxxxxxxxx>; > Steven Price <steven.price@xxxxxxx> <mailto:steven.price@xxxxxxx>; > Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> > <mailto:Andrey.Grodzovsky@xxxxxxx> > *Subject:* Re: [PATCH v3] drm/scheduler re-insert Bailing job to > avoid memleak > > [AMD Official Use Only - Internal Distribution Only] > > [AMD Official Use Only - Internal Distribution Only] > > Hi,Andrey, > > Good catch,I will expore this corner case and give feedback soon~ > > Best, > > Jack > > > ---------------------------------------------------------------------- > -- > > *From:*Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx > <mailto:Andrey.Grodzovsky@xxxxxxx>> > *Sent:* Wednesday, March 17, 2021 10:50:59 PM > *To:* Christian König <ckoenig.leichtzumerken@xxxxxxxxx > <mailto:ckoenig.leichtzumerken@xxxxxxxxx>>; Zhang, Jack (Jian) > <Jack.Zhang1@xxxxxxx <mailto:Jack.Zhang1@xxxxxxx>>; > dri-devel@xxxxxxxxxxxxxxxxxxxxx > <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx> > <dri-devel@xxxxxxxxxxxxxxxxxxxxx > <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx> > <amd-gfx@xxxxxxxxxxxxxxxxxxxxx > <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx>>; Koenig, Christian > <Christian.Koenig@xxxxxxx <mailto:Christian.Koenig@xxxxxxx>>; Liu, > Monk <Monk.Liu@xxxxxxx <mailto:Monk.Liu@xxxxxxx>>; Deng, Emily > <Emily.Deng@xxxxxxx <mailto:Emily.Deng@xxxxxxx>>; Rob Herring > <robh@xxxxxxxxxx <mailto:robh@xxxxxxxxxx>>; Tomeu Vizoso > <tomeu.vizoso@xxxxxxxxxxxxx <mailto:tomeu.vizoso@xxxxxxxxxxxxx>>; > Steven Price <steven.price@xxxxxxx <mailto:steven.price@xxxxxxx>> > *Subject:* Re: [PATCH v3] drm/scheduler re-insert Bailing job to > avoid memleak > > I actually have a race condition concern here - see bellow - > > On 2021-03-17 3:43 a.m., Christian König wrote: > > I was hoping Andrey would take a look since I'm really busy with > other > > work right now. > > > > Regards, > > Christian. > > > > Am 17.03.21 um 07:46 schrieb Zhang, Jack (Jian): > >> Hi, Andrey/Crhistian and Team, > >> > >> I didn't receive the reviewer's message from maintainers on > panfrost > >> driver for several days. > >> Due to this patch is urgent for my current working project. > >> Would you please help to give some review ideas? 
> >> > >> Many Thanks, > >> Jack > >> -----Original Message----- > >> From: Zhang, Jack (Jian) > >> Sent: Tuesday, March 16, 2021 3:20 PM > >> To: dri-devel@xxxxxxxxxxxxxxxxxxxxx > <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; > >> Koenig, Christian <Christian.Koenig@xxxxxxx > <mailto:Christian.Koenig@xxxxxxx>>; Grodzovsky, Andrey > >> <Andrey.Grodzovsky@xxxxxxx <mailto:Andrey.Grodzovsky@xxxxxxx>>; > Liu, Monk <Monk.Liu@xxxxxxx <mailto:Monk.Liu@xxxxxxx>>; Deng, > >> Emily <Emily.Deng@xxxxxxx <mailto:Emily.Deng@xxxxxxx>>; Rob > Herring <robh@xxxxxxxxxx <mailto:robh@xxxxxxxxxx>>; Tomeu > >> Vizoso <tomeu.vizoso@xxxxxxxxxxxxx > <mailto:tomeu.vizoso@xxxxxxxxxxxxx>>; Steven Price > <steven.price@xxxxxxx <mailto:steven.price@xxxxxxx>> > >> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to > avoid > >> memleak > >> > >> [AMD Public Use] > >> > >> Ping > >> > >> -----Original Message----- > >> From: Zhang, Jack (Jian) > >> Sent: Monday, March 15, 2021 1:24 PM > >> To: Jack Zhang <Jack.Zhang1@xxxxxxx <mailto:Jack.Zhang1@xxxxxxx>>; > >> dri-devel@xxxxxxxxxxxxxxxxxxxxx > <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; > >> Koenig, Christian <Christian.Koenig@xxxxxxx > <mailto:Christian.Koenig@xxxxxxx>>; Grodzovsky, Andrey > >> <Andrey.Grodzovsky@xxxxxxx <mailto:Andrey.Grodzovsky@xxxxxxx>>; > Liu, Monk <Monk.Liu@xxxxxxx <mailto:Monk.Liu@xxxxxxx>>; Deng, > >> Emily <Emily.Deng@xxxxxxx <mailto:Emily.Deng@xxxxxxx>>; Rob > Herring <robh@xxxxxxxxxx <mailto:robh@xxxxxxxxxx>>; Tomeu > >> Vizoso <tomeu.vizoso@xxxxxxxxxxxxx > <mailto:tomeu.vizoso@xxxxxxxxxxxxx>>; Steven Price > <steven.price@xxxxxxx <mailto:steven.price@xxxxxxx>> > >> Subject: RE: [PATCH v3] drm/scheduler re-insert Bailing job to > avoid > >> memleak > >> > >> [AMD Public Use] > >> > >> Hi, Rob/Tomeu/Steven, > >> > >> Would you please help to review this patch for panfrost driver? > >> > >> Thanks, > >> Jack Zhang > >> > >> -----Original Message----- > >> From: Jack Zhang <Jack.Zhang1@xxxxxxx <mailto:Jack.Zhang1@xxxxxxx>> > >> Sent: Monday, March 15, 2021 1:21 PM > >> To: dri-devel@xxxxxxxxxxxxxxxxxxxxx > <mailto:dri-devel@xxxxxxxxxxxxxxxxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; > >> Koenig, Christian <Christian.Koenig@xxxxxxx > <mailto:Christian.Koenig@xxxxxxx>>; Grodzovsky, Andrey > >> <Andrey.Grodzovsky@xxxxxxx <mailto:Andrey.Grodzovsky@xxxxxxx>>; > Liu, Monk <Monk.Liu@xxxxxxx <mailto:Monk.Liu@xxxxxxx>>; Deng, > >> Emily <Emily.Deng@xxxxxxx <mailto:Emily.Deng@xxxxxxx>> > >> Cc: Zhang, Jack (Jian) <Jack.Zhang1@xxxxxxx > <mailto:Jack.Zhang1@xxxxxxx>> > >> Subject: [PATCH v3] drm/scheduler re-insert Bailing job to avoid > memleak > >> > >> re-insert Bailing jobs to avoid memory leak. > >> > >> V2: move re-insert step to drm/scheduler logic > >> V3: add panfrost's return value for bailing jobs in case it hits > the > >> memleak issue. 
> >> > >> Signed-off-by: Jack Zhang <Jack.Zhang1@xxxxxxx > <mailto:Jack.Zhang1@xxxxxxx>> > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++-- > >> drivers/gpu/drm/panfrost/panfrost_job.c | 4 ++-- > >> drivers/gpu/drm/scheduler/sched_main.c | 8 +++++++- > >> include/drm/gpu_scheduler.h | 1 + > >> 5 files changed, 19 insertions(+), 6 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> index 79b9cc73763f..86463b0f936e 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> @@ -4815,8 +4815,10 @@ int amdgpu_device_gpu_recover(struct > >> amdgpu_device *adev, > >> job ? job->base.id : -1); > >> /* even we skipped this reset, still need to set the > job > >> to guilty */ > >> - if (job) > >> + if (job) { > >> drm_sched_increase_karma(&job->base); > >> + r = DRM_GPU_SCHED_STAT_BAILING; > >> + } > >> goto skip_recovery; > >> } > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >> index 759b34799221..41390bdacd9e 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > >> @@ -34,6 +34,7 @@ static enum drm_gpu_sched_stat > >> amdgpu_job_timedout(struct drm_sched_job *s_job) > >> struct amdgpu_job *job = to_amdgpu_job(s_job); > >> struct amdgpu_task_info ti; > >> struct amdgpu_device *adev = ring->adev; > >> + int ret; > >> memset(&ti, 0, sizeof(struct amdgpu_task_info)); > >> @@ -52,8 +53,11 @@ static enum drm_gpu_sched_stat > >> amdgpu_job_timedout(struct drm_sched_job *s_job) > >> ti.process_name, ti.tgid, ti.task_name, ti.pid); > >> if (amdgpu_device_should_recover_gpu(ring->adev)) { > >> - amdgpu_device_gpu_recover(ring->adev, job); > >> - return DRM_GPU_SCHED_STAT_NOMINAL; > >> + ret = amdgpu_device_gpu_recover(ring->adev, job); > >> + if (ret == DRM_GPU_SCHED_STAT_BAILING) > >> + return DRM_GPU_SCHED_STAT_BAILING; > >> + else > >> + return DRM_GPU_SCHED_STAT_NOMINAL; > >> } else { > >> drm_sched_suspend_timeout(&ring->sched); > >> if (amdgpu_sriov_vf(adev)) > >> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c > >> b/drivers/gpu/drm/panfrost/panfrost_job.c > >> index 6003cfeb1322..e2cb4f32dae1 100644 > >> --- a/drivers/gpu/drm/panfrost/panfrost_job.c > >> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c > >> @@ -444,7 +444,7 @@ static enum drm_gpu_sched_stat > >> panfrost_job_timedout(struct drm_sched_job > >> * spurious. Bail out. > >> */ > >> if (dma_fence_is_signaled(job->done_fence)) > >> - return DRM_GPU_SCHED_STAT_NOMINAL; > >> + return DRM_GPU_SCHED_STAT_BAILING; > >> dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, > >> status=0x%x, head=0x%x, tail=0x%x, sched_job=%p", > >> js, > >> @@ -456,7 +456,7 @@ static enum drm_gpu_sched_stat > >> panfrost_job_timedout(struct drm_sched_job > >> /* Scheduler is already stopped, nothing to do. */ > >> if (!panfrost_scheduler_stop(&pfdev->js->queue[js], > sched_job)) > >> - return DRM_GPU_SCHED_STAT_NOMINAL; > >> + return DRM_GPU_SCHED_STAT_BAILING; > >> /* Schedule a reset if there's no reset in progress. 
*/ > >> if (!atomic_xchg(&pfdev->reset.pending, 1)) diff --git > >> a/drivers/gpu/drm/scheduler/sched_main.c > >> b/drivers/gpu/drm/scheduler/sched_main.c > >> index 92d8de24d0a1..a44f621fb5c4 100644 > >> --- a/drivers/gpu/drm/scheduler/sched_main.c > >> +++ b/drivers/gpu/drm/scheduler/sched_main.c > >> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct > >> work_struct *work) { > >> struct drm_gpu_scheduler *sched; > >> struct drm_sched_job *job; > >> + int ret; > >> sched = container_of(work, struct drm_gpu_scheduler, > >> work_tdr.work); > >> @@ -331,8 +332,13 @@ static void drm_sched_job_timedout(struct > >> work_struct *work) > >> list_del_init(&job->list); > >> spin_unlock(&sched->job_list_lock); > >> - job->sched->ops->timedout_job(job); > >> + ret = job->sched->ops->timedout_job(job); > >> + if (ret == DRM_GPU_SCHED_STAT_BAILING) { > >> + spin_lock(&sched->job_list_lock); > >> + list_add(&job->node, &sched->ring_mirror_list); > >> + spin_unlock(&sched->job_list_lock); > >> + } > > > At this point we don't hold GPU reset locks anymore, and so we could > be racing against another TDR thread from another scheduler ring of > same > device > or another XGMI hive member. The other thread might be in the middle of > luckless > iteration of mirror list (drm_sched_stop, drm_sched_start and > drm_sched_resubmit) > and so locking job_list_lock will not help. Looks like it's required to > take all GPU rest locks > here. > > Andrey > > > >> /* > >> * Guilty job did complete and hence needs to be manually > >> removed > >> * See drm_sched_stop doc. > >> diff --git a/include/drm/gpu_scheduler.h > >> b/include/drm/gpu_scheduler.h index 4ea8606d91fe..8093ac2427ef > 100644 > >> --- a/include/drm/gpu_scheduler.h > >> +++ b/include/drm/gpu_scheduler.h > >> @@ -210,6 +210,7 @@ enum drm_gpu_sched_stat { > >> DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */ > >> DRM_GPU_SCHED_STAT_NOMINAL, > >> DRM_GPU_SCHED_STAT_ENODEV, > >> + DRM_GPU_SCHED_STAT_BAILING, > >> }; > >> /** > >> -- > >> 2.25.1 > >> _______________________________________________ > >> amd-gfx mailing list > >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx <mailto:amd-gfx@xxxxxxxxxxxxxxxxxxxxx> > >> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7Ce90f30af0f43444c6aea08d8e91860c4%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637515638213180413%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=NnLqtz%2BZ8%2BweYwCqRinrfkqmhzibNAF6CYSdVqL6xi0%3D&reserved=0 > > <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis > ts.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CJack. > Zhang1%40amd.com%7C95b2ff206ee74bbe520a08d8e956f5dd%7C3dd8961fe4884e60 > 8e11a82d994e183d%7C0%7C0%7C637515907000888939%7CUnknown%7CTWFpbGZsb3d8 > eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2 > 000&sdata=BGoSfOYiDar8SrpMx%2BsOMWpaMr87bxB%2F9ycu0FhhipA%3D&reserved= > 0> > > >> > > > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx
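[Editor's sketch] For readers who want to see the window discussed in this thread in isolation, below is a small self-contained userspace demonstration; all toy_* names are invented and this is not scheduler code. It plays out sequentially the interleaving where the bailing TDR thread has the timed-out job off the mirror list while the first TDR thread walks that list, so the guilty status is never set even though the job is re-inserted afterwards.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the mirror list; names are made up for illustration. */
struct toy_job {
	int id;
	bool guilty;
	struct toy_job *next;
};

static struct toy_job *mirror_list;	/* head of the toy "mirror list" */

static void toy_list_add_head(struct toy_job *job)
{
	job->next = mirror_list;
	mirror_list = job;
}

static void toy_list_remove(struct toy_job *job)
{
	struct toy_job **p = &mirror_list;

	while (*p && *p != job)
		p = &(*p)->next;
	if (*p)
		*p = job->next;
}

/* What the first TDR thread does: walk the list and mark the bad job. */
static bool toy_mark_guilty(int bad_id)
{
	struct toy_job *j;

	for (j = mirror_list; j; j = j->next) {
		if (j->id == bad_id) {
			j->guilty = true;
			return true;
		}
	}
	return false;	/* job not on the list -> guilty status is lost */
}

int main(void)
{
	struct toy_job bad = { .id = 2 };
	struct toy_job ok  = { .id = 1 };

	toy_list_add_head(&bad);
	toy_list_add_head(&ok);

	/* Bailing TDR thread: pops the job before calling the driver. */
	toy_list_remove(&bad);

	/* Window: if the first TDR thread iterates right now, it cannot
	 * find job 2 and never sets its guilty flag. */
	printf("guilty job found during reset: %s\n",
	       toy_mark_guilty(2) ? "yes" : "no");

	/* Bailing thread re-inserts the job afterwards - too late. */
	toy_list_add_head(&bad);
	printf("guilty flag on job 2 after reset: %s\n",
	       bad.guilty ? "set" : "missed");
	return 0;
}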