RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

[AMD Official Use Only]

Hi Andrey and Daniel,

We have worked for a long time on this new AMD feature that can finally pick out the bad job from all the timed-out ones, and the scheduler change (get/put of the fence in drm_sched_job_timedout, plus removing the bad-job delete and put-back) is the last piece we need.

We understand that once the "bad job list node delete" logic is removed from job_timedout, races can be introduced if a vendor's timedout_job callback accesses the bad job in parallel with the scheduler calling "sched->ops->free_job(cleanup_job)".

To avoid any impact on those vendors, I'd like to propose a very simple change that introduces a new bool member in the scheduler to indicate whether the delete/put-back logic is needed; see the patch below:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea468..5e0bdc4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -495,6 +495,8 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 		return r;
 	}
 
+	ring->sched.keep_bad_job = true;
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 92d8de2..e7ac384 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	struct dma_fence *f = NULL;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
@@ -328,7 +329,11 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
 		 * is parked at which point it's safe.
 		 */
-		list_del_init(&job->list);
+		if (sched->keep_bad_job == false)
+			list_del_init(&job->list);
+		else
+			f = dma_fence_get(job->s_fence->parent); /* hold the parent HW fence so cleanup_jobs cannot drop it to zero; for amdgpu the sched_job is kfree-ed once the parent fence refcount hits zero */
+
 		spin_unlock(&sched->job_list_lock);
 
 		job->sched->ops->timedout_job(job);
@@ -341,6 +346,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			job->sched->ops->free_job(job);
 			sched->free_guilty = false;
 		}
+
+		dma_fence_put(f);
 	} else {
 		spin_unlock(&sched->job_list_lock);
 	}
@@ -396,7 +403,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
 	 * now until the scheduler thread is unparked.
 	 */
-	if (bad && bad->sched == sched)
+	if (bad && bad->sched == sched && sched->keep_bad_job == false)
 		/*
 		 * Add at the head of the queue to reflect it was the earliest
 		 * job extracted.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 4ea8606..5f9a640 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -301,6 +301,7 @@ struct drm_gpu_scheduler {
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
+	bool				keep_bad_job;
 };
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,

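To make the intent concrete, below is roughly what a vendor's timedout_job callback could look like with keep_bad_job set. This is purely illustrative: "my_driver" is a made-up name, the enum return type follows the signature used in the v2 patch quoted further down (older bases use void), and the "job stays valid" part only holds for drivers (like amdgpu after the hw_fence embedding) that free the job together with its parent fence.

	#include <drm/gpu_scheduler.h>
	#include <linux/dma-fence.h>

	static enum drm_gpu_sched_stat my_driver_timedout_job(struct drm_sched_job *sched_job)
	{
		struct dma_fence *parent = sched_job->s_fence->parent;

		/*
		 * With sched->keep_bad_job set, drm_sched_job_timedout holds a
		 * reference on the parent HW fence across this callback, so the
		 * job is not freed underneath us even though it was left on the
		 * pending list.
		 */
		if (parent && dma_fence_is_signaled(parent))
			return DRM_GPU_SCHED_STAT_NOMINAL;	/* completed after all */

		/* ... vendor reset/recovery path, free to dereference sched_job ... */
		return DRM_GPU_SCHED_STAT_NOMINAL;
	}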

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@xxxxxxxx> 
Sent: Wednesday, August 18, 2021 10:43 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Cc: Daniel Vetter <daniel@xxxxxxxx>; Alex Deucher <alexdeucher@xxxxxxxxx>; Chen, JingWen <JingWen.Chen2@xxxxxxx>; Maling list - DRI developers <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > 
> > > > + dri-devel
> > > > 
> > > > Since scheduler is a shared component, please add dri-devel on 
> > > > all scheduler patches.
> > > > 
> > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@xxxxxxx> wrote:
> > > > > [Why]
> > > > > For a bailing job, this commit deletes it from the pending list, 
> > > > > so the bailing job will never have a chance to be resubmitted, 
> > > > > even in advanced TDR mode.
> > > > > 
> > > > > [How]
> > > > > After embedding the hw_fence into amdgpu_job is done, the race 
> > > > > condition that this commit tries to work around is completely 
> > > > > solved. So revert this commit.
> > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > v2:
> > > > > add dma_fence_get/put() around timedout_job to avoid 
> > > > > concurrent deletion while processing timedout_job
> > > > > 
> > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx>
> > > > > ---
> > > > >    drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > >    1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >    {
> > > > >           struct drm_gpu_scheduler *sched;
> > > > >           struct drm_sched_job *job;
> > > > > +       struct dma_fence *fence;
> > > > >           enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > 
> > > > >           sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > 
> > > > >           if (job) {
> > > > >                   /*
> > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > -                * is parked at which point it's safe.
> > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > +                * processing timedout_job
> > > > >                    */
> > > > > -               list_del_init(&job->list);
> > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > 
> > > While this is true for amdgpu, it has no meaning for other drivers 
> > > for whom we haven't done the refactoring of embedding HW fence 
> > > (parent) into the job structure.
> > > In fact thinking
> > > about it, unless you do the HW fence embedding for all the drivers 
> > > using the scheduler you cannot revert this patch or you will just 
> > > break them.
> > btw, why did you do that embedding? I do still have my patches with 
> > dma_fence annotations floating around, but my idea at least was to 
> > fix that issue with a mempool, not with embedding. What was the 
> > motivation for embedding the hw fence?
> > -Daniel
> 
> 
> The motivation was twofold: avoid memory allocation during job 
> submission (HW fence allocation) because, as Christian explained, this 
> leads to deadlocks with mm code during evictions under memory pressure 
> (Christian can clarify if I messed

Yeah, that's the exact same thing I've chased with my dma_fence annotations, but thus far there's been zero to no interest in getting it sorted. I think it'd be good to have some cross-driver agreement on how this should be solved before someone just charges ahead ...
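For the record, the mempool variant would look very roughly like the sketch below. Not a real patch: "my_hw_fence" and the function names are just placeholders for whatever a driver wraps around its HW fence.

	#include <linux/mempool.h>
	#include <linux/dma-fence.h>

	struct my_hw_fence {
		struct dma_fence base;
		/* driver-specific bits ... */
	};

	static mempool_t *fence_pool;

	/* Reserve a handful of fence objects up front, outside of any fence
	 * signalling or submission path.
	 */
	static int my_fence_pool_init(void)
	{
		fence_pool = mempool_create_kmalloc_pool(16, sizeof(struct my_hw_fence));
		return fence_pool ? 0 : -ENOMEM;
	}

	/* In the submission path mempool_alloc() falls back to the reserved
	 * elements rather than diving into direct reclaim, which is the part
	 * that deadlocks under memory pressure.
	 */
	static struct my_hw_fence *my_fence_alloc(void)
	{
		return mempool_alloc(fence_pool, GFP_KERNEL);
	}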

> this explanation up). The second was exactly to revert this patch, because 
> while it solved the issue described in the patch, it created another one 
> for drivers that bail out early during TDR handling for various 
> reasons: the job would just leak because it was already removed from the 
> pending list.

Can't we reinsert it before we restart the scheduler thread? It might need a separate list for that due to the lockless queue tricks. Or am I thinking about the wrong kind of "we lost the job"?
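Rough shape of what I mean, riffing on the hunk the revert removes (purely illustrative, not tested, and assuming the bad job is still handed down to wherever the unpark happens): once the scheduler thread is parked nothing races with the pending list, so the bad job, or a separate list of bailed jobs, could be put back just before unparking:

	/* somewhere on the restart/unpark path */
	if (bad && bad->sched == sched) {
		spin_lock(&sched->job_list_lock);
		list_add(&bad->list, &sched->pending_list);
		spin_unlock(&sched->job_list_lock);
	}

	kthread_unpark(sched->thread);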
-Daniel

> 
> Andrey
> 
> 
> > 
> > 
> > > Andrey
> > > 
> > > 
> > > > >                   spin_unlock(&sched->job_list_lock);
> > > > > 
> > > > >                   status = job->sched->ops->timedout_job(job);
> > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > >                           job->sched->ops->free_job(job);
> > > > >                           sched->free_guilty = false;
> > > > >                   }
> > > > > +               dma_fence_put(fence);
> > > > >           } else {
> > > > >                   spin_unlock(&sched->job_list_lock);
> > > > >           }
> > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > 
> > > > >           kthread_park(sched->thread);
> > > > > 
> > > > > -       /*
> > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > -        * now until the scheduler thread is unparked.
> > > > > -        */
> > > > > -       if (bad && bad->sched == sched)
> > > > > -               /*
> > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > -                * job extracted.
> > > > > -                */
> > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > -
> > > > >           /*
> > > > >            * Iterate the job list from later to  earlier one and either deactive
> > > > >            * their HW callbacks or remove them from pending list if they already
> > > > > --
> > > > > 2.25.1
> > > > > 

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch



