Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

"Deucher, Alexander" <Alexander.Deucher@xxxxxxx> · Tue, 26 Nov 2019 15:36:57 +0000

I recently updated amd-staging-drm-next.  Apply whatever makes sense for now and it'll naturally fall out in the next rebase.

Alex

From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>

Sent: Monday, November 25, 2019 7:09 PM

To: Deng, Emily <Emily.Deng@xxxxxxx>

Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; steven.price@xxxxxxx <steven.price@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>

Subject: Re: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

Christian asked me to submit it to drm-misc instead of our drm-next, to avoid later conflicts with Steven's patch, which he mentioned in this thread and which is not in drm-next yet.

Christian, Alex, once this is merged to drm-misc I guess we need to pull all the latest changes from there into drm-next so the issue Emily reported can be avoided.

Andrey

________________________________________

From: Deng, Emily <Emily.Deng@xxxxxxx>

Sent: 25 November 2019 16:44:36

To: Grodzovsky, Andrey

Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig, Christian; steven.price@xxxxxxx; Grodzovsky, Andrey

Subject: RE: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

[AMD Official Use Only - Internal Distribution Only]

Hi Andrey,

    Seems you didn't submit this patch?

Best wishes

Emily Deng

>-----Original Message-----

>From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>

>Sent: Monday, November 25, 2019 12:51 PM

>Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig,

>Christian <Christian.Koenig@xxxxxxx>; Deng, Emily

><Emily.Deng@xxxxxxx>; steven.price@xxxxxxx; Grodzovsky, Andrey

><Andrey.Grodzovsky@xxxxxxx>

>Subject: [PATCH v4] drm/scheduler: Avoid accessing freed bad job.

>

>Problem:

>Due to a race between drm_sched_cleanup_jobs in sched thread and

>drm_sched_job_timedout in timeout work there is a possibility that bad job

>was already freed while still being accessed from the timeout thread.

>

>Fix:

>Instead of just peeking at the bad job in the mirror list remove it from the list

>under lock and then put it back later when we are guaranteed no race with

>main sched thread is possible which is after the thread is parked.

>

>v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.

>

>v3: Rebase on top of drm-misc-next. v2 is not needed anymore as

>drm_sched_get_cleanup_job already has a lock there.

>

>v4: Fix comments to reflect latest code in drm-misc.

>

>Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>

>Reviewed-by: Christian König <christian.koenig@xxxxxxx>

>Tested-by: Emily Deng <Emily.Deng@xxxxxxx>

>---

> drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++++

> 1 file changed, 27 insertions(+)

>

>diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c

>index 6774955..1bf9c40 100644

>--- a/drivers/gpu/drm/scheduler/sched_main.c

>+++ b/drivers/gpu/drm/scheduler/sched_main.c

>@@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work)

>       unsigned long flags;

>

>       sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

>+

>+      /* Protects against concurrent deletion in drm_sched_get_cleanup_job */

>+      spin_lock_irqsave(&sched->job_list_lock, flags);

>       job = list_first_entry_or_null(&sched->ring_mirror_list,

>                                      struct drm_sched_job, node);

>

>       if (job) {

>+              /*

>+               * Remove the bad job so it cannot be freed by concurrent

>+               * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread

>+               * is parked at which point it's safe.

>+               */

>+              list_del_init(&job->node);

>+              spin_unlock_irqrestore(&sched->job_list_lock, flags);

>+

>               job->sched->ops->timedout_job(job);

>

>               /*

>@@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work)

>                       job->sched->ops->free_job(job);

>                       sched->free_guilty = false;

>               }

>+      } else {

>+              spin_unlock_irqrestore(&sched->job_list_lock, flags);

>       }

>

>       spin_lock_irqsave(&sched->job_list_lock, flags);

>@@ -370,6 +383,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)

>       kthread_park(sched->thread);

>

>       /*

>+       * Reinsert back the bad job here - now it's safe as

>+       * drm_sched_get_cleanup_job cannot race against us and release the

>+       * bad job at this point - we parked (waited for) any in progress

>+       * (earlier) cleanups and drm_sched_get_cleanup_job will not be called

>+       * now until the scheduler thread is unparked.

>+       */

>+      if (bad && bad->sched == sched)

>+              /*

>+               * Add at the head of the queue to reflect it was the earliest

>+               * job extracted.

>+               */

>+              list_add(&bad->node, &sched->ring_mirror_list);

>+

>+      /*

>        * Iterate the job list from later to  earlier one and either deactive

>        * their HW callbacks or remove them from mirror list if they already

>        * signaled.

>--

>2.7.4
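
For readers who want to see the list discipline in isolation, here is a minimal userspace sketch of the idea described in the "Fix:" paragraph above. It is not the kernel code: struct job, struct sched and every function name below are hypothetical stand-ins, and a pthread mutex plays the role of job_list_lock. It only models the ordering (detach the bad job under the lock, let cleanup run, reinsert at the head once cleanup can no longer run); it does not model kthread_park() or the timeout work itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct job {
    struct job *prev, *next;  /* links in the pending list */
    int finished;             /* nonzero once the job has completed */
};

struct sched {
    pthread_mutex_t lock;     /* plays the role of job_list_lock */
    struct job head;          /* circular list sentinel, like ring_mirror_list */
};

static void sched_init(struct sched *s)
{
    pthread_mutex_init(&s->lock, NULL);
    s->head.prev = s->head.next = &s->head;
    s->head.finished = 0;
}

/* Unlink a job and point it at itself, similar in spirit to list_del_init(). */
static void job_unlink(struct job *j)
{
    j->prev->next = j->next;
    j->next->prev = j->prev;
    j->prev = j->next = j;
}

/* Submission side: append a job at the tail of the pending list. */
static void job_add_tail(struct sched *s, struct job *j)
{
    pthread_mutex_lock(&s->lock);
    j->prev = s->head.prev;
    j->next = &s->head;
    s->head.prev->next = j;
    s->head.prev = j;
    pthread_mutex_unlock(&s->lock);
}

/* Timeout side: detach the oldest job while holding the lock, so the
 * cleanup side can no longer find (and free) it. */
static struct job *timeout_take_bad_job(struct sched *s)
{
    struct job *j = NULL;

    pthread_mutex_lock(&s->lock);
    if (s->head.next != &s->head) {
        j = s->head.next;
        job_unlink(j);
    }
    pthread_mutex_unlock(&s->lock);
    return j;
}

/* Cleanup side: detach the oldest job only if it has finished.
 * In a real scheduler the caller would then free it. */
static struct job *cleanup_take_finished(struct sched *s)
{
    struct job *j = NULL;

    pthread_mutex_lock(&s->lock);
    if (s->head.next != &s->head && s->head.next->finished) {
        j = s->head.next;
        job_unlink(j);
    }
    pthread_mutex_unlock(&s->lock);
    return j;
}

/* Must only be called once the cleanup side is known to be stopped.
 * This sketch still takes the lock, which is an assumption of the model
 * rather than something taken from the patch. */
static void reinsert_bad_job(struct sched *s, struct job *bad)
{
    pthread_mutex_lock(&s->lock);
    bad->next = s->head.next;
    bad->prev = &s->head;
    s->head.next->prev = bad;
    s->head.next = bad;
    pthread_mutex_unlock(&s->lock);
}
```

In this model, a job taken by timeout_take_bad_job() is invisible to cleanup_take_finished() until reinsert_bad_job() puts it back at the head, which mirrors the ordering the patch relies on.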
