Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2021-08-24 3:24 a.m., Liu, Monk wrote:

[AMD Official Use Only]

Hi Andrey

Sorry that it is really hard for me to get any particular or solid potential bugs from your reply, can you be more specific, e.g.: what kind of race issue is introduced by this "kthread_stop/start" approach.


Hey, you might have missed my replies in the thread regarding this. Check them here.

https://www.spinics.net/lists/amd-gfx/msg67041.html
https://www.spinics.net/lists/amd-gfx/msg67090.html

In summery IMHO we can park/unpark only within serialized section against all other possible TDR handlers (at whole ASIC or even XGMI hive level). Today we achieve this by locking. IN the new proposal there is no locking - so we either add one or just serialize TDRs to single thread execution. Let me know if you think it's not an issue actually - i might be missing something.

Andrey



To your another question/concern:
. In a constant rapid stream of jobs each new job comming will try to start the timer but most of the time this operation just bails out as there is already pending timer from one of the previous jobs which cancels out any new ones [1] so, when the TO handler does execute eventually it's not because something wrong but simply because TO has
Expired

I totally agree withy you on this point, and I think I have a patch to address this, but this problem is not related with our current topic at all ... our  current topic is the bailout bad job handling from advanced TDR mode.

The bug here is our current TO handler only do the counting on the first job to the given scheduler, and the following coming job won't recalculate the TO at all, and I can assure you that this is a regression because when I implement TDR years ago I already considered planned for such problem.
Please check this change to resolve it:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..7b5f99a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -235,6 +235,13 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
                 schedule_delayed_work(&sched->work_tdr, sched->timeout);
  }
+static void drm_sched_restart_timeout(struct drm_gpu_scheduler *sched)
+{
+       if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
+           !list_empty(&sched->pending_list))
+               mod_delayed_work(system_wq, &sched->work_tdr, sched->timeout);
+}
+
  /**
   * drm_sched_fault - immediately start timeout handler
   *
@@ -693,6 +682,11 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
         if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
                 /* remove job from pending_list */
                 list_del_init(&job->list);
+
+               /* once the job deleted from pending list we should restart
+                * the timeout calculation for the next job.
+                */
+               drm_sched_restart_timeout(sched);
                 /* make the scheduled timestamp more accurate */
                 next = list_first_entry_or_null(&sched->pending_list,
                                                 typeof(*next), list);


if you guys do not have concerns I can submit this patch for review, but again, let's focus on bailing out had job handling as our priority, we are very close to our purpose, let me know what's your concerned race issue and we can address it.

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Friday, August 20, 2021 10:07 PM
To: Liu, Monk <Monk.Liu@xxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Alex Deucher <alexdeucher@xxxxxxxxx>; Chen, JingWen <JingWen.Chen2@xxxxxxx>; Maling list - DRI developers <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."


On 2021-08-20 3:12 a.m., Liu, Monk wrote:
[AMD Official Use Only]

@Daniel Vetter @Grodzovsky, Andrey @Koenig, Christian
Do you have any concern on the kthread_park() approach ?

Theoretically speaking sched_main shall run there exclusively with
job_timeout since they both touches jobs, and stop scheduler during
job_timeout won't impact performance since in that scenario There was
already something wrong/stuck on that ring/scheduler

Regarding last paragraph, and specifically the claim that there was already something wrong if the TO handler starts execution - Not sure about this and I wonder if we have a potential bug here - when we start the timeout timer in drm_sched_job_begin we do it for each new incoming job. In a constant rapid stream of jobs each new job comming will try to start the timer but most of the time this operation just bails out as there is already pending timer from one of the previous jobs which cancels out any new ones [1] so, when the TO handler does execute eventually it's not because something wrong but simply because TO has expired. If in this case the pending list not empty a false TDR will be triggered. I think long ago we used TO handler per job and not per scheduler, this would solve this problem but hurt the serialization issue we are trying to solve. So not sure what to do.

[1] -
https://elixir.bootlin.com/linux/v5.14-rc1/source/kernel/workqueue.c#L1665

Andrey

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Liu, Monk
Sent: Thursday, August 19, 2021 6:26 PM
To: Daniel Vetter <daniel@xxxxxxxx>; Grodzovsky, Andrey
<Andrey.Grodzovsky@xxxxxxx>
Cc: Alex Deucher <alexdeucher@xxxxxxxxx>; Chen, JingWen
<JingWen.Chen2@xxxxxxx>; Maling list - DRI developers
<dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx list
<amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>
Subject: RE: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

[AMD Official Use Only]

Hi Daniel

Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.
Yeah we had this though as well in our mind.

Our second approach is to call ktrhead_stop() in job_timedout() routine so that  the "bad" job is guaranteed to be used without scheduler's touching or freeing, Check this sample patch one as well please:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index a2a9536..50a49cb 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,17 +319,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
          sched = container_of(work, struct drm_gpu_scheduler,
work_tdr.work);
/* Protects against concurrent deletion in
drm_sched_get_cleanup_job */
+       kthread_park(sched->thread);
          spin_lock(&sched->job_list_lock);
          job = list_first_entry_or_null(&sched->pending_list,
                                         struct drm_sched_job, list);
if (job) {
-               /*
-                * Remove the bad job so it cannot be freed by concurrent
-                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-                * is parked at which point it's safe.
-                */
-               list_del_init(&job->list);
                  spin_unlock(&sched->job_list_lock);
status = job->sched->ops->timedout_job(job);
@@ -345,6 +340,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
          } else {
                  spin_unlock(&sched->job_list_lock);
          }
+       kthread_unpark(sched->thread);
if (status != DRM_GPU_SCHED_STAT_ENODEV) {
                  spin_lock(&sched->job_list_lock); @@ -393,20 +389,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
          kthread_park(sched->thread);
/*
-        * Reinsert back the bad job here - now it's safe as
-        * drm_sched_get_cleanup_job cannot race against us and release the
-        * bad job at this point - we parked (waited for) any in progress
-        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-        * now until the scheduler thread is unparked.
-        */
-       if (bad && bad->sched == sched)
-               /*
-                * Add at the head of the queue to reflect it was the earliest
-                * job extracted.
-                */
-               list_add(&bad->list, &sched->pending_list);
-
-       /*
           * Iterate the job list from later to  earlier one and either deactive
           * their HW callbacks or remove them from pending list if they already
           * signaled.


Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@xxxxxxxx>
Sent: Thursday, August 19, 2021 5:31 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Cc: Daniel Vetter <daniel@xxxxxxxx>; Alex Deucher
<alexdeucher@xxxxxxxxx>; Chen, JingWen <JingWen.Chen2@xxxxxxx>; Maling
list - DRI developers <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; amd-gfx list
<amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Koenig,
Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
On 2021-08-18 10:02 a.m., Alex Deucher wrote:

+ dri-devel

Since scheduler is a shared component, please add dri-devel on
all scheduler patches.

On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@xxxxxxx> wrote:
[Why]
for bailing job, this commit will delete it from pending list
thus the bailing job will never have a chance to be resubmitted
even in advance tdr mode.

[How]
after embeded hw_fence into amdgpu_job is done, the race
condition that this commit tries to work around is completely
solved.So revert this commit.
This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
v2:
add dma_fence_get/put() around timedout_job to avoid concurrent
delete during processing timedout_job

Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx>
---
      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
      1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index a2a953693b45..f9b9b3aefc4a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
      {
             struct drm_gpu_scheduler *sched;
             struct drm_sched_job *job;
+       struct dma_fence *fence;
             enum drm_gpu_sched_stat status =
DRM_GPU_SCHED_STAT_NOMINAL;

             sched = container_of(work, struct
drm_gpu_scheduler, work_tdr.work); @@ -325,11 +326,10 @@ static
void drm_sched_job_timedout(struct work_struct
*work)

             if (job) {
                     /*
-                * Remove the bad job so it cannot be freed by concurrent
-                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-                * is parked at which point it's safe.
+                * Get job->s_fence->parent here to avoid concurrent delete during
+                * processing timedout_job
                      */
-               list_del_init(&job->list);
+               fence =
+ dma_fence_get(job->s_fence->parent);
While this is true for amdgpu, it has no meaning for other
drivers for whom we haven't done the refactoring of embedding HW
fence (parent) into the job structure.
In fact thinking
about it, unless you do the HW fence embedding for all the
drivers using the scheduler you cannot revert this patch or you
will just break them.
btw, why did you do that embedding? I do still have my patches
with dma_fence annotations floating around, but my idea at least
was to fix that issue with a mempool, not with embeddeding. What
was the motivation for embedding the wh fence?
-Daniel
The motivation was 2 fold, avoid memory allocation during jobs
submissions (HW fence allocation) because as Christian explained
this leads to deadlock with mm code during evictions due to memory
pressure (Christian can clarify if I messed
Yeah that's the exact same thing I've chased with my dma_fence
annotations, but thus far zero to none interested in getting it
sorted. I think it'd be good to have some cross-driver agreement on
how this should be solved before someone just charges ahead ...

this explanation). Second is to exactly revert this patch because
while it solved the issue described in the patch it created another
with drivers who baildc out early during TDR handling for various
reason and the job would just leak because it was already removed
form pending list.
Can't we reinsert it before we restart the scheduler thread? It
might need a separate list for that due to the lockless queue
tricks. Or am I thinking about the wrong kind of "we lost the job"?
-Danile
If you look at the original patch it would reinsert it even earlier -
right after stopping the  SW scheduler thread, and even then it was
to late for some drivers as they would decide to return back from
their TDR handler even before that. It is solvable but in an ugly way
as far as I see, you need to require each driver in his code to put
the job back in the list if they do it before reaching the place
where scheduler framework does it. Kind of spaghetti code seems to me.
Hm yeah I didn't realize this all happens before we stop the scheduler thread.

Why can't we stop the scheduler thread first, so that there's guaranteed no race? I've recently had a lot of discussions with panfrost folks about their reset that spawns across engines, and without stopping the scheduler thread first before you touch anything it's just plain impossible.

I'm also still not understanding what exactly you guys have done, can someone please dig out the the amdgpu patches that motivate all this maybe that's clearer? A full explanation would still be good since I've only started in scheduler stuff.

Another thing I recently pondered for tdr races looking at i915 code is whether the tdr should first block the completion fence for that job. My motivation is to have a race-free error capture (if the completion races then we might start evicting memory and everything goes boom), but maybe that helps here too. Some kind of atomic "block this fence from completing thing.

Or I'm I completely guessing in the wrong direction?
-Daniel

Andrey


Andrey


Andrey


                     spin_unlock(&sched->job_list_lock);

                     status =
job->sched->ops->timedout_job(job);
@@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
                             job->sched->ops->free_job(job);
                             sched->free_guilty = false;
                     }
+               dma_fence_put(fence);
             } else {
                     spin_unlock(&sched->job_list_lock);
             }
@@ -392,20 +393,6 @@ void drm_sched_stop(struct
drm_gpu_scheduler *sched, struct drm_sched_job *bad)

             kthread_park(sched->thread);

-       /*
-        * Reinsert back the bad job here - now it's safe as
-        * drm_sched_get_cleanup_job cannot race against us and release the
-        * bad job at this point - we parked (waited for) any in progress
-        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-        * now until the scheduler thread is unparked.
-        */
-       if (bad && bad->sched == sched)
-               /*
-                * Add at the head of the queue to reflect it was the earliest
-                * job extracted.
-                */
-               list_add(&bad->list, &sched->pending_list);
-
             /*
              * Iterate the job list from later to  earlier one and either deactive
              * their HW callbacks or remove them from pending
list if they already
--
2.25.1

--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.
ffwll.ch%2F&amp;data=04%7C01%7Cmonk.liu%40amd.com%7C27fcce7ca8dd4f3960
8508d962f40f33%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376496226
57672189%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiL
CJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=JVZtg3AhbiA%2FDmVbNGo3M
xVliO83nh8%2Fi50PCMsvwyY%3D&amp;reserved=0



[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux