Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

Daniel Vetter <daniel@xxxxxxxx> · Tue, 31 Aug 2021 15:11:35 +0200

On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
> On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
> > 
> > On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
> > > On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
> > > > On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
> > > > > On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
> > > > > > On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
> > > > > > > On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > On 2021-08-18 10:02 a.m., Alex Deucher wrote:
> > > > > > > > 
> > > > > > > > > + dri-devel
> > > > > > > > > 
> > > > > > > > > Since scheduler is a shared component, please add dri-devel on all
> > > > > > > > > scheduler patches.
> > > > > > > > > 
> > > > > > > > > On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2@xxxxxxx> wrote:
> > > > > > > > > > [Why]
> > > > > > > > > > for bailing job, this commit will delete it from pending list thus the
> > > > > > > > > > bailing job will never have a chance to be resubmitted even in advance
> > > > > > > > > > tdr mode.
> > > > > > > > > > 
> > > > > > > > > > [How]
> > > > > > > > > > after embeded hw_fence into amdgpu_job is done, the race condition that
> > > > > > > > > > this commit tries to work around is completely solved.So revert this
> > > > > > > > > > commit.
> > > > > > > > > > This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
> > > > > > > > > > v2:
> > > > > > > > > > add dma_fence_get/put() around timedout_job to avoid concurrent delete
> > > > > > > > > > during processing timedout_job
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx>
> > > > > > > > > > ---
> > > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
> > > > > > > > > >      1 file changed, 5 insertions(+), 18 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > index a2a953693b45..f9b9b3aefc4a 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > >      {
> > > > > > > > > >             struct drm_gpu_scheduler *sched;
> > > > > > > > > >             struct drm_sched_job *job;
> > > > > > > > > > +       struct dma_fence *fence;
> > > > > > > > > >             enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
> > > > > > > > > > 
> > > > > > > > > >             sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > > > > > @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > > 
> > > > > > > > > >             if (job) {
> > > > > > > > > >                     /*
> > > > > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > > > > -                * is parked at which point it's safe.
> > > > > > > > > > +                * Get job->s_fence->parent here to avoid concurrent delete during
> > > > > > > > > > +                * processing timedout_job
> > > > > > > > > >                      */
> > > > > > > > > > -               list_del_init(&job->list);
> > > > > > > > > > +               fence = dma_fence_get(job->s_fence->parent);
> > > > > > > > While this is true for amdgpu, it has no meaning for other drivers for whom
> > > > > > > > we haven't
> > > > > > > > done the refactoring of embedding HW fence (parent) into the job structure.
> > > > > > > > In fact thinking
> > > > > > > > about it, unless you do the HW fence embedding for all the drivers using the
> > > > > > > > scheduler you cannot
> > > > > > > > revert this patch or you will just break them.
> > > > > > > btw, why did you do that embedding? I do still have my patches with
> > > > > > > dma_fence annotations floating around, but my idea at least was to fix
> > > > > > > that issue with a mempool, not with embeddeding. What was the motivation
> > > > > > > for embedding the wh fence?
> > > > > > > -Daniel
> > > > > > The motivation was 2 fold, avoid memory allocation during jobs submissions
> > > > > > (HW fence allocation) because as Christian explained this leads to deadlock
> > > > > > with
> > > > > > mm code during evictions due to memory pressure (Christian can clarify if I
> > > > > > messed
> > > > > Yeah that's the exact same thing I've chased with my dma_fence
> > > > > annotations, but thus far zero to none interested in getting it sorted. I
> > > > > think it'd be good to have some cross-driver agreement on how this should
> > > > > be solved before someone just charges ahead ...
> > > > > 
> > > > > > this explanation). Second is to exactly revert this patch because while it
> > > > > > solved the issue
> > > > > > described in the patch it created another with drivers who baildc out early
> > > > > > during TDR handling
> > > > > > for various reason and the job would just leak because it was already
> > > > > > removed form pending list.
> > > > > Can't we reinsert it before we restart the scheduler thread? It might need
> > > > > a separate list for that due to the lockless queue tricks. Or am I
> > > > > thinking about the wrong kind of "we lost the job"?
> > > > > -Danile
> > > > 
> > > > If you look at the original patch it would reinsert it even earlier - right
> > > > after stopping the  SW scheduler thread, and even then it was to late for
> > > > some drivers as they would decide to return back from their TDR handler even
> > > > before that. It is solvable but in an ugly way as far as I see, you need to
> > > > require each driver in his code to put the job back in the list if they do
> > > > it before reaching the place where scheduler framework does it. Kind of
> > > > spaghetti code seems to me.
> > > Hm yeah I didn't realize this all happens before we stop the scheduler
> > > thread.
> > > 
> > > Why can't we stop the scheduler thread first, so that there's guaranteed
> > > no race? I've recently had a lot of discussions with panfrost folks about
> > > their reset that spawns across engines, and without stopping the scheduler
> > > thread first before you touch anything it's just plain impossible.
> > 
> > 
> > Talked with Christian on that, for each TDR we actually stop all the
> > schedulers for all the rings and not only the hanged ring since
> > ASIC reset will impact all the rings anyway. So we cannot allow
> > other timeout handlers for other rings run in parallel to ours
> > as they will stop/restart the threads we just stopped and rely
> > on them being stopped. So it's all done with device wide lock
> > inside the amdgpu tTDR handler. Only inside the locked
> > section then we may stop/restart the scheduler threads.
> > Christian also mentioned that you proposed at some point
> > to serialize all TDR handling into single threading for all rings - this
> > seems
> > like something that could be used - we then don't need any
> > locking against TDR handlers from other rings and then we may
> > stop the scheduler thread as first step
> > 
> > 
> > > 
> > > I'm also still not understanding what exactly you guys have done,
> > > can someone please dig out the the amdgpu patches that motivate all this
> > > maybe that's clearer? A full explanation would still be good since I've
> > > only started in scheduler stuff.
> > 
> > 
> > https://gitlab.freedesktop.org/agd5f/linux/-/commit/de7515d43659f852590645a688f8d493e4a18141
> 
> Uh, it would have been really good if this was discussed a bit wider
> beforehand. Now we have rather diverging approaches to this. Also would be
> really good to resurrect the dma_fence annotations too.
> 
> Can you guys pls spend a bit of time on this? Shouldn't be to hard to type
> up rfc conversion patches for the other drivers.

Ping for this. Currently the hw fence is returned from the ->run_job
callback, and that's not great design.

If we embed it, then I think it should start existing latest from
drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
the right thing to do here is to have the hw fence as a pointer in
struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
allocated.

Otherwise we're just diverging across drivers and tempting them to do the
wrong thing with the current ->run_job callback interface.

Can you guys look into this?
-Daniel

> > > Another thing I recently pondered for tdr races looking at i915 code is
> > > whether the tdr should first block the completion fence for that job. My
> > > motivation is to have a race-free error capture (if the completion races
> > > then we might start evicting memory and everything goes boom), but maybe
> > > that helps here too. Some kind of atomic "block this fence from
> > > completing thing.
> > > 
> > > Or I'm I completely guessing in the wrong direction?
> > 
> > 
> > I think we already do it here - https://elixir.bootlin.com/linux/v5.14-rc1/source/drivers/gpu/drm/scheduler/sched_main.c#L410
> 
> Ah yes this works becase drm/sched has separate hw fence from the logical
> job fence.
> -Daniel
> 
> > 
> > Andrey
> > 
> > 
> > > -Daniel
> > > 
> > > > Andrey
> > > > 
> > > > 
> > > > > > Andrey
> > > > > > 
> > > > > > 
> > > > > > > > Andrey
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > > 
> > > > > > > > > >                     status = job->sched->ops->timedout_job(job);
> > > > > > > > > > @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > > > > >                             job->sched->ops->free_job(job);
> > > > > > > > > >                             sched->free_guilty = false;
> > > > > > > > > >                     }
> > > > > > > > > > +               dma_fence_put(fence);
> > > > > > > > > >             } else {
> > > > > > > > > >                     spin_unlock(&sched->job_list_lock);
> > > > > > > > > >             }
> > > > > > > > > > @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > > > > > 
> > > > > > > > > >             kthread_park(sched->thread);
> > > > > > > > > > 
> > > > > > > > > > -       /*
> > > > > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > > > > -        */
> > > > > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > > > > -               /*
> > > > > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > > > > -                * job extracted.
> > > > > > > > > > -                */
> > > > > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > > > > -
> > > > > > > > > >             /*
> > > > > > > > > >              * Iterate the job list from later to  earlier one and either deactive
> > > > > > > > > >              * their HW callbacks or remove them from pending list if they already
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch