Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Jingwen Chen <Jingwen.Chen2@xxxxxxx> · Wed, 1 Sep 2021 12:40:15 +0800



On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
> > On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
> > > I will answer everything here -
> > > 
> > > On 2021-08-31 9:58 p.m., Liu, Monk wrote:
> > > 
> > > 
> > >      [AMD Official Use Only]
> > > 
> > > 
> > >      In the previous discussion, you guys stated that we should drop the
> > >      “kthread_should_park” in cleanup_job.
> > > 
> > > 
> > >      @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> > >      {
> > >              struct drm_sched_job *job, *next;
> > > 
> > >      -       /*
> > >      -        * Don't destroy jobs while the timeout worker is running OR thread
> > >      -        * is being parked and hence assumed to not touch pending_list
> > >      -        */
> > >      -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > >      -           !cancel_delayed_work(&sched->work_tdr)) ||
> > >      -           kthread_should_park())
> > >      -               return NULL;
> > > 
> > > 
> > >      But I suddenly have a question here: if we return the timed-out job
> > >      regardless of kthread_should_park(), then we are back to the original
> > >      problem: the timed-out job suddenly signals, cleanup_job still returns
> > >      it to sched_main, and the job is freed while it is still being handled
> > >      by the vendor’s timeout callback.
> > > 
> > > 
> > >      If we return NULL from cleanup_job when kthread_should_park() is true, we
> > >      can prevent the above scenario: once a job is being processed by
> > >      job_timedout we can stop its scheduler, and after that, even if this job
> > >      suddenly signals, cleanup_job won’t return it, so sched_main won’t free
> > >      it in parallel …
> > > 
> > > 
> > >      What do you think ?
> > > 
> > > 
> > > If your analysis above takes into account that you also submit
> > > '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler', then I don't
> > > see a problem -
> > Hi Andrey,
> > Monk has talked to me, and we agreed that since there are multiple opinions
> > about '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch 1
> > is an independent fix, we should not take patch 2 into account in this
> > analysis.
> > 
> > > I think that as long as you put kthread_park(sched->thread) BEFORE
> > > fetching next bad job from pending list (under spinlock) there is no
> > > such issue as in the case you describe because this potential bad job
> > > that became signaled will be removed from pending list before you
> > > even fetch the next job and by the time you fetch it the scheduler
> > > thread is already stopped anyway
> > > 
> > > If you don't submit it and we keep the removal hack for now, then there is
> > > also no problem, because we temporarily remove the job we fetch for TDR
> > > from the pending list under spinlock exactly to avoid this race.
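A minimal userspace sketch of the temporary-removal approach described above, assuming a simplified singly linked pending list. The `model_*` names are hypothetical and are not the drm_sched API; in the kernel the list operations would run under the scheduler's spinlock, and the re-insert step is an assumption of this model rather than something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types, not the real drm_sched structures. */
struct model_job {
	int id;
	bool signaled;			/* stands in for the HW fence having signaled */
	struct model_job *next;
};

struct model_sched {
	struct model_job *pending;	/* head of the pending list, oldest first */
};

/* Submission side: append a job at the tail of the pending list. */
static void model_push(struct model_sched *s, struct model_job *j)
{
	struct model_job **p = &s->pending;

	while (*p)
		p = &(*p)->next;
	j->next = NULL;
	*p = j;
}

/* TDR side: unlink the oldest job so the cleanup path cannot see it. */
static struct model_job *model_tdr_fetch(struct model_sched *s)
{
	struct model_job *j = s->pending;

	if (j) {
		s->pending = j->next;
		j->next = NULL;
	}
	return j;
}

/* Cleanup side: hand back the head job only if it has signaled. */
static struct model_job *model_get_cleanup_job(struct model_sched *s)
{
	struct model_job *j = s->pending;

	if (j && j->signaled) {
		s->pending = j->next;
		j->next = NULL;
		return j;
	}
	return NULL;
}

/* TDR side: put the job back at the head once the handler is done. */
static void model_tdr_reinsert(struct model_sched *s, struct model_job *j)
{
	j->next = s->pending;
	s->pending = j;
}
```

While the bad job is off the list, a late signal cannot make the cleanup path return (and free) it; it only becomes visible again after the re-insert.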
> > > 
> > So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
> > Patch v3 keeps this kthread_should_park check.
> 
> 
> But since in both cases it looks like there is no danger of use-after-free,
> I see no reason to keep the kthread_should_park check.
> 
> Andrey
OK, I get it. Patch v4 has removed this check; can you help review
[PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?
> 
> 
> > 
> > Best Regards,
> > JingWen
> > > 
> > >      Thanks
> > > 
> > > 
> > >      ------------------------------------------
> > > 
> > >      Monk Liu | Cloud-GPU Core team
> > > 
> > >      ------------------------------------------
> > > 
> > > 
> > >      From: Liu, Monk
> > >      Sent: Wednesday, September 1, 2021 9:23 AM
> > >      To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Grodzovsky, Andrey
> > >      <Andrey.Grodzovsky@xxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Chen, JingWen
> > >      <JingWen.Chen2@xxxxxxx>
> > >      Cc: DRI Development <dri-devel@xxxxxxxxxxxxxxxxxxxxx>;
> > >      amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > >      Subject: [diagnostic TDR mode patches] unify our solution opinions/
> > >      suggestions in one thread
> > > 
> > > 
> > >      [AMD Official Use Only]
> > > 
> > > 
> > >      Hi Daniel/Christian/Andrey
> > > 
> > > 
> > >      It looks to me like the feedback from you three is spread across a flood
> > >      of emails, and the feature we are working on (diagnostic TDR scheme) has
> > >      been pending for more than 6 months (we started it in Feb 2021).
> > > 
> > > 
> > >      Honestly speaking, the way we are using email now is not friendly and is
> > >      quite painful to me.
> > > 
> > >      Can we try to put all our opinions, suggestions, or even objections
> > >      together here and go through them one by one? It’s too hard for us to
> > >      reply to each email on different questions.
> > > 
> > > 
> > >      For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
> > > 
> > > 
> > >      This is a fix for the timeout timer in the scheduler; can we complete
> > >      this one first? It should already have resolved all the questions and
> > >      suggestions.
> > > 
> > > 
> > > I have no objections to this one besides getting rid of the
> > > kthread_should_park() return NULL part;
> > > if my answer above is not wrong then it seems superfluous to me.
> > > 
> > > 
> > > 
> > >      For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > > 
> > > 
> > >      I think I already explained the questions raised by Daniel in the other
> > >      thread regarding why I use __kthread_should_park().
> > > 
> > > 
> > > Is this race free? Can't the other thread execute kthread_park after the check?
> > > 
> > > 
> > >      For other aspects, can we gather all our opinions here?
> > > 
> > > 
> > > So to summarize from previous threads: I think the best solution to the
> > > problem being solved in this patch is to do HW fence embedding at the
> > > drm_sched_job level instead of doing it only for amdgpu. By modifying all
> > > the drivers to support this, we can both remove this hack and solve the race
> > > against concurrent drm_sched_cleanup_jobs job freeing, just by taking a
> > > reference to the HW fence of the job at the beginning of
> > > drm_sched_job_timedout.
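The reference-taking idea above can be modeled in a few lines. This is a hedged sketch with a plain, non-atomic counter; `model_fence` and its helpers are hypothetical stand-ins, whereas the kernel would use dma_fence_get()/dma_fence_put() on the real fence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted HW fence. */
struct model_fence {
	int refcount;
	bool freed;	/* set when the last reference is dropped */
};

/* Take an extra reference, as the timeout handler would on entry. */
static void model_fence_get(struct model_fence *f)
{
	assert(!f->freed);
	f->refcount++;
}

/* Drop a reference; the object is released only when the count hits zero. */
static void model_fence_put(struct model_fence *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;
}
```

If the timeout handler holds its own reference, a concurrent cleanup that drops the job's reference does not release the fence underneath the handler; the release happens only on the handler's final put.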
> > > 
> > > If doing this refactoring for all the drivers is not an option now and you
> > > need a quick solution such as the serialization you do here, then take into
> > > account other concurrent TDR handlers that might run. As mentioned before,
> > > without serializing against TDR handlers from other schedulers you just race
> > > against them here, e.g. you parked it now but another one in progress will
> > > unpark it as part of calling drm_sched_start for other rings.
> > > So you either need a global lock or a dedicated single threaded queue, as
> > > Daniel suggested.
> > > At minimum we should change cancel_delayed_work in drm_sched_stop to
> > > cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and
> > > probably move it to the beginning of the function, after kthread_park
> > > and before we start playing with the pending list.
> > > 
> > > P.S. One comment I had regarding the single threaded queue is that in case we
> > > have multiple TDRs arising from the same event we will proceed to resetting
> > > multiple times - something that we mostly avoid today with trylock within
> > > amdgpu (see amdgpu_device_lock_adev and amdgpu_device_lock_hive_adev).
> > > Daniel mentioned it's not a problem but I still haven't understood why not.
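The trylock behaviour mentioned in the P.S. can be sketched as follows, assuming a single shared flag guards the reset. `model_try_reset` and `model_reset_done` are hypothetical names, not the amdgpu functions cited above; the point is only that a second TDR arriving while a reset is in progress backs off instead of resetting again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global guard: set while a reset is in flight. */
static atomic_flag model_reset_in_progress = ATOMIC_FLAG_INIT;
static int model_resets_started;

/* Returns true if this caller won the right to perform the reset. */
static bool model_try_reset(void)
{
	if (atomic_flag_test_and_set(&model_reset_in_progress))
		return false;	/* someone else is already resetting */
	model_resets_started++;
	return true;
}

/* Called by the winner once the reset has completed. */
static void model_reset_done(void)
{
	atomic_flag_clear(&model_reset_in_progress);
}
```

With a single threaded queue instead, each queued TDR would run in turn, which is the repeated-reset concern raised above.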
> > > 
> > > Andrey
> > > 
> > > 
> > > 
> > >      Thanks !
> > > 
> > > 
> > >      ------------------------------------------
> > > 
> > >      Monk Liu | Cloud-GPU Core team
> > > 
> > >      ------------------------------------------
> > > 
> > >