On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote: > > On 2021-09-01 12:25 a.m., Jingwen Chen wrote: > > On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: > > > I will answer everything here - > > > > > > On 2021-08-31 9:58 p.m., Liu, Monk wrote: > > > > > > > > > [AMD Official Use Only] > > > > > > > > > In the previous discussion, you guys stated that we should drop the > > > “kthread_should_park” in cleanup_job. > > > > > > > > > @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler > > > *sched) > > > > > > { > > > > > > struct drm_sched_job *job, *next; > > > > > > > > > - /* > > > > > > - * Don't destroy jobs while the timeout worker is running OR > > > thread > > > > > > - * is being parked and hence assumed to not touch pending_list > > > > > > - */ > > > > > > - if ((sched->timeout != MAX_SCHEDULE_TIMEOUT && > > > > > > - !cancel_delayed_work(&sched->work_tdr)) || > > > > > > - kthread_should_park()) > > > > > > - return NULL; > > > > > > > > > But I suddenly have a question here: if return the timedout job no matter > > > kthread_should_park() or not, then we are backing to the original problem > > > again: that the timedout_job is suddenly signaling and cleanup_job still > > > returns it to sched_main and job is freed while it is still handling by > > > vendor’s timeout callback > > > > > > > > > If we return NULL when kthread_should_park() in cleanup_job, we can prevent > > > above scenario from happening: once a job is processed by job_timedout we > > > can stop its scheduler, and after that even this job suddenly signaled the > > > cleanup_job won’t return it so sched_main won’t free it in parallel … > > > > > > > > > What do you think ? > > > > > > > > > Is your analysis above takes into account that you also submit > > > '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a > > > problem - > > Hi Andrey, > > Monk has talked to me and we agreed that as there're multiple opinions about the > > '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch > > 1 is an independent patch to fix some error. So we should not take the patch 2 into > > analysis. > > > > > I think that as long as you put kthread_park(sched->thread) BEFORE > > > fetching next bad job from pending list (under spinlock) there is no > > > such issue as in the case you describe because this potential bad job > > > that became signaled will be removed from pending list before you > > > even fetch the next job and by the time you fetch it the scheduler > > > thread is already stopped anyway > > > > > > If you don't submit and we keep the removal hack for now then also no problem > > > because > > > we temporary remove the job we fetch for TDR from pending list under spinlock > > > exactly to avoid this race > > > > > So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)? > > patch v3 keeps this kthread_should_park check. > > > But since in both cases looks like there is no danger of use after free > then I see no reason to keep kthread_should_park check. > > Andrey OK, I get it. So patch v4 has removed this check, can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)? > > > > > > Best Regards, > > JingWen > > > > > > Thanks > > > > > > > > > ------------------------------------------ > > > > > > Monk Liu | Cloud-GPU Core team > > > > > > ------------------------------------------ > > > > > > > > > From: Liu, Monk > > > Sent: Wednesday, September 1, 2021 9:23 AM > > > To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Grodzovsky, Andrey > > > <Andrey.Grodzovsky@xxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Chen, JingWen > > > <JingWen.Chen2@xxxxxxx> > > > Cc: DRI Development <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; > > > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > > Subject: [diagnostic TDR mode patches] unify our solution opinions/ > > > suggestions in one thread > > > > > > > > > [AMD Official Use Only] > > > > > > > > > Hi Daniel/Christian/Andrey > > > > > > > > > It looks the voice from you three are spread over those email floods to me, > > > the feature we are working on (diagnostic TDR scheme) is pending there for > > > more than 6 month (we started it from feb 2021). > > > > > > > > > Honestly speaking the email ways that we are using now is not friendly and > > > quite painful to me …. > > > > > > Can we try to put all our opinions, suggestions, or even objects here > > > together, let’s go through them one by one, it’s too hard for us to reply > > > each email on different questions . > > > > > > > > > For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4) > > > > > > > > > This is a fixing patch on the timeout timer in scheduler, can we complete > > > this one first ? it should already resolved all the questions and > > > suggestions. > > > > > > > > > I have no objections for this one besides getting rid of the > > > kthread_should_park()) return null part, > > > if my answer above is not wrong then it seems superfluous to me > > > > > > > > > > > > For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler > > > > > > > > > I think I already explained the questions raised by Daniel in other thread > > > , regarding why I use __kthread_should_park() > > > > > > > > > Is this race free ? Can't the other thread execute kthread_park after the check > > > ? > > > > > > > > > For other aspects, can we put all our opinion synthesized here ? > > > > > > > > > So to summarize from previous threads I think that the best solution > > > to the problem being solved in this patch is if we do HW fence embedding > > > at the drm_sched_job level instead of doing it only for amdgpu, and modifying > > > all > > > the drivers to support this we can both remove this hack and solve the race > > > against concurrent drm_sched_cleanup_jobs job freeing just by taking reference > > > to the hw fence of the job at the beginning of drm_sched_job_timedout > > > > > > If doing this refactoring for all the drivers is not an option now and you need > > > a quick > > > solution such as the serialization you do here then take into account other > > > concurrent > > > TDR handlers that might run, as mentioned before, without serializing against > > > other TDR handlers from other > > > schedulers you just race here against them, e.g. you parked it now but another > > > one in progress will unpark it as part of calling drm_sched_start for other > > > rings. > > > So you either need a global lock or dedicated single threaded queue as Daniel > > > suggested. > > > At minimum we should change cancel_delayed_work in drm_sched_stop to > > > cancel_delayed_work_sync > > > to cancel and flush all concurrent TDRs and probably move it to the begining of > > > the function, after kthread_park > > > and before we start playing with the pending list. > > > > > > P.S One comment I had regarding single threaded queue is that in case we have > > > multiple TDR > > > arising from same event we will proceed to resetting multiple times - something > > > that with trylock > > > we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and > > > amdgpu_device_lock_hive_adev) > > > Daniel mentioned it's not a problem but I still haven't understood why not. > > > > > > Andrey > > > > > > > > > > > > Thanks ! > > > > > > > > > ------------------------------------------ > > > > > > Monk Liu | Cloud-GPU Core team > > > > > > ------------------------------------------ > > > > > >