Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
From: Liu, Monk
Sent: Wednesday, September 1, 2021 9:23 AM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Grodzovsky, Andrey
<Andrey.Grodzovsky@xxxxxxx>; Daniel Vetter <daniel@xxxxxxxx>; Chen, JingWen
<JingWen.Chen2@xxxxxxx>
Cc: DRI Development <dri-devel@xxxxxxxxxxxxxxxxxxxxx>;
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: [diagnostic TDR mode patches] unify our solution opinions/
suggestions in one thread
[AMD Official Use Only]
Hi Daniel/Christian/Andrey,
It looks like the feedback from you three is spread across those email floods to me.
The feature we are working on (the diagnostic TDR scheme) has been pending for
more than six months (we started it in Feb 2021).
Honestly speaking, the email workflow we are using now is not friendly and is
quite painful to me.
Can we try to put all our opinions, suggestions, or even objections together
here and go through them one by one? It is too hard for us to reply to each
email on a different question.
For [PATCH 1/2] drm/sched: fix the bug of time out calculation (v4)
This is a fix for the timeout timer in the scheduler. Can we complete this one
first? It should already resolve all the questions and suggestions.
I have no objections to this one besides getting rid of the
kthread_should_park() return-NULL part;
if my answer above is not wrong, then it seems superfluous to me.
For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
I think I already explained the questions raised by Daniel in the other thread
regarding why I use __kthread_should_park().
Is this race free? Can't the other thread execute kthread_park() after the
check?
For the other aspects, can we synthesize all our opinions here?
So, to summarize from previous threads: I think the best solution to the
problem being solved in this patch is to do the HW fence embedding at the
drm_sched_job level instead of doing it only for amdgpu. By modifying all the
drivers to support this, we can both remove this hack and solve the race
against concurrent drm_sched_cleanup_jobs job freeing, simply by taking a
reference to the HW fence of the job at the beginning of
drm_sched_job_timedout.
If doing this refactoring for all the drivers is not an option now and you
need a quick solution such as the serialization you do here, then take into
account other concurrent TDR handlers that might run. As mentioned before,
without serializing against TDR handlers from other schedulers you just race
against them; e.g. you parked the thread now, but another handler in progress
will unpark it as part of calling drm_sched_start for other rings.
So you either need a global lock or a dedicated single-threaded queue, as
Daniel suggested.
At minimum we should change cancel_delayed_work in drm_sched_stop to
cancel_delayed_work_sync, to cancel and flush all concurrent TDRs, and
probably move it to the beginning of the function, after kthread_park and
before we start playing with the pending list.
P.S. One comment I had regarding the single-threaded queue: in case we have
multiple TDRs arising from the same event, we will proceed to reset multiple
times, something that with trylock we mostly avoid today within amdgpu (see
amdgpu_device_lock_adev and amdgpu_device_lock_hive_adev).
Daniel mentioned it's not a problem, but I still haven't understood why not.
Andrey