Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Wed, 10 Nov 2021 07:50:47 -0500

On 2021-11-10 5:09 a.m., Christian König wrote:
Am 10.11.21 um 10:50 schrieb Daniel Vetter:
On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
I stumbled across this thread when I ran into the same issue, while
working out how to move drm/msm to use scheduler's retire +
timeout/recovery (and get rid of our own mirror list of in-flight
jobs).  We already have hw error detection enabled, and it can signal
quite fast, so assuming the first job on the list is the guilty job
just won't work.

But I was considering a slightly different approach to fixing this,
instead just handling it all in drm_sched_main() and getting rid of
the complicated kthread parking gymnastics.  Ie. something along the
lines of:
So handling timeouts in the main sched thread wont work as soon as you
have multiple engines and reset that impacts across engines:

- Nothing is simplified since you still need to stop the other 
scheduler
   threads.

- You get deadlocks if 2 schedulers time out at the same time, and 
both
   want to stop the other one.

Hence workqueue. Now the rule for the wq is that you can only have 
one per
reset domain, so
- single engine you just take the one drm/sched provides
- if reset affects all your engines in the chip, then you allocate 
on in
   the drm_device and pass that to all
- if you have a complex of gpus all interconnected (e.g. xgmi hive for
   amd), then it's one wq for the entire hive

_All_ reset related things must be run on that workqueue or things 
breaks,
which means if you get hw fault that also needs to be run there. I 
guess
we should either patch drm/sched to check you call that function 
from the
right workqueue, or just handle it internally.
Hmm, ok.. I guess it would be useful to better document the reasoning
for the current design, that would have steered me more towards the
approach taken in this patch.
Maybe this was because you worked on an old kernel? Boris did update the
kerneldoc as part of making gpu reset work for panfrost, which has this
multi-engine reset problem. If that's not yet clear then we need to
improve the docs further.

AMD's problem is even worse, because their reset domain is the entire 
xgmi
hive, so multiple pci devices.

I'm pushing for quite a while that we get something like an 
amdgpu_reset_domain structure or similar for this, but we 
unfortunately don't have that yet.

Maybe it should be a good idea to have something like a 
drm_sched_domain or similar with all the necessary information for the 
inter scheduler handling.

E.g. a workqueue for reset etc...

Regards,
Christian.


I think Monk and Jingwen already switched SRIOV case to using Boris's 
single threaded queue
interface for SRIOV, we can try to expand this to general bare metal 
case for AMD and on the way
to add drm_sched_domain to the scheduler.

Andrey




Also there might more issues in drm/sched ofc, e.g. I've looked a bit at
ordering/barriers and I'm pretty sure a lot are still missing. Or at 
least
we should have comments in the code explaining why it all works.
-Daniel

BR,
-R

-Daniel

---------------------

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index 67382621b429..4d6ce775c316 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
drm_gpu_scheduler *sched)
         return false;
  }

+static bool handle_timeout(struct drm_gpu_scheduler *sched)
+{
+       struct drm_sched_job *bad;
+
+       if (!sched->has_timeout)
+               return false;
+
+       sched->has_timeout = false;
+
+       spin_lock(&sched->job_list_lock);
+       bad = list_first_entry_or_null(&sched->pending_list,
+                                      struct drm_sched_job, list);
+
+       if (!bad) {
+               spin_unlock(&sched->job_list_lock);
+               return false;
+       }
+
+       spin_unlock(&sched->job_list_lock);
+
+       if (sched->timeout_wq == system_wq) {
+               /*
+                * If driver has no specific requirements about 
serializing
+                * reset wrt. other engines, just call 
timedout_job() directly
+                */
+               sched->ops->timedout_job(job);
+       } else {
+               /*
+                * Otherwise queue it on timeout_wq and wait for 
it to complete
+                */
+               ... more typing needed here ...
+       }
+
+       if (sched->free_guilty) {
+               sched->ops->free_job(job);
+               sched->free_guilty = false;
+       }
+}
+
  /**
   * drm_sched_main - main scheduler thread
   *
@@ -787,6 +826,7 @@ static int drm_sched_main(void *param)

wait_event_interruptible(sched->wake_up_worker,
                                          (cleanup_job =
drm_sched_get_cleanup_job(sched)) ||
+ handle_timeout(sched) ||
(!drm_sched_blocked(sched) &&
                                           (entity =
drm_sched_select_entity(sched))) ||
kthread_should_stop());
---------------------

drm_sched_fault() and the sw timeout handler would just set
sched->has_timeout and kick sched->wake_up_worker.

And since we handle the timeout case after
drm_sched_get_cleanup_job(), we know that all of the successfully
completed jobs have already been popped off the list, and won't be
unfairly maligned.

BR,
-R

On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@xxxxxxx> wrote:
[AMD Official Use Only]

Okay, I will reprepare this patch

Thanks

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: Daniel Vetter <daniel@xxxxxxxx>
Sent: Tuesday, August 31, 2021 9:02 PM
To: Liu, Monk <Monk.Liu@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; 
dri-devel@xxxxxxxxxxxxxxxxxxxxx; Chen, Jingwen 
<Jingwen.Chen@xxxxxxx>
Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and 
scheduler

On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
Can we please have some actual commit message here, with detailed
explanation of the race/bug/whatever, how you fix it and why 
this is
the best option?

On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
tested-by: jingwen chen <jingwen.chen@xxxxxxx>
Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
Signed-off-by: jingwen chen <jingwen.chen@xxxxxxx>
---
  drivers/gpu/drm/scheduler/sched_main.c | 24
++++--------------------
  1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c
b/drivers/gpu/drm/scheduler/sched_main.c
index ecf8140..894fdb24 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct 
work_struct *work)
     sched = container_of(work, struct drm_gpu_scheduler,
work_tdr.work);

     /* Protects against concurrent deletion in
drm_sched_get_cleanup_job */
+   if (!__kthread_should_park(sched->thread))
This is a __ function, i.e. considered internal, and it's lockless
atomic, i.e. unordered. And you're not explaining why this works.

Iow it's probably buggy, and an just unconditionally parking the
kthread is probably the right thing to do. If it's not the right 
thing
to do, there's a bug here for sure.
Also why don't we reuse the function drivers already have to stop 
a scheduler thread? We seem to have two kthread_park now, that's 
probably one too much.
-Daniel

+ kthread_park(sched->thread);
+
     spin_lock(&sched->job_list_lock);
     job = list_first_entry_or_null(&sched->pending_list,
                                    struct drm_sched_job, list);

     if (job) {
-           /*
-            * Remove the bad job so it cannot be freed by 
concurrent
-            * drm_sched_cleanup_jobs. It will be reinserted 
back after sched->thread
-            * is parked at which point it's safe.
-            */
-           list_del_init(&job->list);
spin_unlock(&sched->job_list_lock);

+           /* vendor's timeout_job should call 
drm_sched_start() */
             status = job->sched->ops->timedout_job(job);

             /*
@@ -393,20 +391,6 @@ void drm_sched_stop(struct 
drm_gpu_scheduler *sched, struct drm_sched_job *bad)
     kthread_park(sched->thread);

     /*
-    * Reinsert back the bad job here - now it's safe as
-    * drm_sched_get_cleanup_job cannot race against us and 
release the
-    * bad job at this point - we parked (waited for) any in 
progress
-    * (earlier) cleanups and drm_sched_get_cleanup_job will 
not be called
-    * now until the scheduler thread is unparked.
-    */
-   if (bad && bad->sched == sched)
-           /*
-            * Add at the head of the queue to reflect it was 
the earliest
-            * job extracted.
-            */
-           list_add(&bad->list, &sched->pending_list);
-
-   /*
      * Iterate the job list from later to  earlier one and 
either deactive
      * their HW callbacks or remove them from pending list if 
they already
      * signaled.
--
2.7.4

--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog. 

ffwll.ch%2F&amp;data=04%7C01%7CMonk.Liu%40amd.com%7C298815bea18f4fbf76 

b308d96c7f7a8b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376601170 

51194614%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiL 

CJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QzgCU7%2BPdA0aWL5%2BJLg 

KeKbGaMMGqeGI9KE0P0LXlN4%3D&amp;reserved=0
--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C3c31e06d94674e61f3a008d9a4323fb1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637721358003695123%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=FMFNbfIxwNG3Tv1p1iHLI%2BpepJgwvsT%2FxhL%2FKZc0eVE%3D&amp;reserved=0 

--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&amp;data=04%7C01%7Candrey.grodzovsky%40amd.com%7C3c31e06d94674e61f3a008d9a4323fb1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637721358003695123%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=FMFNbfIxwNG3Tv1p1iHLI%2BpepJgwvsT%2FxhL%2FKZc0eVE%3D&amp;reserved=0