On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote: > > I stumbled across this thread when I ran into the same issue, while > > working out how to move drm/msm to use scheduler's retire + > > timeout/recovery (and get rid of our own mirror list of in-flight > > jobs). We already have hw error detection enabled, and it can signal > > quite fast, so assuming the first job on the list is the guilty job > > just won't work. > > > > But I was considering a slightly different approach to fixing this, > > instead just handling it all in drm_sched_main() and getting rid of > > the complicated kthread parking gymnastics. Ie. something along the > > lines of: > > So handling timeouts in the main sched thread wont work as soon as you > have multiple engines and reset that impacts across engines: > > - Nothing is simplified since you still need to stop the other scheduler > threads. > > - You get deadlocks if 2 schedulers time out at the same time, and both > want to stop the other one. > > Hence workqueue. Now the rule for the wq is that you can only have one per > reset domain, so > - single engine you just take the one drm/sched provides > - if reset affects all your engines in the chip, then you allocate on in > the drm_device and pass that to all > - if you have a complex of gpus all interconnected (e.g. xgmi hive for > amd), then it's one wq for the entire hive > > _All_ reset related things must be run on that workqueue or things breaks, > which means if you get hw fault that also needs to be run there. I guess > we should either patch drm/sched to check you call that function from the > right workqueue, or just handle it internally. Hmm, ok.. I guess it would be useful to better document the reasoning for the current design, that would have steered me more towards the approach taken in this patch. BR, -R > -Daniel > > > > > --------------------- > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > b/drivers/gpu/drm/scheduler/sched_main.c > > index 67382621b429..4d6ce775c316 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct > > drm_gpu_scheduler *sched) > > return false; > > } > > > > +static bool handle_timeout(struct drm_gpu_scheduler *sched) > > +{ > > + struct drm_sched_job *bad; > > + > > + if (!sched->has_timeout) > > + return false; > > + > > + sched->has_timeout = false; > > + > > + spin_lock(&sched->job_list_lock); > > + bad = list_first_entry_or_null(&sched->pending_list, > > + struct drm_sched_job, list); > > + > > + if (!bad) { > > + spin_unlock(&sched->job_list_lock); > > + return false; > > + } > > + > > + spin_unlock(&sched->job_list_lock); > > + > > + if (sched->timeout_wq == system_wq) { > > + /* > > + * If driver has no specific requirements about serializing > > + * reset wrt. other engines, just call timedout_job() directly > > + */ > > + sched->ops->timedout_job(job); > > + } else { > > + /* > > + * Otherwise queue it on timeout_wq and wait for it to complete > > + */ > > + ... more typing needed here ... > > + } > > + > > + if (sched->free_guilty) { > > + sched->ops->free_job(job); > > + sched->free_guilty = false; > > + } > > +} > > + > > /** > > * drm_sched_main - main scheduler thread > > * > > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param) > > > > wait_event_interruptible(sched->wake_up_worker, > > (cleanup_job = > > drm_sched_get_cleanup_job(sched)) || > > + handle_timeout(sched) || > > (!drm_sched_blocked(sched) && > > (entity = > > drm_sched_select_entity(sched))) || > > kthread_should_stop()); > > --------------------- > > > > drm_sched_fault() and the sw timeout handler would just set > > sched->has_timeout and kick sched->wake_up_worker. > > > > And since we handle the timeout case after > > drm_sched_get_cleanup_job(), we know that all of the successfully > > completed jobs have already been popped off the list, and won't be > > unfairly maligned. > > > > BR, > > -R > > > > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@xxxxxxx> wrote: > > > > > > [AMD Official Use Only] > > > > > > Okay, I will reprepare this patch > > > > > > Thanks > > > > > > ------------------------------------------ > > > Monk Liu | Cloud-GPU Core team > > > ------------------------------------------ > > > > > > -----Original Message----- > > > From: Daniel Vetter <daniel@xxxxxxxx> > > > Sent: Tuesday, August 31, 2021 9:02 PM > > > To: Liu, Monk <Monk.Liu@xxxxxxx> > > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Chen, Jingwen <Jingwen.Chen@xxxxxxx> > > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler > > > > > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote: > > > > Can we please have some actual commit message here, with detailed > > > > explanation of the race/bug/whatever, how you fix it and why this is > > > > the best option? > > > > > > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote: > > > > > tested-by: jingwen chen <jingwen.chen@xxxxxxx> > > > > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx> > > > > > Signed-off-by: jingwen chen <jingwen.chen@xxxxxxx> > > > > > --- > > > > > drivers/gpu/drm/scheduler/sched_main.c | 24 > > > > > ++++-------------------- > > > > > 1 file changed, 4 insertions(+), 20 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > index ecf8140..894fdb24 100644 > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work) > > > > > sched = container_of(work, struct drm_gpu_scheduler, > > > > > work_tdr.work); > > > > > > > > > > /* Protects against concurrent deletion in > > > > > drm_sched_get_cleanup_job */ > > > > > + if (!__kthread_should_park(sched->thread)) > > > > > > > > This is a __ function, i.e. considered internal, and it's lockless > > > > atomic, i.e. unordered. And you're not explaining why this works. > > > > > > > > Iow it's probably buggy, and an just unconditionally parking the > > > > kthread is probably the right thing to do. If it's not the right thing > > > > to do, there's a bug here for sure. > > > > > > Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much. > > > -Daniel > > > > > > > > + kthread_park(sched->thread); > > > > > + > > > > > spin_lock(&sched->job_list_lock); > > > > > job = list_first_entry_or_null(&sched->pending_list, > > > > > struct drm_sched_job, list); > > > > > > > > > > if (job) { > > > > > - /* > > > > > - * Remove the bad job so it cannot be freed by concurrent > > > > > - * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread > > > > > - * is parked at which point it's safe. > > > > > - */ > > > > > - list_del_init(&job->list); > > > > > spin_unlock(&sched->job_list_lock); > > > > > > > > > > + /* vendor's timeout_job should call drm_sched_start() */ > > > > > status = job->sched->ops->timedout_job(job); > > > > > > > > > > /* > > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad) > > > > > kthread_park(sched->thread); > > > > > > > > > > /* > > > > > - * Reinsert back the bad job here - now it's safe as > > > > > - * drm_sched_get_cleanup_job cannot race against us and release the > > > > > - * bad job at this point - we parked (waited for) any in progress > > > > > - * (earlier) cleanups and drm_sched_get_cleanup_job will not be called > > > > > - * now until the scheduler thread is unparked. > > > > > - */ > > > > > - if (bad && bad->sched == sched) > > > > > - /* > > > > > - * Add at the head of the queue to reflect it was the earliest > > > > > - * job extracted. > > > > > - */ > > > > > - list_add(&bad->list, &sched->pending_list); > > > > > - > > > > > - /* > > > > > * Iterate the job list from later to earlier one and either deactive > > > > > * their HW callbacks or remove them from pending list if they already > > > > > * signaled. > > > > > -- > > > > > 2.7.4 > > > > > > > > > > > > > -- > > > > Daniel Vetter > > > > Software Engineer, Intel Corporation > > > > https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog. > > > > ffwll.ch%2F&data=04%7C01%7CMonk.Liu%40amd.com%7C298815bea18f4fbf76 > > > > b308d96c7f7a8b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376601170 > > > > 51194614%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiL > > > > CJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=QzgCU7%2BPdA0aWL5%2BJLg > > > > KeKbGaMMGqeGI9KE0P0LXlN4%3D&reserved=0 > > > > > > -- > > > Daniel Vetter > > > Software Engineer, Intel Corporation > > > https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=04%7C01%7CMonk.Liu%40amd.com%7C298815bea18f4fbf76b308d96c7f7a8b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660117051194614%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=QzgCU7%2BPdA0aWL5%2BJLgKeKbGaMMGqeGI9KE0P0LXlN4%3D&reserved=0 > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch