On Wed, Nov 10, 2021 at 1:50 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
> > On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > >
> > > On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > > > I stumbled across this thread when I ran into the same issue, while
> > > > working out how to move drm/msm to use scheduler's retire +
> > > > timeout/recovery (and get rid of our own mirror list of in-flight
> > > > jobs).  We already have hw error detection enabled, and it can signal
> > > > quite fast, so assuming the first job on the list is the guilty job
> > > > just won't work.
> > > >
> > > > But I was considering a slightly different approach to fixing this,
> > > > instead just handling it all in drm_sched_main() and getting rid of
> > > > the complicated kthread parking gymnastics.  Ie. something along the
> > > > lines of:
> > >
> > > So handling timeouts in the main sched thread won't work as soon as you
> > > have multiple engines and reset that impacts across engines:
> > >
> > > - Nothing is simplified since you still need to stop the other scheduler
> > >   threads.
> > >
> > > - You get deadlocks if 2 schedulers time out at the same time, and both
> > >   want to stop the other one.
> > >
> > > Hence workqueue. Now the rule for the wq is that you can only have one per
> > > reset domain, so
> > > - single engine: you just take the one drm/sched provides
> > > - if reset affects all your engines in the chip, then you allocate one in
> > >   the drm_device and pass that to all
> > > - if you have a complex of gpus all interconnected (e.g. xgmi hive for
> > >   amd), then it's one wq for the entire hive
> > >
> > > _All_ reset related things must be run on that workqueue or things break,
> > > which means if you get a hw fault that also needs to be run there. I guess
> > > we should either patch drm/sched to check you call that function from the
> > > right workqueue, or just handle it internally.
> >
> > Hmm, ok.. I guess it would be useful to better document the reasoning
> > for the current design, that would have steered me more towards the
> > approach taken in this patch.
>
> Maybe this was because you worked on an old kernel? Boris did update the
> kerneldoc as part of making gpu reset work for panfrost, which has this
> multi-engine reset problem. If that's not yet clear then we need to
> improve the docs further.

I saw that, and understood the ordered wq.. but missed the implication
regarding having to park the other scheduler kthreads.

BR,
-R

> AMD's problem is even worse, because their reset domain is the entire xgmi
> hive, so multiple pci devices.
>
> Also there might be more issues in drm/sched ofc, e.g. I've looked a bit at
> ordering/barriers and I'm pretty sure a lot are still missing. Or at least
> we should have comments in the code explaining why it all works.
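To make the one-wq-per-reset-domain rule above concrete, here is a rough
sketch of the middle case (reset takes down every engine in the chip): the
driver allocates a single ordered workqueue in its device structure and
hands the same wq to every engine's drm_sched_init().  The my_device /
my_device_init_scheds / "my-gpu-reset" names, the engine count and the
submission/timeout numbers are all made up, and the drm_sched_init() call
assumes the ~v5.15 signature that grew a timeout_wq parameter, so treat
this purely as an illustration of the rule, not as code from any of the
patches in this thread:

---------------------
/* Sketch only: one ordered wq per reset domain, shared by all engines. */
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

#define MY_NUM_ENGINES 4

struct my_device {
	struct workqueue_struct *reset_wq;              /* one per reset domain */
	struct drm_gpu_scheduler sched[MY_NUM_ENGINES]; /* one per engine */
};

static int my_device_init_scheds(struct my_device *mdev,
				 const struct drm_sched_backend_ops *ops)
{
	int i, ret;

	/* Ordered wq: at most one timeout/fault handler runs at any time. */
	mdev->reset_wq = alloc_ordered_workqueue("my-gpu-reset", 0);
	if (!mdev->reset_wq)
		return -ENOMEM;

	for (i = 0; i < MY_NUM_ENGINES; i++) {
		/*
		 * Every engine in the reset domain passes the *same* wq as
		 * timeout_wq, so their timedout_job() callbacks are
		 * serialized against each other instead of each handler
		 * having to park all the other scheduler threads itself.
		 */
		ret = drm_sched_init(&mdev->sched[i], ops,
				     64,                        /* hw_submission */
				     0,                         /* hang_limit */
				     msecs_to_jiffies(500),     /* timeout */
				     mdev->reset_wq,
				     NULL, "my-ring");
		if (ret)
			return ret;
	}

	return 0;
}
---------------------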
> -Daniel
>
> > BR,
> > -R
> >
> > > -Daniel
> > >
> > > >
> > > > ---------------------
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index 67382621b429..4d6ce775c316 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
> > > >         return false;
> > > >  }
> > > >
> > > > +static bool handle_timeout(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +       struct drm_sched_job *bad;
> > > > +
> > > > +       if (!sched->has_timeout)
> > > > +               return false;
> > > > +
> > > > +       sched->has_timeout = false;
> > > > +
> > > > +       spin_lock(&sched->job_list_lock);
> > > > +       bad = list_first_entry_or_null(&sched->pending_list,
> > > > +                                      struct drm_sched_job, list);
> > > > +
> > > > +       if (!bad) {
> > > > +               spin_unlock(&sched->job_list_lock);
> > > > +               return false;
> > > > +       }
> > > > +
> > > > +       spin_unlock(&sched->job_list_lock);
> > > > +
> > > > +       if (sched->timeout_wq == system_wq) {
> > > > +               /*
> > > > +                * If driver has no specific requirements about serializing
> > > > +                * reset wrt. other engines, just call timedout_job() directly
> > > > +                */
> > > > +               sched->ops->timedout_job(bad);
> > > > +       } else {
> > > > +               /*
> > > > +                * Otherwise queue it on timeout_wq and wait for it to complete
> > > > +                */
> > > > +               ... more typing needed here ...
> > > > +       }
> > > > +
> > > > +       if (sched->free_guilty) {
> > > > +               sched->ops->free_job(bad);
> > > > +               sched->free_guilty = false;
> > > > +       }
> > > > +}
> > > > +
> > > >  /**
> > > >   * drm_sched_main - main scheduler thread
> > > >   *
> > > > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
> > > >
> > > >                 wait_event_interruptible(sched->wake_up_worker,
> > > >                                          (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> > > > +                                        handle_timeout(sched) ||
> > > >                                          (!drm_sched_blocked(sched) &&
> > > >                                           (entity = drm_sched_select_entity(sched))) ||
> > > >                                          kthread_should_stop());
> > > > ---------------------
> > > >
> > > > drm_sched_fault() and the sw timeout handler would just set
> > > > sched->has_timeout and kick sched->wake_up_worker.
> > > >
> > > > And since we handle the timeout case after
> > > > drm_sched_get_cleanup_job(), we know that all of the successfully
> > > > completed jobs have already been popped off the list, and won't be
> > > > unfairly maligned.
> > > >
> > > > BR,
> > > > -R
> > > >
> > > > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@xxxxxxx> wrote:
> > > > >
> > > > > [AMD Official Use Only]
> > > > >
> > > > > Okay, I will re-prepare this patch
> > > > >
> > > > > Thanks
> > > > >
> > > > > ------------------------------------------
> > > > > Monk Liu | Cloud-GPU Core team
> > > > > ------------------------------------------
> > > > >
> > > > > -----Original Message-----
> > > > > From: Daniel Vetter <daniel@xxxxxxxx>
> > > > > Sent: Tuesday, August 31, 2021 9:02 PM
> > > > > To: Liu, Monk <Monk.Liu@xxxxxxx>
> > > > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Chen, Jingwen <Jingwen.Chen@xxxxxxx>
> > > > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > > > >
> > > > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> > > > > > Can we please have some actual commit message here, with detailed
> > > > > > explanation of the race/bug/whatever, how you fix it and why this is
> > > > > > the best option?
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > > > > > > tested-by: jingwen chen <jingwen.chen@xxxxxxx>
> > > > > > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
> > > > > > > Signed-off-by: jingwen chen <jingwen.chen@xxxxxxx>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
> > > > > > >  1 file changed, 4 insertions(+), 20 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index ecf8140..894fdb24 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >         sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > > > >
> > > > > > >         /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > > > > > > +       if (!__kthread_should_park(sched->thread))
> > > > > >
> > > > > > This is a __ function, i.e. considered internal, and it's lockless
> > > > > > atomic, i.e. unordered. And you're not explaining why this works.
> > > > > >
> > > > > > Iow it's probably buggy, and just unconditionally parking the
> > > > > > kthread is probably the right thing to do. If it's not the right thing
> > > > > > to do, there's a bug here for sure.
> > > > >
> > > > > Also why don't we reuse the function drivers already have to stop a
> > > > > scheduler thread? We seem to have two kthread_park now, that's probably
> > > > > one too much.
> > > > > -Daniel
> > > > >
> > > > > > > +       kthread_park(sched->thread);
> > > > > > > +
> > > > > > >         spin_lock(&sched->job_list_lock);
> > > > > > >         job = list_first_entry_or_null(&sched->pending_list,
> > > > > > >                                        struct drm_sched_job, list);
> > > > > > >
> > > > > > >         if (job) {
> > > > > > > -               /*
> > > > > > > -                * Remove the bad job so it cannot be freed by concurrent
> > > > > > > -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > -                * is parked at which point it's safe.
> > > > > > > -                */
> > > > > > > -               list_del_init(&job->list);
> > > > > > >                 spin_unlock(&sched->job_list_lock);
> > > > > > >
> > > > > > > +               /* vendor's timeout_job should call drm_sched_start() */
> > > > > > >                 status = job->sched->ops->timedout_job(job);
> > > > > > >
> > > > > > >                 /*
> > > > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > >         kthread_park(sched->thread);
> > > > > > >
> > > > > > >         /*
> > > > > > > -        * Reinsert back the bad job here - now it's safe as
> > > > > > > -        * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > -        * bad job at this point - we parked (waited for) any in progress
> > > > > > > -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > -        * now until the scheduler thread is unparked.
> > > > > > > -        */
> > > > > > > -       if (bad && bad->sched == sched)
> > > > > > > -               /*
> > > > > > > -                * Add at the head of the queue to reflect it was the earliest
> > > > > > > -                * job extracted.
> > > > > > > -                */
> > > > > > > -               list_add(&bad->list, &sched->pending_list);
> > > > > > > -
> > > > > > > -       /*
> > > > > > >          * Iterate the job list from later to earlier one and either deactive
> > > > > > >          * their HW callbacks or remove them from pending list if they already
> > > > > > >          * signaled.
> > > > > > > --
> > > > > > > 2.7.4
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> >
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
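For completeness, the fault-injection half of the handle_timeout() sketch
quoted above reduces to setting the flag and waking the scheduler thread.
This is a sketch of that proposal only: has_timeout is the hypothetical
field from the sketch, not a member the upstream drm_gpu_scheduler has,
and it is not how today's drm_sched_fault() behaves (that one just kicks
the timeout work with no delay):

---------------------
/*
 * Sketch of drm_sched_fault() under the handle_timeout() proposal above.
 * sched->has_timeout is the hypothetical flag from that sketch; the
 * wake_up_worker waitqueue is the one drm_sched_main() already sleeps on.
 */
void drm_sched_fault(struct drm_gpu_scheduler *sched)
{
	sched->has_timeout = true;
	wake_up_interruptible(&sched->wake_up_worker);
}
---------------------

Whether that is enough ordering for the rest of the reset path is exactly
the barrier question Daniel raises earlier in the thread.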