On Thu, Feb 6, 2020 at 6:50 AM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote: > > Am 06.02.20 um 12:10 schrieb Lucas Stach: > > Hi all, > > > > On Mi, 2020-02-05 at 19:24 +0100, Lucas Stach wrote: > >> Hi Andrey, > >> > >> This commit breaks all drivers, which may bail out of the timeout > >> processing as they wish to extend the timeout (etnaviv, v3d). > >> > >> Those drivers currently just return from the timeout handler before > >> calling drm_sched_stop(), which means with this commit applied we are > >> removing the first job from the ring_mirror_list, but never put it > >> back. This leads to jobs getting lost from the ring mirror, which then > >> causes quite a bit of fallout like unsignaled fences. > >> > >> Not sure yet what to do about it, we can either add a function to add > >> the job back to the ring_mirror if the driver wants to extend the > >> timeout, or we could look for another way to stop > >> drm_sched_cleanup_jobs from freeing jobs that are currently in timeout > >> processing. > > So after thinking about this a bit more my opinion is that we need to > > revert this change for now and go back to the drawing board for the > > scheduler timeout handling. > > > > Right now this starts to feel like a big midlayer mistake with all the > > very intricate intertwining between the drivers and the scheduler. The > > rules on when it's safe to manipulate the ring mirror and when > > completed jobs are signaled and freed are not really well specified. > > The fact that we need to mutate state in order to get rid of races > > instead of having a single big "timeout processing is owner of the > > scheduler state for now" is a big fat warning sign IMHO. > > Yes, that strongly feels like a hack to me as well. But I didn't had > time and still haven't to take a closer look and suggest something better. > In that case, can someone send me a revert? Alex > Christian. > > > > > It took me far longer than I'd like to admit to understand the failure > > mode with fences not getting signaled after a GPU hang. The back and > > forth between scheduler and driver code makes things really hard to > > follow. > > > > Regards, > > Lucas > > > >> Regards, > >> Lucas > >> > >> On Mo, 2019-11-25 at 15:51 -0500, Andrey Grodzovsky wrote: > >>> Problem: > >>> Due to a race between drm_sched_cleanup_jobs in sched thread and > >>> drm_sched_job_timedout in timeout work there is a possiblity that > >>> bad job was already freed while still being accessed from the > >>> timeout thread. > >>> > >>> Fix: > >>> Instead of just peeking at the bad job in the mirror list > >>> remove it from the list under lock and then put it back later when > >>> we are garanteed no race with main sched thread is possible which > >>> is after the thread is parked. > >>> > >>> v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs. > >>> > >>> v3: Rebase on top of drm-misc-next. v2 is not needed anymore as > >>> drm_sched_get_cleanup_job already has a lock there. > >>> > >>> v4: Fix comments to relfect latest code in drm-misc. > >>> > >>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> > >>> Reviewed-by: Christian König <christian.koenig@xxxxxxx> > >>> Tested-by: Emily Deng <Emily.Deng@xxxxxxx> > >>> --- > >>> drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++++ > >>> 1 file changed, 27 insertions(+) > >>> > >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > >>> index 6774955..1bf9c40 100644 > >>> --- a/drivers/gpu/drm/scheduler/sched_main.c > >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c > >>> @@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work) > >>> unsigned long flags; > >>> > >>> sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work); > >>> + > >>> + /* Protects against concurrent deletion in drm_sched_get_cleanup_job */ > >>> + spin_lock_irqsave(&sched->job_list_lock, flags); > >>> job = list_first_entry_or_null(&sched->ring_mirror_list, > >>> struct drm_sched_job, node); > >>> > >>> if (job) { > >>> + /* > >>> + * Remove the bad job so it cannot be freed by concurrent > >>> + * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread > >>> + * is parked at which point it's safe. > >>> + */ > >>> + list_del_init(&job->node); > >>> + spin_unlock_irqrestore(&sched->job_list_lock, flags); > >>> + > >>> job->sched->ops->timedout_job(job); > >>> > >>> /* > >>> @@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work) > >>> job->sched->ops->free_job(job); > >>> sched->free_guilty = false; > >>> } > >>> + } else { > >>> + spin_unlock_irqrestore(&sched->job_list_lock, flags); > >>> } > >>> > >>> spin_lock_irqsave(&sched->job_list_lock, flags); > >>> @@ -370,6 +383,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad) > >>> kthread_park(sched->thread); > >>> > >>> /* > >>> + * Reinsert back the bad job here - now it's safe as > >>> + * drm_sched_get_cleanup_job cannot race against us and release the > >>> + * bad job at this point - we parked (waited for) any in progress > >>> + * (earlier) cleanups and drm_sched_get_cleanup_job will not be called > >>> + * now until the scheduler thread is unparked. > >>> + */ > >>> + if (bad && bad->sched == sched) > >>> + /* > >>> + * Add at the head of the queue to reflect it was the earliest > >>> + * job extracted. > >>> + */ > >>> + list_add(&bad->node, &sched->ring_mirror_list); > >>> + > >>> + /* > >>> * Iterate the job list from later to earlier one and either deactive > >>> * their HW callbacks or remove them from mirror list if they already > >>> * signaled. > >> _______________________________________________ > >> dri-devel mailing list > >> dri-devel@xxxxxxxxxxxxxxxxxxxxx > >> https://lists.freedesktop.org/mailman/listinfo/dri-devel > > _______________________________________________ > > amd-gfx mailing list > > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > dri-devel mailing list > dri-devel@xxxxxxxxxxxxxxxxxxxxx > https://lists.freedesktop.org/mailman/listinfo/dri-devel _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx