On 26/09/2019 16:48, Grodzovsky, Andrey wrote: > > On 9/26/19 11:23 AM, Steven Price wrote: >> On 26/09/2019 16:14, Grodzovsky, Andrey wrote: >>> On 9/26/19 10:16 AM, Steven Price wrote: >>>> drm_sched_cleanup_jobs() attempts to free finished jobs, however because >>>> it is called as the condition of wait_event_interruptible() it must not >>>> sleep. Unfortuantly some free callbacks (notibly for Panfrost) do sleep. >>>> >>>> Instead let's rename drm_sched_cleanup_jobs() to >>>> drm_sched_get_cleanup_job() and simply return a job for processing if >>>> there is one. The caller can then call the free_job() callback outside >>>> the wait_event_interruptible() where sleeping is possible before >>>> re-checking and returning to sleep if necessary. >>>> >>>> Signed-off-by: Steven Price <steven.price@xxxxxxx> >>>> --- >>>> Changes from v3: >>>> * drm_sched_main() re-arms the timeout for the next job after calling >>>> free_job() >>>> >>>> drivers/gpu/drm/scheduler/sched_main.c | 45 +++++++++++++++----------- >>>> 1 file changed, 26 insertions(+), 19 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>>> index 9a0ee74d82dc..148468447ba9 100644 >>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>> @@ -622,43 +622,41 @@ static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb) >>>> } >>>> >>>> /** >>>> - * drm_sched_cleanup_jobs - destroy finished jobs >>>> + * drm_sched_get_cleanup_job - fetch the next finished job to be destroyed >>>> * >>>> * @sched: scheduler instance >>>> * >>>> - * Remove all finished jobs from the mirror list and destroy them. >>>> + * Returns the next finished job from the mirror list (if there is one) >>>> + * ready for it to be destroyed. >>>> */ >>>> -static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched) >>>> +static struct drm_sched_job * >>>> +drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched) >>>> { >>>> + struct drm_sched_job *job = NULL; >>>> unsigned long flags; >>>> >>>> /* Don't destroy jobs while the timeout worker is running */ >>>> if (sched->timeout != MAX_SCHEDULE_TIMEOUT && >>>> !cancel_delayed_work(&sched->work_tdr)) >>>> - return; >>>> - >>>> + return NULL; >>>> >>>> - while (!list_empty(&sched->ring_mirror_list)) { >>>> - struct drm_sched_job *job; >>>> + spin_lock_irqsave(&sched->job_list_lock, flags); >>>> >>>> - job = list_first_entry(&sched->ring_mirror_list, >>>> + job = list_first_entry_or_null(&sched->ring_mirror_list, >>>> struct drm_sched_job, node); >>>> - if (!dma_fence_is_signaled(&job->s_fence->finished)) >>>> - break; >>>> >>>> - spin_lock_irqsave(&sched->job_list_lock, flags); >>>> + if (job && dma_fence_is_signaled(&job->s_fence->finished)) { >>>> /* remove job from ring_mirror_list */ >>>> list_del_init(&job->node); >>>> - spin_unlock_irqrestore(&sched->job_list_lock, flags); >>>> - >>>> - sched->ops->free_job(job); >>>> + } else { >>>> + job = NULL; >>>> + /* queue timeout for next job */ >>>> + drm_sched_start_timeout(sched); >>>> } >>>> >>>> - /* queue timeout for next job */ >>>> - spin_lock_irqsave(&sched->job_list_lock, flags); >>>> - drm_sched_start_timeout(sched); >>>> spin_unlock_irqrestore(&sched->job_list_lock, flags); >>>> >>>> + return job; >>>> } >>>> >>>> /** >>>> @@ -698,12 +696,21 @@ static int drm_sched_main(void *param) >>>> struct drm_sched_fence *s_fence; >>>> struct drm_sched_job *sched_job; >>>> struct dma_fence *fence; >>>> + struct drm_sched_job *cleanup_job = NULL; >>>> >>>> wait_event_interruptible(sched->wake_up_worker, >>>> - (drm_sched_cleanup_jobs(sched), >>>> + (cleanup_job = drm_sched_get_cleanup_job(sched)) || >>>> (!drm_sched_blocked(sched) && >>>> (entity = drm_sched_select_entity(sched))) || >>>> - kthread_should_stop())); >>>> + kthread_should_stop()); >>>> + >>>> + while (cleanup_job) { >>>> + sched->ops->free_job(cleanup_job); >>>> + /* queue timeout for next job */ >>>> + drm_sched_start_timeout(sched); >>>> + >>>> + cleanup_job = drm_sched_get_cleanup_job(sched); >>>> + } >>> >>> Why drm_sched_start_timeout is called both here and inside >>> drm_sched_get_cleanup_job ? And also why call it multiple times in the >>> loop instead of only once after the loop is done ? >> Christian pointed out to be that the first thing >> drm_sched_get_cleanup_job does is call cancel_delayed_work(), and if >> that returns false then it bails out with a NULL return. So to actually >> get another job (if one exists) the timeout has to be restarted. > > > For this case where timeout work already in progress note that > drm_sched_job_timedout restarts the timeout in it's end so it should be > ok to restart the timeout unconditionally inside > drm_sched_get_cleanup_job as it was done before. I may have misinterpreted Christian[1], but I interpreted the below as meaning that he'd prefer the caller (drm_sched_main()) to handle the drm_sched_start_timeout() in this case: On 26/09/2019 14:50, Koenig, Christian wrote: >> Alternatively the caller could manually re-arm the timeout after >> handling the job free. > > I don't see anything that could go wrong immediately, but that is > probably the cleaner approach. [1] https://lore.kernel.org/lkml/d9b23ef8-f784-0bce-c34a-fa02002db1ea@xxxxxxx/ I wasn't entirely sure if it was safe to leave the timeout running while the job free was going on which would be the case if drm_sched_get_cleanup_job() re-armed unconditionally. >> >> It's also necessary to restart the timeout in the case where the return >> is NULL which is handled in the function itself. >> >> TBH I'm not sure whether this while loop is worth it - it may be better >> to replace it with simply: >> >> if (cleanup_job) { >> sched->ops->free_job(cleanup_job); >> /* queue timeout for next job */ >> drm_sched_start_timeout(sched); >> } >> >> The outer loop would then handle the next call to >> drm_sched_get_cleanup_job() as necessary. > > > What outer loop ? There's a loop round the code in drm_sched_main(): while (!kthread_should_stop()) { [...] wait_event_interruptible(...) while (cleanup_job) {...} if (!entity) continue; [...] } So after handling the cleanup_job case, the outer loop will loop back round to the wait_event_interruptible() call. One other concern is whether it's possible for the thread to be signalled to stop before the clean up has been completed. The while loop I added ensures that all jobs are actually cleaned up before kthread_should_stop() is checked. Steve _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel