Re: blocking ops in drm_sched_cleanup_jobs()

Sorry for the delayed response; I was busy with other things last week.

On 17.09.19 at 14:46, Steven Price wrote:
> On 17/09/2019 09:42, Koenig, Christian wrote:
>> Hi Steven,
>>
>> thought about that issue a bit more and I think I came up with a 
>> solution.
>>
>> What you could do is split drm_sched_cleanup_jobs() into two
>> functions:
>>
>> One that checks whether there are any jobs to be cleaned up, and one
>> which does the actual cleanup.
>>
>> This way we could call drm_sched_cleanup_jobs() outside of the
>> wait_event_interruptible().
>
> Yes, that seems like a good solution - there doesn't seem to be a good
> reason why the actual job cleanup needs to be done within the
> wait_event_interruptible() condition. I did briefly attempt that
> before, but I couldn't work out exactly which condition should cause
> the wake-up (my initial attempt caused continuous wake-ups).

Basically you need something like the following:

1. Test if the timeout worker is running:

if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
    !cancel_delayed_work(&sched->work_tdr))
        return false;

2. Test if there is any job ready to be cleaned up:

job = list_first_entry_or_null(&sched->ring_mirror_list,
                               struct drm_sched_job, node);
if (!job || !dma_fence_is_signaled(&job->s_fence->finished))
        return false;

That should basically do it.
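
Put together, a minimal sketch of the split could look like the following. The helper name drm_sched_ready_cleanup() and the exact placement in drm_sched_main() are illustrative assumptions, not code from this thread:

----8<-----
/*
 * Sketch only: the helper name is an illustrative assumption, not code
 * from this thread.  Returns true when drm_sched_main() has finished
 * jobs to clean up.
 */
static bool drm_sched_ready_cleanup(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;

	/* 1. Bail out if the timeout worker is running and cannot be
	 *    cancelled. */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !cancel_delayed_work(&sched->work_tdr))
		return false;

	/* 2. Check whether the oldest job in the mirror list has finished. */
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	return job && dma_fence_is_signaled(&job->s_fence->finished);
}

/*
 * In drm_sched_main() the check goes into the wait condition, while the
 * actual (possibly sleeping) cleanup happens after the wait has returned:
 */
wait_event_interruptible(sched->wake_up_worker,
			 drm_sched_ready_cleanup(sched) ||
			 (!drm_sched_blocked(sched) &&
			  (entity = drm_sched_select_entity(sched))) ||
			 kthread_should_stop());

drm_sched_cleanup_jobs(sched);
----8<-----

The important property is that everything which can sleep runs only after wait_event_interruptible() has returned, i.e. with the task back in TASK_RUNNING.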

Regards,
Christian.

>
> I'm not back in the office until Monday, but I'll have another go then 
> at splitting the checking and the actual freeing of the jobs.
>
> Thanks,
>
> Steve
>
>>
>> Regards,
>> Christian.
>>
>> On 13.09.19 at 16:50, Steven Price wrote:
>>> Hi,
>>>
>>> I hit the below splat randomly with panfrost. From what I can tell this
>>> is a more general issue which would affect other drivers.
>>>
>>> ----8<-----
>>> [58604.913130] ------------[ cut here ]------------
>>> [58604.918590] WARNING: CPU: 1 PID: 1758 at kernel/sched/core.c:6556 __might_sleep+0x74/0x98
>>> [58604.927965] do not call blocking ops when !TASK_RUNNING; state=1 set at [<0c590494>] prepare_to_wait_event+0x104/0x164
>>> [58604.940047] Modules linked in: panfrost gpu_sched
>>> [58604.945370] CPU: 1 PID: 1758 Comm: pan_js Not tainted 5.3.0-rc1+ #13
>>> [58604.952500] Hardware name: Rockchip (Device Tree)
>>> [58604.957815] [<c0111150>] (unwind_backtrace) from [<c010c99c>] (show_stack+0x10/0x14)
>>> [58604.966521] [<c010c99c>] (show_stack) from [<c07adbb4>] (dump_stack+0x9c/0xd4)
>>> [58604.974639] [<c07adbb4>] (dump_stack) from [<c0121da8>] (__warn+0xe8/0x104)
>>> [58604.982462] [<c0121da8>] (__warn) from [<c0121e08>] (warn_slowpath_fmt+0x44/0x6c)
>>> [58604.990867] [<c0121e08>] (warn_slowpath_fmt) from [<c014eccc>] (__might_sleep+0x74/0x98)
>>> [58604.999973] [<c014eccc>] (__might_sleep) from [<c07c73d8>] (__mutex_lock+0x38/0x948)
>>> [58605.008690] [<c07c73d8>] (__mutex_lock) from [<c07c7d00>] (mutex_lock_nested+0x18/0x20)
>>> [58605.017841] [<c07c7d00>] (mutex_lock_nested) from [<bf00b54c>] (panfrost_gem_free_object+0x60/0x10c [panfrost])
>>> [58605.029430] [<bf00b54c>] (panfrost_gem_free_object [panfrost]) from [<bf00cecc>] (panfrost_job_put+0x138/0x150 [panfrost])
>>> [58605.042076] [<bf00cecc>] (panfrost_job_put [panfrost]) from [<bf00121c>] (drm_sched_cleanup_jobs+0xc8/0xe0 [gpu_sched])
>>> [58605.054417] [<bf00121c>] (drm_sched_cleanup_jobs [gpu_sched]) from [<bf001300>] (drm_sched_main+0xcc/0x26c [gpu_sched])
>>> [58605.066620] [<bf001300>] (drm_sched_main [gpu_sched]) from [<c0146cfc>] (kthread+0x13c/0x154)
>>> [58605.076226] [<c0146cfc>] (kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
>>> [58605.084346] Exception stack(0xe959bfb0 to 0xe959bff8)
>>> [58605.090046] bfa0: 00000000 00000000 00000000 00000000
>>> [58605.099250] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>>> [58605.108480] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
>>> [58605.116210] irq event stamp: 179
>>> [58605.119955] hardirqs last  enabled at (187): [<c017f7e4>] console_unlock+0x564/0x5c4
>>> [58605.128935] hardirqs last disabled at (202): [<c017f308>] console_unlock+0x88/0x5c4
>>> [58605.137788] softirqs last  enabled at (216): [<c0102334>] __do_softirq+0x18c/0x548
>>> [58605.146543] softirqs last disabled at (227): [<c0129528>] irq_exit+0xc4/0x10c
>>> [58605.154618] ---[ end trace f65bdbd9ea9adfc0 ]---
>>> ----8<-----
>>>
>>> The problem is that drm_sched_main() calls drm_sched_cleanup_jobs() as
>>> part of the condition of wait_event_interruptible:
>>>
>>>> wait_event_interruptible(sched->wake_up_worker,
>>>>                          (drm_sched_cleanup_jobs(sched),
>>>>                          (!drm_sched_blocked(sched) &&
>>>>                           (entity = drm_sched_select_entity(sched))) ||
>>>>                          kthread_should_stop()));
>>> When drm_sched_cleanup_jobs() is called *after* a wait (i.e. after
>>> prepare_to_wait_event() has been called), then any might_sleep() will
>>> moan loudly about it. This doesn't seem to happen often (I've only
>>> triggered it once) because usually drm_sched_cleanup_jobs() either
>>> doesn't sleep or does the sleeping during the first call that
>>> wait_event_interruptible() makes (which is before the task state is 
>>> set).
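
For reference, wait_event_interruptible() expands to roughly the loop below - a simplified sketch of the macro from include/linux/wait.h, not the verbatim kernel code. It shows why only the first evaluation of the condition happens while the task is still TASK_RUNNING:

----8<-----
/* Simplified sketch of wait_event_interruptible(wq_head, condition). */
if (!(condition)) {	/* first check: task is still TASK_RUNNING */
	for (;;) {
		long ret = prepare_to_wait_event(&wq_head, &wq_entry,
						 TASK_INTERRUPTIBLE);
		/*
		 * From here on the task state is TASK_INTERRUPTIBLE, so a
		 * blocking op inside 'condition' - like the mutex taken by
		 * drm_sched_cleanup_jobs() - trips might_sleep().
		 */
		if (condition)
			break;
		if (ret)	/* interrupted by a signal */
			break;
		schedule();
	}
	finish_wait(&wq_head, &wq_entry);
}
----8<-----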
>>>
>>> I don't really understand why drm_sched_cleanup_jobs() needs to be
>>> called here, a simple change like below 'fixes' it. But I presume
>>> there's some reason for the call being part of the
>>> wait_event_interruptible condition. Can anyone shed light on this?
>>>
>>> The code was introduced in commit 5918045c4ed4 ("drm/scheduler: 
>>> rework job destruction")
>>>
>>> Steve
>>>
>>> ----8<-----
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 9a0ee74d82dc..528f295e3a31 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -699,11 +699,12 @@ static int drm_sched_main(void *param)
>>>  		struct drm_sched_job *sched_job;
>>>  		struct dma_fence *fence;
>>>  
>>> +		drm_sched_cleanup_jobs(sched);
>>> +
>>>  		wait_event_interruptible(sched->wake_up_worker,
>>> -					 (drm_sched_cleanup_jobs(sched),
>>>  					 (!drm_sched_blocked(sched) &&
>>>  					  (entity = drm_sched_select_entity(sched))) ||
>>> -					 kthread_should_stop()));
>>> +					 kthread_should_stop());
>>>  
>>>  		if (!entity)
>>>  			continue;
>>
>
