Re: [PATCH] drm/amdgpu: guard ib scheduling while in reset

"Grodzovsky, Andrey" <Andrey.Grodzovsky@xxxxxxx> · Wed, 30 Oct 2019 15:05:27 +0000

I see.

OK, I will add to myself a TODO about struct completion approach.

Andrey

On 10/30/19 11:00 AM, Koenig, Christian wrote:
> Yeah, and exactly that's the problem :) You need a global lock covering
> all schedulers.
>
> Otherwise you end up in hell's kitchen again with taking all those locks
> in the right order.
>
> Christian.
>
> Am 30.10.19 um 15:56 schrieb Grodzovsky, Andrey:
>> Can you elaborate on what is the tricky part with the lock ? I assumed
>> we just use per scheduler lock.
>>
>> Andrey
>>
>> On 10/30/19 10:50 AM, Christian König wrote:
>>> A lock inside the scheduler is rather tricky to implement.
>>>
>>> What you need to do is to get rid of the park()/unpark() hack in
>>> drm_sched_entity_fini().
>>>
>>> We could do this with a struct completion or convert the scheduler
>>> from a thread to a work item.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 30.10.19 um 15:44 schrieb Grodzovsky, Andrey:
>>>> That good  as proof of RCA but I still think we should grab a dedicated
>>>> lock inside scheduler since the race is internal to scheduler code so
>>>> this better to handle it inside the scheduler code to make the fix apply
>>>> for all drivers using it.
>>>>
>>>> Andrey
>>>>
>>>> On 10/30/19 4:44 AM, S, Shirish wrote:
>>>>>>>> We still have it and isn't doing kthread_park()/unpark() from
>>>>>>>> drm_sched_entity_fini while GPU reset in progress defeats all the
>>>>>>>> purpose of drm_sched_stop->kthread_park ? If
>>>>>>>> drm_sched_entity_fini-> kthread_unpark happens AFTER
>>>>>>>> drm_sched_stop->kthread_park nothing prevents from another (third)
>>>>>>>> thread keep submitting job to HW which will be picked up by the
>>>>>>>> unparked scheduler thread try to submit to HW but fail because the
>>>>>>>> HW ring is deactivated.
>>>>>>>>
>>>>>>>> If so maybe we should serialize calls to
>>>>>>>> kthread_park/unpark(sched->thread) ?
>>>>>>>>
>>>>>>> Yeah, that was my thinking as well. Probably best to just grab the
>>>>>>> reset lock before calling drm_sched_entity_fini().
>>>>>> Shirish - please try locking &adev->lock_reset around calls to
>>>>>> drm_sched_entity_fini as Christian suggests and see if this actually
>>>>>> helps the issue.
>>>>>>
>>>>> Yes that also works.
>>>>>
>>>>> Regards,
>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx