Re: [PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2021-06-29 7:24 a.m., Christian König wrote:

Am 29.06.21 um 13:18 schrieb Boris Brezillon:
Hi Christian,

On Tue, 29 Jun 2021 13:03:58 +0200
Christian König <christian.koenig@xxxxxxx> wrote:

Am 29.06.21 um 09:34 schrieb Boris Brezillon:
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.
Well, we had already tried this and it didn't worked the way it is expected.

The major problem is that you not only want to serialize the queue, but
rather have a single reset for all queues.

Otherwise you schedule multiple resets for each hardware queue. E.g. for
your 3 hardware queues you would reset the GPU 3 times if all of them
time out at the same time (which is rather likely).

Using a single delayed work item doesn't work either because you then
only have one timeout.

What could be done is to cancel all delayed work items from all stopped
schedulers.
drm_sched_stop() does that already, and since we call drm_sched_stop()
on all queues in the timeout handler, we end up with only one global
reset happening even if several queues report a timeout at the same
time.

Ah, nice. Yeah, in this case it should indeed work as expected.

Feel free to add an Acked-by: Christian König <christian.koenig@xxxxxxx> to it.

Regards,
Christian.


Seems to me that for this to work we need to change cancel_delayed_work to cancel_delayed_work_sync so not only pending TO handlers  are cancelled but also any in progress are waited for and to to prevent rearming. Also move it right after kthread_park - before we start touching pending list.

Andrey




Regards,

Boris




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux