Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread

On 9/12/23 17:13, Boris Brezillon wrote:
On Tue, 12 Sep 2023 16:49:09 +0200
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

On Tue, 12 Sep 2023 16:33:01 +0200
Danilo Krummrich <dakr@xxxxxxxxxx> wrote:

On 9/12/23 16:28, Boris Brezillon wrote:
On Thu, 17 Aug 2023 13:13:31 +0200
Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
I think that's a misunderstanding. I'm not trying to say that it is
*always* beneficial to fill up the ring as much as possible. But I think
it is under certain circumstances, namely exactly those I described for
Nouveau.

As mentioned, in Nouveau the size of a job is only really limited by the
ring size, which means that one job can (but does not necessarily) fill
up the whole ring. We both agree that this is inefficient, because it
potentially results in the HW running dry due to hw_submission_limit == 1.

I recognize you said that one should define hw_submission_limit and
adjust the other parts of the equation accordingly; the options I see are:

(1) Increase the ring size while keeping the maximum job size.
(2) Decrease the maximum job size while keeping the ring size.
(3) Let the scheduler track the actual job size rather than the maximum
job size.

(1) results in potentially wasted ring memory, because jobs don't always
reach the maximum job size, but the scheduler assumes they do.

(2) results in more IOCTLs from userspace for the same amount of IBs;
more jobs also mean more memory allocations and more work being
submitted to the workqueue (with Matt's patches).

(3) doesn't seem to have any of those drawbacks.

What would be your take on that?
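
To make (3) a bit more concrete, here is a rough sketch of what per-job
size accounting in the scheduler could look like (submission_units and
units_in_flight, an atomic_t that would live in struct
drm_gpu_scheduler, are made-up names; this is not the actual
implementation):

        struct drm_sched_job {
                /* ... existing drm_sched_job members ... */
                u32 submission_units;   /* ring space this job consumes */
        };

        /* hw_submission_limit is interpreted as a number of units, not jobs. */
        static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
                                        struct drm_sched_job *job)
        {
                return atomic_read(&sched->units_in_flight) +
                       job->submission_units <= sched->hw_submission_limit;
        }

        static void drm_sched_job_begin(struct drm_sched_job *job)
        {
                atomic_add(job->submission_units, &job->sched->units_in_flight);
        }

        static void drm_sched_job_done(struct drm_sched_job *job)
        {
                /* Free the job's units so the next queued job may now fit. */
                atomic_sub(job->submission_units, &job->sched->units_in_flight);
        }

With something like this, hw_submission_limit could simply be set to the
ring size (in whatever unit the driver uses), and a single huge job no
longer limits the whole ring to one job in flight.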

Actually, if none of the other drivers is interested in a more precise
way of keeping track of ring utilization, I'd be totally fine doing it
in a driver-specific way. However, unfortunately I don't see how that
would be possible.

I'm not entirely sure, but I think PowerVR is pretty close to your
description: job sizes are dynamic, and the ring buffer size is
picked by the driver at queue initialization time. What we did was to
set hw_submission_limit to an arbitrarily high value of 64k (we could
have used something like ringbuf_size/min_job_size instead), and then
have the control flow implemented with ->prepare_job() [1] (CCCB is the
PowerVR ring buffer). This allows us to maximize ring buffer utilization
while still allowing dynamic-size jobs.
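
Roughly, that ->prepare_job() based flow control looks like this
(simplified sketch; the pvr_* names are approximations of the PowerVR
code, and pvr_queue_get_ringbuf_space_fence() is made up for
illustration):

        static struct dma_fence *
        pvr_queue_prepare_job(struct drm_sched_job *sched_job,
                              struct drm_sched_entity *s_entity)
        {
                struct pvr_job *job = to_pvr_job(sched_job);
                struct pvr_queue *queue = container_of(s_entity,
                                                       struct pvr_queue,
                                                       entity);

                /*
                 * If the job's command sequence doesn't currently fit in
                 * the CCCB, return a fence that gets signaled once enough
                 * space has been released, so the scheduler retries the
                 * job later.
                 */
                if (!pvr_cccb_cmdseq_fits(&queue->cccb, job->cccb_size))
                        return pvr_queue_get_ringbuf_space_fence(queue,
                                                                 job->cccb_size);

                return NULL;
        }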

I guess this would work, but I think it would be better to bake this in,
especially if more drivers do have this need. I already have an
implementation [1] for doing that in the scheduler. My plan was to push
that as soon as Matt sends out V3.

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95

PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
native fence waits are passed to the FW, and those add to the job size.
When we know our job is ready for execution (all non-native deps are
signaled), we evict already signaled native deps (or native fences) to
shrink the job size even further, but that's something we need to
calculate late if we want the job size to be minimal. Of course, we can
always over-estimate the job size, but if we go for a full-blown
drm_sched integration, I wonder if it wouldn't be preferable to have a
->get_job_size() callback returning the number of units needed by a job,
and have the core pick 1 when the hook is not implemented.
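
Something along these lines in the core, where ->get_job_size() is a
hypothetical drm_sched_backend_ops hook that does not exist today:

        static u32 drm_sched_job_units(struct drm_gpu_scheduler *sched,
                                       struct drm_sched_job *job)
        {
                /* Drivers with dynamic job sizes report the ring space they need. */
                if (sched->ops->get_job_size)
                        return sched->ops->get_job_size(job);

                /* Default: one job accounts for one unit of hw_submission_limit. */
                return 1;
        }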

FWIW, I think the last time I asked how to do that, I was pointed to
->prepare_job() by someone (don't remember if it was Daniel or
Christian), hence the PowerVR implementation. If that's still the
preferred solution, there's some opportunity to have a generic layer to
automate ringbuf utilization tracking and some helpers to prepare
wait_for_ringbuf dma_fences that drivers could return from
->prepare_job() (those fences would then be signaled when the driver
calls drm_ringbuf_job_done() and the next job waiting for ringbuf space
now fits in the ringbuf).
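
A very rough sketch of what such a generic layer could look like (all
the drm_ringbuf_* names and signatures below are hypothetical):

        struct drm_ringbuf {
                spinlock_t lock;
                u32 size;                       /* total ring buffer units */
                u32 used;                       /* units currently in flight */
                struct list_head waiters;       /* fences waiting for free space */
        };

        /* Called from the driver's ->prepare_job(); NULL means "fits right now". */
        struct dma_fence *drm_ringbuf_reserve(struct drm_ringbuf *rb, u32 units)
        {
                struct dma_fence *fence = NULL;

                spin_lock(&rb->lock);
                if (rb->used + units <= rb->size)
                        rb->used += units;
                else
                        fence = drm_ringbuf_add_waiter(rb, units); /* hypothetical */
                spin_unlock(&rb->lock);

                return fence;
        }

        /* Called by the driver once a job retires and its ring space is released. */
        void drm_ringbuf_job_done(struct drm_ringbuf *rb, u32 units)
        {
                spin_lock(&rb->lock);
                rb->used -= units;
                /* Signal the waiters whose reservation now fits. */
                drm_ringbuf_signal_waiters(rb);         /* hypothetical */
                spin_unlock(&rb->lock);
        }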


Not sure I like that; it's basically a different implementation to work
around the limitations of an implementation that is supposed to cover
this case in general.
