On Tue, 12 Sep 2023 16:49:09 +0200
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> On Tue, 12 Sep 2023 16:33:01 +0200
> Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
>
> > On 9/12/23 16:28, Boris Brezillon wrote:
> > > On Thu, 17 Aug 2023 13:13:31 +0200
> > > Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
> > >
> > >> I think that's a misunderstanding. I'm not trying to say that it is
> > >> *always* beneficial to fill up the ring as much as possible. But I think
> > >> it is under certain circumstances, exactly those circumstances I
> > >> described for Nouveau.
> > >>
> > >> As mentioned, in Nouveau the size of a job is only really limited by the
> > >> ring size, which means that one job can (but does not necessarily) fill
> > >> up the whole ring. We both agree that this is inefficient, because it
> > >> potentially results in the HW running dry due to hw_submission_limit == 1.
> > >>
> > >> I recognize you said that one should define hw_submission_limit and
> > >> adjust the other parts of the equation accordingly; the options I see are:
> > >>
> > >> (1) Increase the ring size while keeping the maximum job size.
> > >> (2) Decrease the maximum job size while keeping the ring size.
> > >> (3) Let the scheduler track the actual job size rather than the maximum
> > >> job size.
> > >>
> > >> (1) results in potentially wasted ring memory, because we're not
> > >> always reaching the maximum job size, but the scheduler assumes so.
> > >>
> > >> (2) results in more IOCTLs from userspace for the same amount of IBs,
> > >> and more jobs result in more memory allocations and more work being
> > >> submitted to the workqueue (with Matt's patches).
> > >>
> > >> (3) doesn't seem to have any of those drawbacks.
> > >>
> > >> What would be your take on that?
> > >>
> > >> Actually, if none of the other drivers is interested in a more precise
> > >> way of keeping track of the ring utilization, I'd be totally fine to do
> > >> it in a driver-specific way. However, unfortunately I don't see how this
> > >> would be possible.
> > >
> > > I'm not entirely sure, but I think PowerVR is pretty close to your
> > > description: job sizes are dynamic, and the ring buffer size is
> > > picked by the driver at queue initialization time. What we did was to
> > > set hw_submission_limit to an arbitrarily high value of 64k (we could
> > > have used something like ringbuf_size/min_job_size instead), and then
> > > have the control flow implemented with ->prepare_job() [1] (CCCB is the
> > > PowerVR ring buffer). This allows us to maximize ring buffer utilization
> > > while still allowing dynamic-size jobs.
> >
> > I guess this would work, but I think it would be better to bake this in,
> > especially if more drivers do have this need. I already have an
> > implementation [1] for doing that in the scheduler. My plan was to push
> > that as soon as Matt sends out V3.
> >
> > [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95
>
> PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
> native fence waits are passed to the FW, and those add to the job size.
> When we know our job is ready for execution (all non-native deps are
> signaled), we evict already signaled native deps (or native fences) to
> shrink the job size further, but that's something we need to
> calculate late if we want the job size to be minimal. Of course, we can
> always over-estimate the job size, but if we go for a full-blown
> drm_sched integration, I wonder if it wouldn't be preferable to have a
> ->get_job_size() callback returning the number of units needed by a job,
> and have the core pick 1 when the hook is not implemented.

FWIW, I think last time I asked how to do that, I was pointed to
->prepare_job() by someone (don't remember if it was Daniel or Christian),
hence the PowerVR implementation. If that's still the preferred solution,
there's some opportunity to have a generic layer to automate ringbuf
utilization tracking and some helpers to prepare wait_for_ringbuf
dma_fences that drivers could return from ->prepare_job() (those fences
would then be signaled when the driver calls drm_ringbuf_job_done() and
the next job waiting for ringbuf space now fits in the ringbuf).
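
To make that a bit more concrete, here's a rough sketch of what such a
generic layer could look like. To be clear, everything named drm_ringbuf_*
below is made up for the sake of discussion (only ->prepare_job() and the
dma_fence API exist today), the unit count is assumed to come from the
driver (e.g. through the ->get_job_size() hook suggested above, defaulting
to 1 when not implemented), and locking/error handling is simplified:

/* Hypothetical ring utilization tracker; none of these symbols exist today. */

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct drm_ringbuf {
        spinlock_t lock;
        u32 capacity;                   /* total units in the ring buffer */
        u32 used;                       /* units reserved by in-flight jobs */
        struct list_head waiters;       /* jobs blocked on ring space, FIFO */
        u64 fence_context;              /* from dma_fence_context_alloc(1) */
        u64 fence_seqno;
};

struct drm_ringbuf_waiter {
        struct dma_fence fence;         /* must stay the first member */
        struct list_head node;
        u32 units;
};

static const char *drm_ringbuf_fence_driver_name(struct dma_fence *f)
{
        return "drm_ringbuf";
}

static const char *drm_ringbuf_fence_timeline_name(struct dma_fence *f)
{
        return "ringbuf-space";
}

static const struct dma_fence_ops drm_ringbuf_fence_ops = {
        .get_driver_name = drm_ringbuf_fence_driver_name,
        .get_timeline_name = drm_ringbuf_fence_timeline_name,
        /* no .release: dma_fence_free() frees the waiter (fence is 1st member) */
};

/*
 * Called from the driver's ->prepare_job(): reserve @units of ring space,
 * or return a wait_for_ringbuf fence the scheduler waits on before running
 * the job.
 */
struct dma_fence *drm_ringbuf_reserve(struct drm_ringbuf *ring, u32 units)
{
        struct drm_ringbuf_waiter *w;
        unsigned long flags;

        if (WARN_ON(!units || units > ring->capacity))
                return ERR_PTR(-EINVAL);

        /* Allocate up front so the check-and-enqueue below stays atomic. */
        w = kzalloc(sizeof(*w), GFP_KERNEL);
        if (!w)
                return ERR_PTR(-ENOMEM);

        spin_lock_irqsave(&ring->lock, flags);
        /* Fast path: the ring has room and nobody is queued ahead of us. */
        if (list_empty(&ring->waiters) &&
            ring->used + units <= ring->capacity) {
                ring->used += units;
                spin_unlock_irqrestore(&ring->lock, flags);
                kfree(w);
                return NULL;    /* no extra dependency, the job can run now */
        }

        /* Slow path: hand back a fence that signals once the job fits. */
        w->units = units;
        dma_fence_init(&w->fence, &drm_ringbuf_fence_ops, &ring->lock,
                       ring->fence_context, ++ring->fence_seqno);
        dma_fence_get(&w->fence);       /* reference held by the waiter list */
        list_add_tail(&w->node, &ring->waiters);
        spin_unlock_irqrestore(&ring->lock, flags);

        return &w->fence;       /* this reference is consumed by the scheduler */
}

/* Called by the driver when a job's ring space is released again. */
void drm_ringbuf_job_done(struct drm_ringbuf *ring, u32 units)
{
        struct drm_ringbuf_waiter *w, *tmp;
        unsigned long flags;

        spin_lock_irqsave(&ring->lock, flags);
        ring->used -= units;

        /* Wake waiters in submission order, as long as they now fit. */
        list_for_each_entry_safe(w, tmp, &ring->waiters, node) {
                if (ring->used + w->units > ring->capacity)
                        break;

                ring->used += w->units;
                list_del(&w->node);
                dma_fence_signal_locked(&w->fence);
                dma_fence_put(&w->fence);       /* drop the list's reference */
        }
        spin_unlock_irqrestore(&ring->lock, flags);
}

A driver's ->prepare_job() would then just compute the job's unit count
(evicting already signaled native fences at that point, as you describe, to
keep it minimal), return whatever drm_ringbuf_reserve() hands back, and call
drm_ringbuf_job_done() when the ring space is retired. Waiters are woken in
FIFO order so a large job can't be starved by smaller ones sneaking past it;
whether that's the right policy is of course debatable.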