On Fri, Feb 17, 2023 at 12:45 PM Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:
>
> On Fri, Feb 17, 2023 at 09:00:49AM -0800, Rob Clark wrote:
> > On Fri, Feb 17, 2023 at 8:03 AM Tvrtko Ursulin
> > <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
> > >
> > >
> > > On 17/02/2023 14:55, Rob Clark wrote:
> > > > On Fri, Feb 17, 2023 at 4:56 AM Tvrtko Ursulin
> > > > <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
> > > >>
> > > >>
> > > >> On 16/02/2023 18:19, Rodrigo Vivi wrote:
> > > >>> On Tue, Feb 14, 2023 at 11:14:00AM -0800, Rob Clark wrote:
> > > >>>> On Fri, Feb 10, 2023 at 5:07 AM Tvrtko Ursulin
> > > >>>> <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
> > > >>>>>
> > > >>>>> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > > >>>>>
> > > >>>>> In i915 we have this concept of "wait boosting" where we give a priority boost
> > > >>>>> for instance to fences which are actively waited upon from userspace. This has
> > > >>>>> its pros and cons and can certainly be discussed at length. However, the fact is
> > > >>>>> some workloads really like it.
> > > >>>>>
> > > >>>>> The problem is that with the arrival of drm syncobj, and the new userspace waiting
> > > >>>>> entry point it added, the waitboost mechanism was bypassed. Hence I cooked up
> > > >>>>> this mini series really (really) quickly to see if some discussion can be had.
> > > >>>>>
> > > >>>>> It adds a concept of a "wait count" to dma fence, which is incremented for every
> > > >>>>> explicit dma_fence_enable_sw_signaling and dma_fence_add_wait_callback (like
> > > >>>>> dma_fence_add_callback but from explicit/userspace wait paths).
> > > >>>>
> > > >>>> I was thinking about a similar thing, but in the context of dma_fence
> > > >>>> (or rather sync_file) fd poll()ing. How does the kernel differentiate
> > > >>>> between "housekeeping" poll()ers that don't want to trigger a boost but
> > > >>>> simply want to know when to do cleanup, and waiters who are waiting with
> > > >>>> some urgency? I think we could use EPOLLPRI for this purpose.
> > > >>>>
> > > >>>> Not sure how that translates to waits via the syncobj. But I think we
> > > >>>> want to let userspace give some hint about urgent vs housekeeping
> > > >>>> waits.
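To make the poll() idea above a bit more concrete, here is a rough
userspace-side sketch of how an "urgent" wait could be distinguished from a
housekeeping poll if EPOLLPRI were given that meaning for sync_file fds.
Purely illustrative -- nothing implements this convention today, and the
helper name is made up:

#include <stdbool.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Wait on a sync_file fd.  Hypothetical convention: adding EPOLLPRI says
 * "I am blocked on this fence right now, please boost", while a plain
 * EPOLLIN poll is housekeeping and should not affect GPU frequency.
 */
static int wait_fence(int sync_file_fd, int timeout_ms, bool urgent)
{
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sync_file_fd };
        struct epoll_event out;
        int epfd, ret;

        if (urgent)
                ev.events |= EPOLLPRI;  /* the proposed "boost me" hint */

        epfd = epoll_create1(0);
        if (epfd < 0)
                return -1;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sync_file_fd, &ev) < 0) {
                close(epfd);
                return -1;
        }

        ret = epoll_wait(epfd, &out, 1, timeout_ms);
        close(epfd);
        return ret;     /* 1 = fence signalled, 0 = timeout, <0 = error */
}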
> > > >>>
> > > >>> Should the hint be on the waits, or should the hints be on the executed
> > > >>> context?
> > > >>>
> > > >>> In the end we need some way to quickly ramp up the frequency to avoid
> > > >>> the execution bubbles.
> > > >>>
> > > >>> waitboost is trying to guess that, but in some cases it guesses wrong
> > > >>> and wastes power.
> > > >>
> > > >> Do we have a list of workloads which shows who benefits and who loses
> > > >> from the current implementation of waitboost?
> > > >>> btw, this is something that other drivers might need:
> > > >>>
> > > >>> https://gitlab.freedesktop.org/drm/amd/-/issues/1500#note_825883
> > > >>> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> > > >>
> > > >> I have several issues with the context hint if it would directly
> > > >> influence frequency selection in the "more power" direction.
> > > >>
> > > >> First of all, assume a context hint would replace the waitboost. Which
> > > >> applications would need to set it to restore the lost performance, and
> > > >> how would they set it?
> > > >>
> > > >> Then I don't even think userspace necessarily knows. Think of a layer
> > > >> like OpenCL. It doesn't really know in advance the profile of
> > > >> submissions vs waits. It depends on the CPU vs GPU speed, so hardware
> > > >> generation, and the actual size of the workload, which can be influenced
> > > >> by the application (or user) and not the library.
> > > >>
> > > >> The approach also lends itself well to the "arms race" where every
> > > >> application can say "Me me me, I am the most important workload there is!".
> > > >
> > > > since there is discussion happening in two places:
> > > >
> > > > https://gitlab.freedesktop.org/drm/intel/-/issues/8014#note_1777433
> > > >
> > > > What I think you might want is a ctx boost_mask which lets an app or
> > > > driver disable certain boost signals/classes. Where fence waits are
> > > > one class of boost, but hypothetical other signals like touchscreen
> > > > (or other) input events could be another class of boost. A compute
> > > > workload might be interested in fence wait boosts but couldn't care less
> > > > about input events.
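Roughly what I have in mind is something like the strawman below. None of
these names exist anywhere; it is just to illustrate the opt-out direction,
where a context clears the boost classes it does not care about rather than
asking for "more power":

#include <linux/types.h>

/* Hypothetical boost classes a context could opt out of: */
#define BOOST_CLASS_FENCE_WAIT  (1 << 0)  /* someone waits on our fences */
#define BOOST_CLASS_INPUT       (1 << 1)  /* input events (touchscreen etc.) */
#define BOOST_CLASS_FLIP        (1 << 2)  /* fence is part of a flip chain */

/* Made-up per-context ioctl payload; all classes enabled by default. */
struct hypothetical_ctx_set_boost_mask {
        __u32 ctx_id;
        __u32 boost_mask;
};

/* E.g. a compute context that cares about fence-wait boosts but not about
 * input boosts would set boost_mask = BOOST_CLASS_FENCE_WAIT.
 */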
> > >
> > > I think it can only be apps which could have any chance of knowing whether
> > > their use of a library is latency sensitive or not. Which means new
> > > library extensions and their adoption. So I have some strong reservations
> > > that route is feasible.
> > >
> > > Or we tie it to priority, which many drivers do. Normal and above gets the
> > > boosting and whatever lowered itself does not (aka SCHED_IDLE/SCHED_BATCH).
> >
> > yeah, that sounds reasonable.
> >
>
> on that gitlab-issue discussion Emma Anholt was against using the priority
> to influence frequency since that should be more about latency.
>
> or are we talking about a different kind of priority here?

I was thinking to only _not_ boost on the lowest priority, but boost on
normal/high priority.  But it is not something I feel too strongly about.
I.e. deciding on policy doesn't affect, or need to block, getting the
dma_fence and syncobj plumbing in place.

BR,
-R

> > > Related note is that we lack any external control of our scheduling
> > > decisions, so we really do suck compared to other scheduling domains like
> > > CPU and IO etc.
> > >
> > > >> The last concern is for me shared with the proposal to expose deadlines
> > > >> or high priority waits as explicit uapi knobs. Both come under the "what
> > > >> the application told us it will do" category vs what it actually does. So I
> > > >> think it is slightly weaker than basing decisions on waits.
> > > >>
> > > >> The current waitboost is a bit detached from that problem because when
> > > >> we waitboost for flips we _know_ it is an actual framebuffer in the flip
> > > >> chain. When we waitboost for waits we also know someone is waiting. We
> > > >> are not trusting userspace telling us this will be a buffer in the flip
> > > >> chain or that this is a context which will have a certain duty-cycle.
> > > >>
> > > >> But yes, even if the input is truthful, the latter is still only a
> > > >> heuristic because nothing says all waits are important. AFAIU it just
> > > >> happened to work well in the past.
> > > >>
> > > >> I do understand I am effectively arguing for more heuristics, which may
> > > >> sound a bit against the common wisdom. This is because in general I
> > > >> think the logic to do the right thing, be it in the driver or in the
> > > >> firmware, can work best if it has a holistic view. Simply put, it needs
> > > >> to have more inputs for the decisions it is making.
> > > >>
> > > >> That is what my series is proposing - adding a common signal of "someone
> > > >> in userspace is waiting". What happens with that signal need not be
> > > >> defined (promised) in the uapi contract.
> > > >>
> > > >> Say you route it to SLPC logic. It doesn't need to do with it what
> > > >> legacy i915 is doing today. It just needs to do something which works
> > > >> best for the majority of workloads. It can even ignore it if that works for it.
> > > >>
> > > >> Finally, back to the immediate problem: when people replace the OpenCL
> > > >> NEO driver with clvk, performance tanks. Because the former does waits
> > > >> using i915-specific ioctls and so triggers waitboost, while the latter waits
> > > >> on syncobj, so no waitboost and performance is bad. What short term solution
> > > >> can we come up with? Or do we say not to use clvk? I also wonder if other
> > > >> Vulkan based stuff is perhaps missing those easy performance gains..
> > > >>
> > > >> Perhaps strictly speaking Rob's and my proposals are not mutually
> > > >> exclusive. Yes, I could piggyback on his with an "immediate deadline for
> > > >> waits" idea, but they could also be separate concepts if we concluded the
> > > >> "someone is waiting" signal is useful to have. Or if it takes too long to
> > > >> upstream the full deadline idea.
> > > >
> > > > Let me re-spin my series and add the syncobj wait flag and i915 bits
> > >
> > > I think the wait flag is questionable unless it is inverted to imply waits
> > > which can be de-prioritized (again the same parallel with SCHED_IDLE/BATCH).
> > > Having a flag which "makes things faster" IMO should require elevated
> > > privilege (to avoid the "arms race"), at which point I fear it quickly
> > > becomes uninteresting.
> >
> > I guess you could make the argument in either direction. Making the
> > default behavior ramp up clocks could be a power regression.
>
> yeap, exactly the media / video conference case.
>
> > I also think the "arms race" scenario isn't really as much of a
> > problem as you think. There aren't _that_ many things using the GPU
> > at the same time (compared to the # of things using the CPU). And a lot of
> > mobile games throttle framerate to avoid draining your battery too
> > quickly (after all, if your battery is dead you can't keep buying loot
> > boxes or whatever).
>
> Very good point.
>
> And in the GPU case they rely a lot on the profiles. Which, btw, seems
> to be the Radeon solution. They boost the freq if the high performance
> profile is selected and don't care about the execution bubbles if the low
> or mid profiles are selected, or something like that.
>
> > > > adapted from your patches.. I think the basic idea of deadlines
> > > > (which includes "I want it NOW" ;-)) isn't controversial, but the
> > > > original idea got caught up in some bikeshedding (what about compositors
> > > > that wait on fences in userspace to decide which surfaces to update in
> > > > the next frame), plus me getting busy and generally not having a good
> > > > plan for how to leverage this from VM guests (which is becoming
> > > > increasingly important for CrOS). I think I can build on some ongoing
> > > > virtgpu fencing improvement work to solve the latter. But now that we
> > > > have a 2nd use-case for this, it makes sense to respin.
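For reference, the rough shape of the idea, from memory of the old RFC --
the names and details below may not match what the respin actually ends up
looking like.  The fence core grows an optional, hint-only deadline op which
drivers can implement or ignore, and a userspace wait that wants something
like the old waitboost is then just an immediate deadline:

/* Sketch only, not the actual patch.  dma_fence_ops would gain an
 * optional hook:
 *
 *      void (*set_deadline)(struct dma_fence *fence, ktime_t deadline);
 *
 * plus a core helper that forwards the hint to the driver:
 */
static inline void dma_fence_set_deadline(struct dma_fence *fence,
                                          ktime_t deadline)
{
        if (!dma_fence_is_signaled(fence) && fence->ops->set_deadline)
                fence->ops->set_deadline(fence, deadline);
}

/* A sync_file poll or syncobj wait that wants a boost then becomes:
 *
 *      dma_fence_set_deadline(fence, ktime_get());
 */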
> > >
> > > Sure, I was looking at the old version already. It is interesting. But
> > > it also IMO needs quite a bit more work to approach achieving what is
> > > implied by the name of the feature. It would need proper deadline-based
> > > sched job picking, and even then drm sched is mostly just a frontend.
> > > So once jobs are past runnable status and handed over to the backend,
> > > without further driver work it probably wouldn't be very effective beyond
> > > very lightly loaded systems.
> >
> > Yes, but all of that is not part of dma_fence ;-)
> >
> > A pretty common challenging usecase is still the single fullscreen
> > game, where scheduling isn't the problem, but landing at an
> > appropriate GPU freq absolutely is. (UI workloads are perhaps more
> > interesting from a scheduler standpoint, but they generally aren't
> > challenging from a load/freq standpoint.)
> >
> > Fwiw, the original motivation of the series was to implement something
> > akin to i915 pageflip boosting without having to abandon the atomic
> > helpers. (And, I guess it would also let i915 preserve that feature
> > if it switched to atomic helpers.. I'm unsure if there are still other
> > things blocking i915's migration.)
> >
> > > Then if we fast forward to a world where schedulers perhaps become fully
> > > deadline aware (we even had this for i915 a few years back) then the
> > > question will be whether equating waits with immediate deadlines still
> > > works. Maybe not too well, because we wouldn't have the ability to
> > > distinguish the "someone is waiting" signal from the otherwise
> > > propagated deadlines.
> >
> > Is there any other way to handle a wait boost than expressing it as an
> > ASAP deadline?
> >
> > BR,
> > -R
> >
> > >
> > > Regards,
> > >
> > > Tvrtko
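PS: to make the "wait boost is just an ASAP deadline" point concrete, here
is a rough sketch of what a driver backend could do with such a hint.  The
example_* names are made up and this is illustrative only, not code from any
existing driver:

/* Hypothetical driver implementation of a deadline hint. */
static void example_fence_set_deadline(struct dma_fence *fence,
                                       ktime_t deadline)
{
        struct example_fence *f = container_of(fence, struct example_fence, base);

        /* A deadline at or before "now" is effectively the old waitboost:
         * somebody is blocked on this fence right now, so ramp the GPU up.
         */
        if (ktime_compare(deadline, ktime_get()) <= 0)
                example_devfreq_boost(f->gpu);
        else
                example_devfreq_note_deadline(f->gpu, deadline);
}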