On Fri, Dec 8, 2023 at 12:27 PM Joshua Ashton <joshua@xxxxxxxxx> wrote:
>
> FWIW, we are shipping this right now in the SteamOS Preview channel
> (probably going to Stable soon) and it seems to be working as expected,
> fixing issues there in instances where we need to composite and the
> compositor work we are forced to do would take longer than the
> compositor redzone to vblank.
>
> Previously, in high gfx workloads like Cyberpunk using 100% of the GPU,
> we would consistently miss the deadline as composition could take
> anywhere from 2-6ms fairly randomly.
>
> Now the time for the compositor's work to complete looks pretty
> consistent and well in time in gpuvis for every frame.

I was mostly just trying to look up the information to verify that it
was set up correctly, but I guess Marek already did and provided you
with that info, so it's probably fine as is.

> The only times we are not meeting the deadline now is when there is an
> application that uses very little GPU and finishes incredibly quickly,
> and the compositor is doing significantly more work (e.g. FSR from
> 800p -> 4K or whatever), but that's a separate problem that can likely
> be solved by inlining some of the composition work with the client's
> dmabuf work if it has focus, to avoid those clock bubbles.
>
> I heard some musings about dmabuf deadline kernel work recently, but
> I'm not sure if any of that is applicable to AMD.

I think something like a workload hint would be more useful. We did a
few patch sets to allow userspace to provide a hint to the kernel about
the workload type so that the kernel could adjust the power management
heuristics accordingly, but there were concerns that the UMDs would
have to maintain application lists to select which heuristic worked
best for each application. Maybe it would be better to provide a
general classification? E.g., if the GL or Vulkan app uses these
extensions, it's probably a compute-type application vs. something more
graphics-y. The usual trade-off between power and performance. In
general, just letting the firmware pick the clocks based on perf
counters seems to work best. Maybe a general workload hint set by the
compositor based on the content type it's displaying (video vs. gaming
vs. desktop) would be a better option?

The deadline stuff doesn't really align well with what we can do with
our firmware and seems ripe for abuse. Apps can just ask for high
clocks all the time, which is great for performance but not great for
power. Plus, there is not much room for anything other than max clocks,
since you don't know how big the workload is or which clocks are the
limiting factor.

Alex

>
> - Joshie 🐸✨
>
> On 12/8/23 15:33, Marek Olšák wrote:
> > On Fri, Dec 8, 2023 at 9:57 AM Christian König <christian.koenig@xxxxxxx
> > <mailto:christian.koenig@xxxxxxx>> wrote:
> >
> >     Am 08.12.23 um 12:43 schrieb Friedrich Vock:
> >     > On 08.12.23 10:51, Christian König wrote:
> >     >> Well, long story short, Alex and I have been digging up the
> >     >> documentation for this and as far as we can tell this isn't correct.
> >     > Huh. I initially talked to Marek about this, adding him in Cc.
> >
> >     Yeah, from the userspace side all you need to do is to set the bit as
> >     far as I can tell.
> >
> >     >>
> >     >> You need to do quite a bit more before you can turn on this feature.
> >     >> What userspace side do you refer to?
> >     > I was referring to the Mesa merge request I made
> >     > (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26462
> >     <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26462>).
> >     > If/When you have more details about what else needs to be done, feel
> >     > free to let me know.
> >
> >     For example, the hardware specification explicitly states that the
> >     kernel driver should make sure that only one app/queue is using this at
> >     the same time. That might work for now since we should only have a
> >     single compute priority queue, but we are not 100% sure yet.
> >
> >
> > This is incorrect. While the hw documentation says it's considered
> > "unexpected programming", it also says that the hardware algorithm
> > handles it correctly, and it describes what happens in this case:
> > tunneled waves from different queues are treated as equal.
> >
> > Marek
>
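
For anyone following along, here is a rough sketch of what the userspace
side of this can look like from a compositor's point of view: asking for a
realtime global-priority queue through VK_KHR_global_priority, which the
driver stack can then map onto its high-priority hardware queue. This is
only an illustration under those assumptions, not the code from the Mesa
MR above; the compute queue family index is a placeholder and error
handling is mostly omitted.

  /* Sketch: create a VkDevice with one realtime-priority queue.
   * Assumes `compute_family` was found via
   * vkGetPhysicalDeviceQueueFamilyProperties(). */
  #include <vulkan/vulkan.h>

  VkDevice create_device_with_realtime_queue(VkPhysicalDevice phys_dev,
                                             uint32_t compute_family)
  {
      float prio = 1.0f;

      /* Ask for the highest global (cross-process) priority. */
      VkDeviceQueueGlobalPriorityCreateInfoKHR global_prio = {
          .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_GLOBAL_PRIORITY_CREATE_INFO_KHR,
          .globalPriority = VK_QUEUE_GLOBAL_PRIORITY_REALTIME_KHR,
      };

      VkDeviceQueueCreateInfo queue_info = {
          .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
          .pNext = &global_prio,
          .queueFamilyIndex = compute_family,
          .queueCount = 1,
          .pQueuePriorities = &prio,
      };

      const char *exts[] = { VK_KHR_GLOBAL_PRIORITY_EXTENSION_NAME };

      VkDeviceCreateInfo device_info = {
          .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
          .queueCreateInfoCount = 1,
          .pQueueCreateInfos = &queue_info,
          .enabledExtensionCount = 1,
          .ppEnabledExtensionNames = exts,
      };

      VkDevice device = VK_NULL_HANDLE;
      /* Realtime priority may be refused for unprivileged callers; the
       * spec allows VK_ERROR_NOT_PERMITTED_KHR here, so a compositor would
       * want to fall back to a lower priority in that case. */
      vkCreateDevice(phys_dev, &device_info, NULL, &device);
      return device;
  }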