Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color pipeline is needed

Xaver Hugl <xaver.hugl@xxxxxxx> · Fri, 27 Oct 2023 14:59:54 +0200

I'm afraid that would not be very useful. It indeed depends on the
refresh rate, but also on how close to vblank the compositor does its
commits and on what the latency requirements for the currently shown
content are.

When the compositor presents a fullscreen video with frames that are
queued up in advance, needing a full frame to program the atomic commit
could be acceptable, but when the user moves the cursor or plays a game,
the compositor needs to do the commits as close to vblank as possible.
Without a known upper bound on the time it takes to program the
hardware, that's not doable.
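
To make the scheduling constraint concrete, here is a minimal sketch of how a compositor might derive its commit deadline. The function name, the microsecond units, and the safety margin are assumptions made for illustration; no existing KMS interface reports a worst-case programming time today.

```python
from typing import Optional


def latest_commit_time_us(next_vblank_us: int,
                          worst_case_program_us: Optional[int],
                          safety_margin_us: int = 500) -> Optional[int]:
    """Latest time at which a commit can start and still hit next_vblank_us.

    worst_case_program_us is a hypothetical driver-provided upper bound on
    hardware programming time. Returns None when no bound is known, because
    no deadline can be computed in that case.
    """
    if worst_case_program_us is None:
        return None
    return next_vblank_us - worst_case_program_us - safety_margin_us


# 60 Hz: the next vblank is about 16667 us away, the bound is 2000 us.
print(latest_commit_time_us(16667, 2000))  # 14167
print(latest_commit_time_us(16667, None))  # None
```

With a bound, the compositor can delay the commit until shortly before the deadline. Without one, it has to commit as early as possible, which costs latency for cursor movement and games.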

On Fri, 27 Oct 2023 at 14:01, Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote:
On Fri, 27 Oct 2023 12:01:32 +0200

Sebastian Wick <sebastian.wick@xxxxxxxxxx> wrote:

> On Fri, Oct 27, 2023 at 10:59:25AM +0200, Michel Dänzer wrote:

> > On 10/26/23 21:25, Alex Goins wrote:  

> > > On Thu, 26 Oct 2023, Sebastian Wick wrote:  

> > >> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote:  

> > >>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT)

> > >>> Alex Goins <agoins@xxxxxxxxxx> wrote:

> > >>>  

> > >>>> Despite being programmable, the LUTs are updated in a manner that is less

> > >>>> efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful

> > >>>> if there was some way to tag operations according to their performance,

> > >>>> for example so that clients can prefer a high performance one when they

> > >>>> intend to do an animated transition? I recall from the XDC HDR workshop

> > >>>> that this is also an issue with AMD's 3DLUT, where updates can be too

> > >>>> slow to animate.  

> > >>>

> > >>> I can certainly see such information being useful, but then we need to

> > >>> somehow quantify the performance.  

> > > 

> > > Right, which wouldn't even necessarily be universal, could depend on the given

> > > host, GPU, etc. It could just be a relative performance indication, to give an

> > > order of preference. That wouldn't tell you if it can or can't be animated, but

> > > when choosing between two LUTs to animate you could prefer the higher

> > > performance one.

> > >   

> > >>>

> > >>> What I was left puzzled about after the XDC workshop is whether it is

> > >>> possible to pre-load configurations in the background (slow), and then

> > >>> quickly switch between them. Hardware-wise, I mean.  

> > > 

> > > This works fine for our "fast" LUTs, you just point them to a surface in video

> > > memory and they flip to it. You could keep multiple surfaces around and flip

> > > between them without having to reprogram them in software. We can easily do that

> > > with enumerated curves, populating them when the driver initializes instead of

> > > waiting for the client to request them. You can even point multiple hardware

> > > LUTs to the same video memory surface, if they need the same curve.

> > >   

> > >>

> > >> We could define that pipelines with a lower ID are to be preferred over

> > >> higher IDs.  

> > > 

> > > Sure, but this isn't just an issue with a pipeline as a whole, but the

> > > individual elements within it and how to use them in a given context.

> > >   

> > >>

> > >> The issue is that if programming a pipeline becomes too slow to be

> > >> useful it probably should just not be made available to user space.  

> > > 

> > > It's not that programming the pipeline is overall too slow. The LUTs we have

> > > that are relatively slow to program are meant to be set infrequently, or even

> > > just once, to allow the scaler and tone mapping operator to operate in fixed

> > > point PQ space. You might still want the tone mapper, so you would choose a

> > > pipeline that includes them, but when it comes to e.g. animating a night light,

> > > you would want to choose a different LUT for that purpose.

> > >   

> > >>

> > >> The prepare-commit idea for blob properties would help to make the

> > >> pipelines usable again, but until then it's probably a good idea to just

> > >> not expose those pipelines.  

> > > 

> > > The prepare-commit idea actually wouldn't work for these LUTs, because they are

> > > programmed using methods instead of pointing them to a surface. I'm not

> > > sure how slow it actually is; I would need to benchmark it. I think not exposing

> > > them at all would be overkill, since it would mean you can't use the preblending

> > > scaler or tonemapper, and animation isn't necessary for that.

> > > 

> > > The AMD 3DLUT is another example of a LUT that is slow to update, and it would

> > > obviously be a major loss if that wasn't exposed. There just needs to be some

> > > way for clients to know if they are going to kill performance by trying to

> > > change it every frame.  

> > 

> > Might a first step be to require the ALLOW_MODESET flag to be set when changing the values for a colorop which is too slow to be updated per refresh cycle?

> > 

> > This would tell the compositor: You can use this colorop, but you can't change its values on the fly.  

> 

> I argued before that changing any color op to passthrough should never

> require ALLOW_MODESET and while this is really hard to guarantee from a

> driver perspective I still believe that it's better to not expose any

> feature requiring ALLOW_MODESET or taking too long to program to be

> useful for per-frame changes.

> 

> When user space has ways to figure out if going back to a specific state

> (in this case setting everything to bypass) is possible without ALLOW_MODESET, we can

> revisit this decision, but until then, let's keep things simple and only

> expose things that work reliably without ALLOW_MODESET and fast enough

> to work for per-frame changes.

> 

> Harry, Pekka: Should we document this? It obviously restricts what can

> be exposed but exposing things that can't be used by user space isn't

> useful.

In an ideal world... but in the real world, I don't know.

Would it help if there was a list collected, with all the things in

various hardware that are known to be too heavy to reprogram every

refresh? Maybe that would allow a more educated decision?

I bet that depends also on the refresh rate.

I would probably be fine with some sort of update cost classification

on colorops, and the kernel keeping track of blobs: if userspace sets

the same blob on the same colorop that is already there (by blob ID, no

need to compare contents), then it's a no-op change.
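
As a rough illustration of that idea, the sketch below models an update cost class per colorop and no-op detection by blob ID. The names `UpdateCost`, `ColorOp`, `needs_reprogram` and `commit_is_per_frame_safe` are hypothetical and not part of any existing KMS uAPI; a two-level classification is only one possible choice.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict


class UpdateCost(IntEnum):
    # Hypothetical classification, not an existing KMS property.
    PER_FRAME = 0   # cheap enough to change every refresh
    INFREQUENT = 1  # too slow to animate; set once or rarely


@dataclass
class ColorOp:
    op_id: int
    cost: UpdateCost
    blob_id: int = 0  # assumption: 0 means no blob is set


def needs_reprogram(op: ColorOp, new_blob_id: int) -> bool:
    """Compare by blob ID only; blob contents are never inspected."""
    return op.blob_id != new_blob_id


def commit_is_per_frame_safe(ops: Dict[int, ColorOp],
                             request: Dict[int, int]) -> bool:
    """True if every real change in `request` touches only PER_FRAME colorops.

    `request` maps colorop ID to the blob ID userspace wants to set.
    """
    return all(
        ops[op_id].cost == UpdateCost.PER_FRAME
        for op_id, blob_id in request.items()
        if needs_reprogram(ops[op_id], blob_id)
    )


ops = {
    1: ColorOp(1, UpdateCost.PER_FRAME, blob_id=10),
    2: ColorOp(2, UpdateCost.INFREQUENT, blob_id=20),
}
print(commit_is_per_frame_safe(ops, {1: 11, 2: 20}))  # True: op 2 keeps blob 20
print(commit_is_per_frame_safe(ops, {2: 21}))         # False: slow op changes
```

The point of comparing IDs rather than contents is that userspace can resend its full state on every commit without paying for the slow colorops, as long as it reuses the same blobs.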

Anyway, I really like reading Alex Goins' reply, it seems we are very

much on the same page here. :-)

Thanks,

pq