On Tue, 13 Jun 2023 12:29:55 -0400 Christopher Braga <quic_cbraga@xxxxxxxxxxx> wrote: > On 6/13/2023 4:23 AM, Pekka Paalanen wrote: > > On Mon, 12 Jun 2023 12:56:57 -0400 > > Christopher Braga <quic_cbraga@xxxxxxxxxxx> wrote: > > > >> On 6/12/2023 5:21 AM, Pekka Paalanen wrote: > >>> On Fri, 9 Jun 2023 19:11:25 -0400 > >>> Christopher Braga <quic_cbraga@xxxxxxxxxxx> wrote: > >>> > >>>> On 6/9/2023 12:30 PM, Simon Ser wrote: > >>>>> Hi Christopher, > >>>>> > >>>>> On Friday, June 9th, 2023 at 17:52, Christopher Braga <quic_cbraga@xxxxxxxxxxx> wrote: > >>>>> > >>>>>>> The new COLOROP objects also expose a number of KMS properties. Each has a > >>>>>>> type, a reference to the next COLOROP object in the linked list, and other > >>>>>>> type-specific properties. Here is an example for a 1D LUT operation: > >>>>>>> > >>>>>>> Color operation 42 > >>>>>>> ├─ "type": enum {Bypass, 1D curve} = 1D curve > >>>>>>> ├─ "1d_curve_type": enum {LUT, sRGB, PQ, BT.709, HLG, …} = LUT > >>>>>> The options sRGB / PQ / BT.709 / HLG would select hard-coded 1D > >>>>>> curves? Will different hardware be allowed to expose a subset of these > >>>>>> enum values? > >>>>> > >>>>> Yes. Only hardcoded LUTs supported by the HW are exposed as enum entries. > >>>>> > >>>>>>> ├─ "lut_size": immutable range = 4096 > >>>>>>> ├─ "lut_data": blob > >>>>>>> └─ "next": immutable color operation ID = 43 > >>>>>>> > >>>>>> Some hardware has per channel 1D LUT values, while others use the same > >>>>>> LUT for all channels. We will definitely need to expose this in the > >>>>>> UAPI in some form. > >>>>> > >>>>> Hm, I was assuming per-channel 1D LUTs here, just like the existing GAMMA_LUT/ > >>>>> DEGAMMA_LUT properties work. If some hardware can't support that, it'll need > >>>>> to get exposed as another color operation block. > >>>>> > >>>>>>> To configure this hardware block, user-space can fill a KMS blob with > >>>>>>> 4096 u32 > >>>>>>> entries, then set "lut_data" to the blob ID. Other color operation types > >>>>>>> might > >>>>>>> have different properties. > >>>>>>> > >>>>>> The bit-depth of the LUT is an important piece of information we should > >>>>>> include by default. Are we assuming that the DRM driver will always > >>>>>> reduce the input values to the resolution supported by the pipeline? > >>>>>> This could result in differences between the hardware behavior > >>>>>> and the shader behavior. > >>>>>> > >>>>>> Additionally, some pipelines are floating point while others are fixed. > >>>>>> How would user space know if it needs to pack 32 bit integer values vs > >>>>>> 32 bit float values? > >>>>> > >>>>> Again, I'm deferring to the existing GAMMA_LUT/DEGAMMA_LUT. These use a common > >>>>> definition of LUT blob (u16 elements) and it's up to the driver to convert. > >>>>> > >>>>> Using a very precise format for the uAPI has the nice property of making the > >>>>> uAPI much simpler to use. User-space sends high precision data and it's up to > >>>>> drivers to map that to whatever the hardware accepts. > >>>>> > >>>> Conversion from a larger uint type to a smaller type sounds low effort, > >>>> however if a block works in a floating point space things are going to > >>>> get messy really quickly. If the block operates in FP16 space and the > >>>> interface is 16 bits we are good, but going from 32 bits to FP16 (such > >>>> as in the matrix case or 3DLUT) is less than ideal. > >>> > >>> Hi Christopher, > >>> > >>> are you thinking of precision loss, or the overhead of conversion? > >>> > >>> Conversion from N-bit fixed point to N-bit floating-point is generally > >>> lossy, too, and the other direction as well. > >>> > >>> What exactly would be messy? > >>> > >> Overheard of conversion is the primary concern here. Having to extract > >> and / or calculate the significand + exponent components in the kernel > >> is burdensome and imo a task better suited for user space. This also has > >> to be done every blob set, meaning that if user space is re-using > >> pre-calculated blobs we would be repeating the same conversion > >> operations in kernel space unnecessarily. > > > > What is burdensome in that calculation? I don't think you would need to > > use any actual floating-point instructions. Logarithm for finding the > > exponent is about finding the highest bit set in an integer and > > everything is conveniently expressed in base-2. Finding significand is > > just masking the integer based on the exponent. > > > Oh it definitely can be done, but I think this is just a difference of > opinion at this point. At the end of the day we will do it if we have > to, but it is just more optimal if a more agreeable common type is used. > > > Can you not cache the converted data, keyed by the DRM blob unique > > identity vs. the KMS property it is attached to? > If the userspace compositor has N common transforms (ex: standard P3 -> > sRGB matrix), they would likely have N unique blobs. Obviously from the > kernel end we wouldn't want to cache the transform of every blob passed > down through the UAPI. Hi Christoper, as long as the blob exists, why not? > > You can assume that userspace will not be re-creating DRM blobs without > > a reason to believe the contents have changed. If the same blob is set > > on the same property repeatedly, I would definitely not expect a driver > > to convert the data again. > If the blob ID is unchanged there is no issue since caching the last > result is already common. As you say, blobs are immutable so no update > is needed. I'd question why the compositor keeps trying to send down the > same blob ID though. To avoid hard-to-debug situations with userspace vs. kernel view of KMS state getting out of sync by a bug, for example. I did originally write such KMS state caching in Weston to avoid emitting unchanged state, but that was deemed unnecessary as the kernel side needs to do the same comparisons "anyway". > > If a driver does that, it seems like it > > should be easy to avoid, though I'm no kernel dev. Even if the > > conversion was just a memcpy, I would still posit it needs to be > > avoided when the data has obviously not changed. Blobs are immutable. > > > Userspace having to use hardware-specific number formats would probably > > not be well received. > > > To be clear, I am not asking user space to use custom value packing made > purely for the hardware's benefit (this sounds like a problem just > waiting to happen). Just support in the color pipeline UAPI for common > numerical data types such as 16-bit floats. That said... I wonder if there actually is a significant difference between converting float<->float and int<->float if everything else is equally fine. It's possible that requirements on range and precision do call for both types in UAPI, then we obviously need both. > >> I agree normalization of the value causing precision loss and rounding > >> we can't avoid. > >> > >> We should also consider the fact that float pipelines have been known to > >> use the scrgb definition for floating point values > >> (https://registry.khronos.org/EGL/extensions/EXT/EGL_EXT_gl_colorspace_scrgb_linear.txt). > > > > scRGB is as good a definition of color encoding as "floating-point" is > > for numbers. What I mean is that it carries very little usable meaning, > > and without further information it is practically boundless > > - infinite - in both color gamut and dynamic range. Just like any > > floating-point quantity. > > > > However, what we want from KMS color pipeline is zero implied or > > defined meaning. That means scRGB carries too much meaning, because the > > primaries are fixed and (1.0, 1.0, 1.0) should match sRGB/SDR white. > > > Btw. if one brings in nit units, you assume a specific viewing > > environment which is rarely true in reality. I'll leave that rabbit > > hole for another time. I just want to mention that nit (cd/m²) is a > > unit that is relative to the chosen viewing environment when your goal > > is a specific perception of brightness. > > > >> In cases like this where there may be a expected value range in the > >> pipeline, how to normalize a larger input becomes a little confusing. Ex > >> - Does U32 MAX become FP16 MAX or value MAX (i.e 127). > > > > UAPI simply needs to specify the number encoding used in the UAPI, how > > bit patterns map to real numbers. Real numbers are then what the color > > pipeline operates on. > > > If we plan to have the color pipeline UAPI expose these details then I > am satisfied. Very good. I do not see how else it could even work. Thanks, pq > > However, intermediate value representation used between two KMS colorop > > blocks is never observable to userspace. All userspace needs to know is > > the usable value range and precision behaviour. I think that is best > > defined for the input and output of each block rather than what flows > > in between, because an optional (e.g. LUT) block when bypassed does not > > impose its limitations. > > > Sure. Everything in between can be inferred from the pipeline. > > > What does 1.0 actually mean, that is left for userspace to use however > > it wishes. There are only pipeline boundary conditions to that: the > > input to a pipeline comes from a DRM FB, so it has a number encoding > > specified mostly by pixel format, and an arbitrary colorimetric > > encoding that only userspace knows. The output of the pipeline has to > > be standardised so that drivers can number-encode the pipeline output > > correctly to wire format on e.g. HDMI. Userspace alone is responsible > > for making sure the colorimetry matches what the sink expects. > > > > Individual KMS color pipeline colorop blocks need to define their own > > acceptable input and output ranges. E.g. a look-up table may assume > > that it's input is in [0.0, 1.0] and anything outside is clamped to > > that range. That poses restrictions on how userspace can use the block. > > > >>>> > >>>>> Exposing the actual hardware precision is something we've talked about during > >>>>> the hackfest. It'll probably be useful to some extent, but will require some > >>>>> discussion to figure out how to design the uAPI. Maybe a simple property is > >>>>> enough, maybe not (e.g. fully describing the precision of segmented LUTs would > >>>>> probably be trickier). > >>>>> > >>>>> I'd rather keep things simple for the first pass, we can always add more > >>>>> properties for bit depth etc later on. > >>>>> > >>>> Indicating if a block operates on / with fixed vs float values is > >>>> significant enough that I think we should account for this in initial > >>>> design. It will have a affect on both the user space value packing + > >>>> expected value ranges in the hardware. > >>> > >>> What do you mean by "value packing"? Memory layout of the bits forming > >>> a value? Or possible exact values of a specific type? > > >> Both really. If the kernel is provided a U32 value, we need to know if > >> this is a U32 value, or a float packed into a U32 container. Likewise as > >> mentioned with the scRGB above, float could even adjust the value range > >> expectations. > > > > Right. The UAPI will simply define that. > > > Great! > > Thanks, > Christopher > > >>> I don't think fixed vs. float is the most important thing. Even fixed > >>> point formats can have different numbers of bits for whole numbers, > >>> which changes the usable value range and not only precision. Userspace > >>> at the very least needs to know the usable value range for the block's > >>> inputs, outputs, and parameters. > >>> > >>> When defining the precision for inputs, outputs and parameters, then > >>> fixed- vs. floating-point becomes meaningful in explaining what "N bits > >>> of precision" means. > >>> > >>> Then there is the question of variable precision that depends on the > >>> actual block input and parameter values, how to represent that. Worst > >>> case precision might be too pessimistic alone. > >>> > >> Agreed. More information probably is needed to full define the interface > >> expectations. > >> > >>>>>>> Here is another example with a 3D LUT: > >>>>>>> > >>>>>>> Color operation 42 > >>>>>>> ├─ "type": enum {Bypass, 3D LUT} = 3D LUT > >>>>>>> ├─ "lut_size": immutable range = 33 > >>>>>>> ├─ "lut_data": blob > >>>>>>> └─ "next": immutable color operation ID = 43 > >>>>>>> > >>>>>> We are going to need to expose the packing order here to avoid any > >>>>>> programming uncertainty. I don't think we can safely assume all hardware > >>>>>> is equivalent. > >>>>> > >>>>> The driver can easily change the layout of the matrix and do any conversion > >>>>> necessary when programming the hardware. We do need to document what layout is > >>>>> used in the uAPI for sure. > >>>>> > >>>>>>> And one last example with a matrix: > >>>>>>> > >>>>>>> Color operation 42 > >>>>>>> ├─ "type": enum {Bypass, Matrix} = Matrix > >>>>>>> ├─ "matrix_data": blob > >>>>>>> └─ "next": immutable color operation ID = 43 > >>>>>>> > >>>>>> It is unclear to me what the default sizing of this matrix is. Any > >>>>>> objections to exposing these details with an additional property? > >>>>> > >>>>> The existing CTM property uses 9 uint64 (S31.32) values. Is there a case where > >>>>> that wouldn't be enough? > >>>> > >>>> Larger cases do exist, but as you mention this can be resolved with a > >>>> different type then. I don't have any issues with the default 'Matrix' > >>>> type being 9 entries. > >>> > >>> Please, tell us more. How big, and what are they used for? > >>> > >>> IIRC ICC has 3x3 matrix + offset vector. Do you have even more? > >>> > >>> > >> Offset is one. Range adjustment 'vector' is another. But ultimately this > >> proposal is flexible enough that this can probably just be another color > >> block in the pipeline. No complaints from me here. > > > > What is a range adjustment vector? A vector of a multiplier per color > > channel? Does it include offset? > > > > Yes, sounds like just another block. > > > >>>>> > >>>>>> Dithering logic exists in some pipelines. I think we need a plan to > >>>>>> expose that here as well. > >>>>> > >>>>> Hm, I'm not too familiar with dithering. Do you think it would make sense to > >>>>> expose as an additional colorop block? Do you think it would have more > >>>>> consequences on the design? > >>> > >>> I think it would be an additional block, and no other consequences, be > >>> it temporal and/or spatial dithering, as long as it does not look at > >>> neighbouring pixels to determine the output for current pixel. > >>> > >>>>> > >>>>> I want to re-iterate that we don't need to ship all features from day 1. We > >>>>> just need to come up with a uAPI design on which new features can be built on. > >>>>> > >>>> > >>>> Agreed. I don't think this will affect the proposed design so this can > >>>> be figured out once we have a DRM driver impl that declares this block. > >>>> > >>>>>>> [Simon note: an alternative would be to split the color pipeline into > >>>>>>> two, by > >>>>>>> having two plane properties ("color_pipeline_pre_scale" and > >>>>>>> "color_pipeline_post_scale") instead of a single one. This would be > >>>>>>> similar to > >>>>>>> the way we want to split pre-blending and post-blending. This could be less > >>>>>>> expressive for drivers, there may be hardware where there are dependencies > >>>>>>> between the pre- and post-scaling pipeline?] > >>>>>>> > >>>>>> As others have noted, breaking up the pipeline with immutable blocks > >>>>>> makes the most sense to me here. This way we don't have to predict ahead > >>>>>> of time every type of block that maybe affected by pipeline ordering. > >>>>>> Splitting the pipeline into two properties now means future > >>>>>> logical splits would require introduction of further plane properties. > >>>>> > >>>>> Right, if there are more "breaking points", then we'll need immutable blocks > >>>>> anyways. > >>>>> > >>>>>>> Then, Alex from NVIDIA described how their hardware works. NVIDIA hardware > >>>>>>> contains some fixed-function blocks which convert from LMS to ICtCp and > >>>>>>> cannot > >>>>>>> be disabled/bypassed. NVIDIA hardware has been designed for descriptive > >>>>>>> APIs > >>>>>>> where user-space provides a high-level description of the colorspace > >>>>>>> conversions it needs to perform, and this is at odds with our KMS uAPI > >>>>>>> proposal. To address this issue, we suggest adding a special block type > >>>>>>> which > >>>>>>> describes a fixed conversion from one colorspace to another and cannot be > >>>>>>> configured by user-space. Then user-space will need to accomodate its > >>>>>>> pipeline > >>>>>>> for these special blocks. Such fixed hardware blocks need to be well enough > >>>>>>> documented so that they can be implemented via shaders. > >>>>>>> > >>>>>> A few questions here. What is the current plan for documenting the > >>>>>> mathematical model for each exposed block? Will each defined 'type' enum > >>>>>> value be locked to a definition in the kernel documents? As an example, > >>>>>> when we say '3D LUT' in this proposal does this mean the block will > >>>>>> expose a tri-linear interpolated 3D LUT interface? Is '1D Curve' a > >>>>>> direct in to out LUT mapping? > >>>>> > >>>>> I think we'll want to document these things, yes. We do want to give _some_ > >>>>> slack to drivers, so that they can e.g. implement the "1D LUT" colorop via > >>>>> hardware segmented LUTs with a different number of elements per LUT segment. > >>>>> But being mathematically precise (probably with formulae in the docs) is > >>>>> definitely a goal, and absolutely necessary to implement a shader-based > >>>>> fallback. > >>>> > >>>> I agree some driver slack is necessary, however ideally this will be > >>>> locked down enough that from the compositor side they see "1D LUT" and > >>>> know exactly what to expect independent of the hardware. This way > >>>> regardless of if I am running on a NVIDIA / AMD / QCOM / etc... chip, > >>>> common color pipeline strategies can be used. Assuming a perfect world > >>>> where there is a workable overlap between chips of course. > >>> > >>> Yes, of course, at least for a start. > >>> > >>> However, the long term plan includes a shared userspace library with > >>> driver- and hardware-specific knowledge to use hardware- and > >>> driver-specific blocks. All blocks still need to be explicitly > >>> specified in the kernel UAPI documentation, the idea is that it should > >>> not be a problem for many vendors to have blocks no-one else does. The > >>> library would offer a much more generic API, and use snowflake blocks > >>> to their fullest. The library would also spit out OpenGL shaders and > >>> whatnot for the fallback. > >>> > >>> The future in the long term could be either way: evolving towards > >>> generic KMS UAPI blocks with no need for a userspace library > >>> abstraction, or evolving towards hardware-specific KMS UAPI blocks with > >>> a userspace library to abstract them like Mesa does for GPUs. > >>> > >> Sounds good to me! > > > > Awesome! > > > > > > Thanks, > > pq
Attachment:
pgpknoa8qfkIz.pgp
Description: OpenPGP digital signature