DRM/KMS PWL API Thoughts and Questions

Harry Wentland <harry.wentland@xxxxxxx> · Fri, 10 Jun 2022 16:29:13 -0400

(I'm sending this as an email as lowest common denominator but feel an issue on the color-and-hdr repo would be a better interface for productive discussion. Please pop over to https://gitlab.freedesktop.org/pq/color-and-hdr/-/issues/10 if you agree. Hopefully we can drive the discussion there but if there is a strong preference for email that works as well. :) )

I've wanted to start a thread to discuss the use of PWL APIs that were introduced by Uma a year ago and for which Bhanuprakash provided IGT tests. I have come to like the API but as we're getting closer to a real-world use of it I have a few questions and comments. As with a lot of complex APIs the devil is in the details. Some of those details are currently underspecified, or underdocumented and it's important that we all interpret the API the same way.

**The API**

The original patches posted by Uma:
https://patchwork.freedesktop.org/series/90822/
https://patchwork.freedesktop.org/series/90826/

The IGT tests for PWL API:
https://patchwork.freedesktop.org/series/96895/

I've rebased the kernel patches on a slightly more recent kernel, along with an AMD implementation:
https://gitlab.freedesktop.org/hwentland/linux/-/tree/color-and-hdr

I've also rebased them on an IGT tree, but that's not too up-to-date:
https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/tree/color-and-hdr

**Why do I like the API?**

In order to allow HW composition of HDR planes in linear space we need the ability to program at least a per-CRTC regamma (inv_EOTF) to go from linear to wire format post-blending. Since userspace might want to apply corrections on top of a simple transfer function (such as PQ, BT.709, etc.) it would like a way to set a custom LUT.

The existing CRTC gamma LUT defines equally spaced entries. As Pekka shows in [1] equally-spaced LUTs have unacceptable error for regamma/inv_EOTF. Hence we need finer granularity of our entries near zero while coarse granularity works fine toward the brighter values.

[1] https://gitlab.freedesktop.org/pq/color-and-hdr/-/merge_requests/9

HW (at least AMD and Intel HW) implements this ability as segmented piece-wise linear LUTs. These define segments of equally spaced entries. These segments are constrained by the HW implementation. I like how the PWL API allows different drivers to specify the constraints imposed by different HW while allowing userspace a generic way of parsing the PWL. This also avoids complex calculations in the kernel driver, which might be required for other APIs one could envision. If anyone likes I can elaborate on some ideas for an alternate API, though all of them will require non-trivial transformations by the kernel driver in order to program them to HW.

**Nitpicks**

The API defines everything inside the segments, including flags and values that apply to the entire PWL, such as DRM_MODE_LUT_GAMMA, DRM_MODE_LUT_REFLECT_NEGATIVE, input_bpc, and output_bpc. If these don't stay constant for segments it might complicate the interpretation of segments. I suggest we consider these as effectively applying to the entire PWL. We could encode them in an overall drm_color_lut struct that includes an array of drm_color_lut_range but that's probably not necessary, hence why I called this out as a nitpick. I would just like us to be aware of this ambiguity and document that these values applies to the entire PWL.

**How to read the PWL**

Let me first give a summary for how this LUT is used in userspace. If you're familiar with this please review and comment if I got things wrong. As I mentioned, a lot of this is underspecified at the moment so you're reading my interpretation.

You can see this behavior in plane_degamma_test [2] in the kms_color.c IGT test suite. I suggest the plane_degamma_test here here instead of the test_pipe_gamma test as the latter still has Intelisms (assumptions around Intel driver/HW behavior) and will not work for other drivers.

Iterate over all enums in PLANE_DEGAMMA_MODE and find a suitable one. How do we find the suitable one? More on that below.

Once we have the right PLANE_DEGAMMA_MODE we read the blob for the blob ID associated with the PLANE_DEGAMMA_MODE enum. We interpret the blob as an array of drm_color_lut_range. See get_segment_data [3].

We can think of our LUT/PWL as f(x) = y. For a traditional equally spaced LUT with 1024 entries x would be 0, 1, 2, ..., 1023. For a PWL LUT we need to parse the segment data provided in drm_color_lut_range.

Let's look at the 2nd-last entry of the nonlinear_pwl definition for the AMD driver [4] (I've correct it here and dropped the DRM_MODE_LUT_REUSE_LAST but it's still incorrect in the link) and simplify it to 4 entries for sake of readability:

{
        .flags = (DRM_MODE_LUT_GAMMA | DRM_MODE_LUT_REFLECT_NEGATIVE | DRM_MODE_LUT_INTERPOLATE | DRM_MODE_LUT_NON_DECREASING),
        .count = 4,
        .input_bpc = 13, .output_bpc = 18,
        .start = 1 << 12, .end = 1 << 13,
        .min = 0, .max = 1UL << 35
    },

We see we have 16 entries in the region from (1 << 12) to (1 << 13). We see input_bpc is 13, so our 1.0 float value is 1 << 13.

(Is it sensible to use input_bpc as our 1.0 floating point reference? Why?)

Since this segment is not reusing the last entry (doesn't have DRM_MODE_LUT_REUSE_LAST) we divide the region between 1 << 12 and 1 << 13 into 4 equally spaced sections.

step_size = (segment->end - segment->start) / count

In our case our segment spans from 4096 to 8192 and yields a step_size of 1024.

Note that we need to calculate this in floating point, otherwise we're not guaranteed equal spacing.

This gives us these X entries for this particular segment:
4096, 5120, 6144, 7168

And normalized to 1 << 13 (input_bpc) for our 1.0 float value we get
0.5, 0.625, 0.75, 0.875

If the segment had the REUSE_LAST flag the values would look like
4096, 5461, 6826, 8192

and normalized to
0.5, 0.666626, 0.833252, 1

Though in the case of REUSE_LAST a more sensible definition might be with a count of 5 entries in this segment, instead of 16. But ultimatly that's up to the driver.

I attached a simple C program that parses a PWL and helps illustrate my interpretation. You'll just need to copy-paste or include your PWL definition (instead of color_gamma.h), and point the PWL define to your PWL variable. To build it you'll need to copy the drm-uapi folder from my IGT repo into the same directory as pwl-parser.c and then just build it with "gcc -Idrm-uapi pwl-parser.c".

The above entries come from running pwl-parser with the nonlinear_pwl from [5].

To illustrate these I added versions of your lut1d_error scripts that run on the SDR and HDR PWLs for AMD HW [6].

We should probably document the above in detail in the DRM/KMS API docs.

[2] https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/blob/color-and-hdr/tests/kms_color.c#L978
[3] https://gitlab.freedesktop.org/hwentland/igt-gpu-tools/-/blob/color-and-hdr/tests/kms_color_helper.c#L393
[4] https://gitlab.freedesktop.org/hwentland/linux/-/blob/color-and-hdr/drivers/gpu/drm/amd/display/modules/color/color_gamma.h#L109
[5] https://gitlab.freedesktop.org/hwentland/linux/-/blob/color-and-hdr/drivers/gpu/drm/amd/display/modules/color/color_gamma.h#L38
[6] https://gitlab.freedesktop.org/hwentland/color-and-hdr/-/tree/precision/octave

**How to provide PWL entries?**

To provide the PWL values for each entry we'll have to sample our (custom) curve at the respective points specified by our PWL entries and providing those samples in a blob that is passed to PLANE_DEGAMMA_LUT. It's not much different from how we provide values for an equally spaced (legacy) LUT. As for sampling our curve, I remember seeing that Weston uses an LCMS function to sample the curve at required points. Last I checked it samples the curve at evenly spaced intervals but the LCMS function seemed to provide arbitrary sampling.

**How to pick a suitable PWL definition?**

Picking the right PWL definition out of the respective \_MODE enum isn't trivial. Without further information a userspace implementer has to understand the distribution of entries in all segments and perform a bunch of math to estimate the error for given curves.

A simpler approach might be if we defined common naming for our PWL enums. We can define the commonly expected cases. The two important parameters are within-range vs overrange entries, and linear vs non-linear outputs.

Within-range PWLs would cover [0.0, 1.0] entries and overrange [0.0, 128.0] to cover PQ and probably anything else. One could think of the within-range PWL as intended for SDR content and the over-range PWL for HDR content.

Linear PWLs would be intended for linear luminance outputs (or near-linear), and can be represented by equally spaced LUTs, such as a single-segment definitions. Non-linear PWLs would be intended for linear to non-linear transforms; Linear PWLs for non-linear to linear transforms or OOTFs.

This gives us four enums, plus one for bypass:
DRM_MODE_LUT_BYPASS
DRM_MODE_LUT_LINEAR_SDR
DRM_MODE_LUT_NONLINEAR_SDR
DRM_MODE_LUT_LINEAR_HDR
DRM_MODE_LUT_NONLINEAR_HDR

Drivers can provide appropriate PWLs based on the HW caps. Userspace can pick an appropriate one if it's available or fall back to either a sub-optimal PWL or to using a GPU transform instead.

Thoughts?

Thanks,
Harry
/*
 * Copyright 2022 Advanced Micro Devices, Inc.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
 * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
 * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
 * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 *
 * Authors: AMD
 *
 */

#define BIT(x) (1 << x)
#include "drm_mode.h"
#include "color_gamma.h"
#include <stdio.h>

#define PWL nonlinear_hdr_pwl

#define MAX_ENTRIES_PER_SEGMENT 512
#define MAX_ENTRIES 4096

int main(int argc, char *argv[])
{
	__u64 entries[MAX_ENTRIES];
	unsigned pwl_size = 0;
	unsigned segment_size = 0;
	unsigned segment_count = 0;
	unsigned i = 0;
	unsigned j = 0;
	unsigned overall_count = 0;
	__u64 norm = 0;

	pwl_size = sizeof(PWL);
	segment_size = sizeof(PWL[0]);
	segment_count = pwl_size / segment_size;

	printf("pwl_size = %d, segment_size = %d, segment_count = %d\n",
	       pwl_size, segment_size, segment_count);

	for (i = 0; i < segment_count; i++) {
		const struct drm_color_lut_range *segment = &PWL[i];
		__u64 segment_entries[MAX_ENTRIES_PER_SEGMENT];
		unsigned count = 0;
		__u64 step_size = 0;

		printf("Segment %d\n", i);
		printf("\t.flags = 0x%x\n", segment->flags);
		printf("\t.count = %d\n", segment->count);
		printf("\t.input_bpc = %d, .output_bpc = %d\n", segment->input_bpc, segment->output_bpc);
		printf("\t.start = 0x%lx, .end = %lx\n", segment->start, segment->end);
		printf("\t.min = 0x%lx, .max = %lx\n", segment->min, segment->max);

		norm = segment->input_bpc;

		if (segment->count == 1) {
			printf("\t- steps = %d\n", segment->start);
			printf("\t- normalized steps = %g\n", segment->start / ((1 << segment->input_bpc) + 0.0));
			entries[overall_count++] = segment->start;
			continue;
		}

		if (segment->flags & DRM_MODE_LUT_REUSE_LAST) {
			/*
			 * Entries are evenly spaced in [start, end]
			 * including "end" as an entry.
			 * 
			 * This is the same as count -1 entries when
			 * REUSE_LAST isn't used, with "end" added
			 * as the additional entry
			 */
			count = segment->count - 1;
		} else {
			/*
			 * entries are evenly spaced in [start, end]
			 * excluding "end" as an entry
			 * 
			 * this will look the same as a REUSE_LAST
			 * distribution of count + 1 with the last
			 * entry omitted
			 */
			count = segment->count;
		}

		step_size = (segment->end - segment->start) / (count);

		printf("\t- step_size count 0x%x, step_size 0x%x\n", count, step_size);

		for (j = 0; j < count; j++) {
			segment_entries[j] = segment->start + (j * step_size);
			entries[overall_count++] = segment_entries[j];
		}

		if (segment->flags & DRM_MODE_LUT_REUSE_LAST) {
			segment_entries[j] = segment->end;
			entries[overall_count++] = segment_entries[j];
		}

		/* print all entries */

		count = (segment->flags & DRM_MODE_LUT_REUSE_LAST) ? count + 1 : count;

		printf("\t- steps = ");
		for (j = 0; j < count; j++) {
			printf("%ld", segment_entries[j]);
			if (j < count-1)
				printf(", ");
		}
		printf("\n");

		printf("\t- normalized steps = ");
		for (j = 0; j < count; j++) {
			printf("%g", segment_entries[j] / ((1 << segment->input_bpc) + 0.0));
			if (j < count-1)
				printf(", ");
		}
		printf("\n");

	}

	printf("steps = ");
	for (i = 0; i < overall_count; i++) {
		printf("%ld", entries[i]);
		if (i < overall_count-1)
			printf(", ");
	}
	printf("\n");

	printf("normalized steps = ");
	for (i = 0; i < overall_count; i++) {
		printf("%g", entries[i] / ((1 << norm) + 0.0));
		if (i < overall_count-1)
			printf(", ");
	}
	printf("\n");
	printf("%d entries\n", overall_count);

	return 0;
}