Re: [RFC] drm/amdgpu: Add macros and documentation for format modifiers.

Christian König <christian.koenig@xxxxxxx> · Tue, 4 Sep 2018 20:06:05 +0200

Am 04.09.2018 um 20:00 schrieb Bas Nieuwenhuizen:
On Tue, Sep 4, 2018 at 7:57 PM Bas Nieuwenhuizen
<bas@xxxxxxxxxxxxxxxxxxx> wrote:
On Tue, Sep 4, 2018 at 7:48 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 04.09.2018 um 18:37 schrieb Daniel Vetter:
On Tue, Sep 4, 2018 at 5:52 PM, Bas Nieuwenhuizen
<bas@xxxxxxxxxxxxxxxxxxx> wrote:
On Tue, Sep 4, 2018 at 4:43 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Tue, Sep 4, 2018 at 3:33 PM, Bas Nieuwenhuizen
<bas@xxxxxxxxxxxxxxxxxxx> wrote:
On Tue, Sep 4, 2018 at 3:04 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Tue, Sep 04, 2018 at 02:33:02PM +0200, Bas Nieuwenhuizen wrote:
On Tue, Sep 4, 2018 at 2:26 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Tue, Sep 04, 2018 at 12:44:19PM +0200, Christian König wrote:
Am 04.09.2018 um 12:15 schrieb Daniel Stone:
Hi,

On Tue, 4 Sep 2018 at 11:05, Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
On Tue, Sep 4, 2018 at 3:00 AM, Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx> wrote:
+/* The chip this is compatible with.
+ *
+ * If compression is disabled, use
+ *   - AMDGPU_CHIP_TAHITI for GFX6-GFX8
+ *   - AMDGPU_CHIP_VEGA10 for GFX9+
+ *
+ * With compression enabled please use the exact chip.
+ *
+ * TODO: Do some generations share DCC format?
+ */
+#define AMDGPU_MODIFIER_CHIP_GEN_SHIFT                 40
+#define AMDGPU_MODIFIER_CHIP_GEN_MASK                  0xff
Do you really need all the combinations here of DCC + gpu gen + tiling
details? When we had the entire discussion with nvidia folks they
eventually agreed that they don't need the massive pile with every
possible combination. Do you really plan to share all these different
things?

Note that e.g. on i915 we spec some of the tiling depending upon
buffer size and buffer format (because that's how the hw works), not
using explicit modifier flags for everything.
Right. The conclusion, after people went through and started sorting
out the kinds of formats for which they would _actually_ export real
colour buffers for, that most vendors definitely have fewer than
115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936
possible formats to represent, very likely fewer than
340,282,366,920,938,463,463,374,607,431,768,211,456 formats, probably
fewer than 72,057,594,037,927,936 formats, and even still generally
fewer than 281,474,976,710,656 if you want to be generous and leave 8
bits of the 56 available.
The problem here is that at least for some parameters we actually don't know
which formats are actually used.

The following are not real world examples, but just to explain the general
problem.

The memory configuration for example can be not ASIC specific, but rather
determined by whoever took the ASIC and glued it together with VRAM on a
board. It is not likely that somebody puts all the VRAM chips on one
channel, but it is still perfectly possible.

Same is true for things like harvesting, e.g. of 16 channels halve of them
could be bad and we need to know which to actually use.
For my understanding: This leaks outside the chip when sharing buffers?
All the information you only need locally to a given amdgpu instance
don't really need to be encoded in modifiers.

Pointers to code where this is all decided (kernel and radeonsi would be
good starters I guess) would be really good here.
I extracted the information on which bits are relevant mostly from the
AddrFromCoord functions in addrlib in mesa:

for macro-tiles:
https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/addrlib/r800/egbaddrlib.cpp#L1587

for micro-tiles (or the micro-tiles in macro-tiles):

https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/addrlib/core/addrlib1.cpp#L3016
So this is the decoding thing. How many of these actually exist, even when
taking all the other information into account?

E.g. given a platform + memory config (seems needed) + drm_fourcc + stride
+ height + width, how much of all these bits do you actually still freely
pick?
Basically you pick ARRAY_MODE (linear, micro-tile, macro-tile, sparse,
thick variants of macro-tile), MICRO_TILE_MODE(display, non-display,
depth, display-rotated) + whether to use compression, everything else
is fixed given those option, the properties of the chip and the
format.

It might be that all the things you need to know from the memory config
don't encode smaller than the macro/micro/whatever else stuff. But that's
kinda the angle that we looked at this for everyone else.

E.g. for multi-plane stuff, if everyone picks the same config for the
2nd/3rd plane, then you don't actually need to encode that. It just
becomes part of the implied stuff in the modifier.
The problem is some GPUs are compatible for say 8-bpp images, but not
for 32-bpp surfaces. e.g. lets look at the following table showing the
current configuration for all GFX6-GFX8 GPU:

format: (bank width, bank height, macro tile aspect, num banks) for
8-bpp, 16-bpp and 32 bpp single-sample followed by the PIPE_CONFIG

verde: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 2, 16) ADDR_SURF_P4_8x16
oland: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 2, 16) ADDR_SURF_P4_8x16
hainan: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 2, 16) ADDR_SURF_P2
tahiti/pitcairn: (1, 4, 1, 16) (1, 2, 1, 16) (1, 1, 1, 16)
ADDR_SURF_P8_32x32_8x16
bonaire: (1,4,4,16) (1, 2, 4, 16) (1, 1, 2, 16) ADDR_SURF_P4_16x16
hawaii: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 1, 16) ADDR_SURF_P16_32x32_16x16
CIK APUs: (1,4,4, 8), (1,2,4,8),  (1, 2, 2, 8) ADDR_SURF_P2
topaz: (4, 4, 2, 8) (4, 4, 2, 8) (2, 4, 2, 8) ADDR_SURF_P2
fiji: (1, 4, 2, 8) (1, 4, 2, 8) (1, 4, 2, 8) ADDR_SURF_P16_32x32_16x16
tonga:: (1, 4, 4, 16), (1, 4, 4, 16) (1, 4, 4, 16) ADDR_SURF_P8_32x32_16x16
polaris11/12: (1, 4, 4, 16), (1, 4, 4, 16) (1, 4, 4, 16) ADDR_SURF_P4_16x16
polaris10: (1, 4, 4, 16), (1, 4, 4, 16) (1, 4, 4, 16) ADDR_SURF_P8_32x32_16x16
stoney,carrizo: (1, 4, 4, 8) (1, 2, 4, 8), (1, 1, 2, 8) ADDR_SURF_P2

We see here that e.g. Stoney and the CIK APUs are compatible for 8-bpp
and 16-bpp, but not 32-bpp. or bonaire and polaris11/12 are only
compatible for 8-bpp.

So we can't just assume that if the first plane properties match that
they do so for the second plane, because we don't know what GPU it is
coming from.

And we can make a canonical table from this but then if we change the
above tables in the kernel that runs into compatibility issues.
I think that's roughly what nvidia ended up doing (not sure they
published anything, haven't seen any patches on dri-devel). Two parts:
- Bunch of flags/bits for the major mode, like {display, non-display,
DCC} and so on.
- For each of the major modes a list of enumerated modes actually used
in reality. For DCC this would be the generation, for more plain
formats it would be the above table. Table entries would be per-bpp
(in case that matters, or whatever matters for amd gpus).

Imo would be totally ok to put the relevant lookup tables into
drm_fourcc.h directly. Or any other suitable place.
Right, the issue would be that the tables are configured int he kernel

https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c?h=amd-staging-drm-next#n417

and having a shared hardcoded table means that if the kernel tables
change we break userspace. That would be a new commitment on the side
of the kernel driver.

(If I can get an okay for that commitment from Alex/Christian I'd be
totally happy to switch it over)
Yeah it'd make that table uapi
That won't work. Those lists are not necessary stable.

What we could do is to use PCI-ID plus tile_mode/macro_tile_mode index
plus a 8bit version number.
But if we use PCI-ID, then we cannot really share between different
models of GPU ...

So we have seen 3 options on design level:
1) Use PCI-ID/chip.
   Advantage:
     -simple
     - Few bits
   Disadvantage:
     - No sharing between different GPU models
2) Use a fixed lookup/canonicalization table.
   Advantage:
     -Few bits
     - Sharing between models
   Disadvantage:
     - Cannot change the kernel tiling tables.
A fourth option would actually be to put this fixed table in the
kernel and have an ioctl so userspace can ask for it. That way we
avoid the kernel update issue.

That is already present. The table is actually only used for GFX.

MM, SDMA, DCE have separate ways of configuring that and so need the 
config for each entry.

3) Encode tiling mode properties in modifier.
   Advantage:
     - Can share between models.
     - Can change the kernel tiling tables.
   Disadvantage:
     - Use a lot of bits.

Overall the number of exposed modifiers should be pretty much the
same, so encoding size that we have to deal with all possible bit
patterns.

I've implemented option 3 in the patch here because I think it is the
best tradeoff. As far as I can see it fits, and especially if we do no
multisample support and remove the tile split bytes, and remove the
3rd set of plane macro-tile options because no format has 3 different
bitdepths for the 3 planes, we are left with using 37 bits, leaving 19
for future extensions and a version number. It is not great but pretty
doable?

Yeah, the issue is you seem to look only at the public available bits of 
addrlib.

That might not be enough to get a complete picture, but I think for 
sharing between processes it should be enough.

Regards,
Christian.

Shouldn't be more than 32 bits in total.

Regards,
Christian.

   - well the compressed from with
duplicates removed, if you really want to make it tiny. But you can
just add new entries if the heuristics for a given platform changes,
as long as you keep the old entries working (if you shipped them ever
ofc, otherwise totally ok to change as you see fit).
-Daniel

  From a quick look about 16 bits total. Gives you 40bits of second
chances and "oops totally sorry".
I purposefully left 8 bits as a version field for those oopses.

-Daniel

-Daniel

-Daniel

Regards,
Christian.

If you do use 256 bits in order to represent
115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936
modifiers per format, userspace would start hitting OOM pretty quickly
as it attempted to enumerate and negotiate acceptable modifiers.
Either that or we need to replace the fixed 64-bit modifier tokens
with some kind of eBPF script.

Cheers,
Daniel
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel