+1. It would help to have an explicit CCS modifier should an unforeseen corner case arise.
Also, as Ken mentioned, CPU mapped accesses require that compressed images be resolved beforehand. By having both a 4_TILED modifier and a 4_TILED_XE2_CCS modifier, applications can avoid performance penalties
if they would like to do frequent CPU mapped accesses.
-Nanley
From: Zhang, Jianxun <jianxun.zhang@xxxxxxxxx>
Sent: Friday, July 19, 2024 4:17 PM
To: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>; Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>; Juha-Pekka Heikkila <juhapekka.heikkila@xxxxxxxxx>; Graunke, Kenneth W <kenneth.w.graunke@xxxxxxxxx>; Roper, Matthew D <matthew.d.roper@xxxxxxxxx>
Cc: intel-xe@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Chery, Nanley G <nanley.g.chery@xxxxxxxxx>; Saarinen, Jani <jani.saarinen@xxxxxxxxx>; Souza, Jose <jose.souza@xxxxxxxxx>; Mathew, Alwin <alwin.mathew@xxxxxxxxx>; Syrjala, Ville <ville.syrjala@xxxxxxxxxxxxxxx>;
Nikula, Jani <jani.nikula@xxxxxxxxx>
Subject: Re: [RFC PATCH 0/3] Introducing I915_FORMAT_MOD_4_TILED_XE2_CCS Modifier for Xe2
From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
Sent: Tuesday, May 14, 2024 9:51 AM
To: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>;
Juha-Pekka Heikkila <juhapekka.heikkila@xxxxxxxxx>;
Graunke, Kenneth W <kenneth.w.graunke@xxxxxxxxx>; Roper,
Matthew D <matthew.d.roper@xxxxxxxxx>
Cc: intel-xe@xxxxxxxxxxxxxxxxxxxxx <intel-xe@xxxxxxxxxxxxxxxxxxxxx>;
intel-gfx@xxxxxxxxxxxxxxxxxxxxx <intel-gfx@xxxxxxxxxxxxxxxxxxxxx>;
Chery, Nanley G <nanley.g.chery@xxxxxxxxx>; Saarinen, Jani
<jani.saarinen@xxxxxxxxx>; Souza, Jose <jose.souza@xxxxxxxxx>;
Mathew, Alwin <alwin.mathew@xxxxxxxxx>; Zhang, Jianxun <jianxun.zhang@xxxxxxxxx>;
Syrjala, Ville <ville.syrjala@xxxxxxxxxxxxxxx>; Nikula,
Jani <jani.nikula@xxxxxxxxx>
Subject: Re: [RFC PATCH 0/3] Introducing I915_FORMAT_MOD_4_TILED_XE2_CCS Modifier for Xe2
On Tue, 2024-05-14 at 12:25 +0300, Joonas Lahtinen wrote:
> Quoting Kenneth Graunke (2024-05-11 03:58:34)
> > On Tuesday, May 7, 2024 3:56:57 PM PDT Matt Roper wrote:
> > > On Mon, May 06, 2024 at 09:52:35PM +0300, Juha-Pekka Heikkila
> > > wrote:
> > > > These patches introduce I915_FORMAT_MOD_4_TILED_XE2_CCS
> > > > modifier, which,
> > > > from the kernel's perspective, behaves similarly to
> > `I915_FORMAT_MOD_4_TILED`.
> > > > This new modifier is primarily intended for user space to
> > > > effectively
> > monitor
> > > > compression status, especially when dealing with a mix of
> > > > compressed and
> > > > uncompressed buffers.
> > > >
> > > > The addition of this modifier facilitates user space in
> > > > managing
> > compression
> > > > status, particularly when utilizing both compressed and
> > > > uncompressed
> > buffers
> > > > concurrently. To leverage compression for these buffers, user
> > > > space
> > > > applications must configure the appropriate Page Attribute
> > > > Table (PAT)
> > index.
> > > > Display engine will treat all Tile4 as if it were compressed
> > > > under all
> > > > circumstances on Xe2 architecture.
> > >
> > > I may have missed some discussion about this, but I thought the
> > > previous
> > > consensus was that we didn't want/need new modifiers for
> > > compression on
> > > Xe2? If a userspace client (or the display hardware) receives a
> > > buffer
> > > of unknown origin and unknown compression status, it's always
> > > fine to
> > > select a compressed PAT when binding the buffer to read since
> > > even for
> > > uncompressed buffers the CCS metadata will accurately reflect the
> > > compression status. Unlike Xe1, where generating content without
> > > compression enabled would leave random garbage in the FlatCCS
> > > area, Xe2
> > > will set the corresponding FlatCCS to '0x0' for each block,
> > > indicating
> > > uncompressed data.
> > >
> > > Can you explain more what the benefit of handling these modifiers
> > > explicitly is?
I am not sure if there is another nice way to address this case in the eco-system without an explicit modifier even if it is not necessary from kernel driver aspect, so I just put it here to get
some insights.
I assume we still want to support existing modifiers like I915_FORMAT_MOD_4_TILED, then the modifier extension of Vulkan becomes relevant (https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_drm_format_modifier.html).
In the section of export, the spec depicts the driver can choose the optimal modifier to create the image from a list of modifiers application provides, which could be all modifiers reported by driver from a prior query. Then the application will export the
modifier along with other necessary meta data and buffer. For these legacy modifiers like TILE_4, we can disable compression on it since it is defined without compression schemas in drm_fourcc.h, or we can keep the image compressed internally but need to uncompress
it before exporting. Either of the two options has a performance penalty, but this won't be a concern when we have an explicit TILE_4+compression modifier (e.g. DG2, MTL) to choose. The rest of export-import process has no ambiguity on compression support,
so no resolve or special treatment is needed. Without such modifier, what should driver assign to image or report back to application? In this case we can only have these legacy modifiers as options with penalty.
There was also a discussion on assuming shared data always compressed between both ends to eliminate the need of a new modifier on Xe2, but it has to be defined between the parties in design and
cannot guarantee data is still sharable between different devices.
It seems to me that the drm modifier has been there from a while, always portraited as beneficial enhancements and get promoted, so more apps would be on the path with modifiers by default. On
the other hand, even if it turns out that this explicit modifier is not useful, it won't hurt anyway as the last drm modifier inherited by later Hardwares. It seems still better and safer to have it then getting caught by some corner cases on the road.
If there is another logical path to deal it without such modifier, that will be great too.
> > >
> > >
> > > Matt
> >
> > Thanks, Matt! I'm a bit late in getting up to speed with the Xe2
> > compression
> > changes; this is really good information.
> >
> > As I understand it...all blocks on the GPU behave in the way you
> > mentioned,
> > where generating uncompressed data via the GPU will set FlatCCS =
> > 0, so you
> > can assume a compressed PAT entry and everything works.
> >
> > One snag is...I've heard that CPU access doesn't work that way.
> > So, if you
> > mmap a buffer on the CPU, and write data with the CPU, then I think
> > we're back
> > to the "FlatCCS contains uninitialized garbage" case, where it's
> > unsafe to
> > assume a compressed PAT. And... we don't really know when sharing
> > buffers
> > whether the other side is going to want to do CPU access.
>
> I think the previous discussion has specifically happened in the
> context of
> dma-buf, so not only CPU but other GPUs/accelerators/decoders/devices
> in the
> system are also relevant.
>
> It boils down to the fact that when exporting a dma-buf, one can't
> know it will
> be consumed only by the same GPU (or other device for that matter)
> unless there
> is an explicit negotiation between exporter and importers.
>
> > It would be really nice to assume compression by default, though,
> > which got me
> > thinking: if we mmap a buffer via DRM_XE_GEM_MMAP_OFFSET, could
> > xe.ko disable
> > compression for us? So, resolve any outstanding CCS data, and then
> > switch any
> > PAT entries to uncompressed. Mapping would block until that
> > resolve is done.
> > It could leave compression off forever (once you CPU map a buffer,
> > it's never
> > compressed again). Or it could turn CCS back on when map count
> > reaches 0 (but
> > frankly I'm not sure that's terribly important, and sounds more
> > complex).
>
> This would only really work for a single device but the dma-buf is
> specifically targeting more generic sharing. There's no built-in
> mechanism
> to limit the sharing to subset of devices without explicit
> negotiation
> between the exporter and importers.
>
> So I think the "by default" mode needs to be interoperable, and the
> explicit negotiation can then use less compatible formats given those
> FD
> are never passed to importers that were not part of the negotiation.
>
> > As I understand it, at least on discrete GPUs, the kernel already
> > has to do
> > something similar for eviction, when migrating BOs to system memory
> > (which
> > doesn't support compression). So this would be doing basically the
> > same
> > "resolve and disable CCS" step the kernel can presumably already
> > do, but now
> > on mmap as well.
>
> So far the eviction strategy has been to copy both the backing store
> and
> compression bits in raw form. With Xe2 it would indeed be possible to
> do
> an implicit resolve IFF the buffer has not been shared to someone who
> doesn't
> understand compression and might have left garbage in the CCS bits.
>
> When evicting in raw form, kernel doesn't have to know if the CCS
> bits
> are garbage or not on any given moment.
>
> Regards, Joonas
Just a follow-up comment (TBC), IF we're going for the everything-is-
compressed approach, I think there are some considerations to be made:
dma-buf exports to foreign devices need to resolve at map_attachment
time. Foreign devices are all devices that can't interpret the
compressed content.
dma-buf mmaps need to resolve. IIRC Could be implemented in the
DMA_BUF_IOCTL_SYNC callbacks that wrap cpu access.
dma-buf imports from foreign devices need to never use compressed PAT
for writing. Should KMD enforce this? Implicitly? Explicitly? I don't
see how UMD could know whether the imported dma-buf is from a foreign
device.
For mmaps of buffer objects of local device a resolve is needed. Having
KMD do that on a pagefault-basis is definitely possible, but will most
likely be terribly inefficient. Better to leave that to UMD?
/Thomas
>
> >
> > What do you think? Viable? Crazy? Have I missed something?
> >
> > --Ken
|