On 12/28/2017 10:24 AM, Miguel Angel Vico wrote:
(Adding dri-devel back, and trying to respond to some comments from
the different forks)
James Jones wrote:
Your worst case analysis above isn't far off from our HW, give or take
some bits and axes here and there. We've started an internal discussion
about how to lay out all the bits we need. It's hard to even enumerate
them all without having a complete understanding of what capability sets
are going to include, a fully-optimized implementation of the mechanism
on our HW, and lots of test scenarios, though.
(thanks James for most of the info below)
To elaborate a bit, if we want to share an allocation across GPUs for 3D
rendering, it seems we would need 12 bits to express our
swizzling/tiling memory layouts for Fermi+. In addition to that,
Maxwell uses 3 more bits for this, and we need an extra bit to identify
pre-Fermi representations.
We also need one bit to differentiate between Tegra and desktop, and
another one to indicate whether the layout is otherwise linear.
Then things like whether compression is used (one more bit), and we can
probably get by with 3 bits for the type of compression if we are
creative. However, it'd be way easier to just track arch + page kind,
which would be like 32 bits on its own.
Not clear if this is an NV-only term, so for those not familiar: page
kind is, very loosely, our HW's internal equivalent of a format
modifier, used by its memory management subsystem. The value mappings
vary a bit with each HW generation.
Whether Z-culling and/or zero-bandwidth-clears are used may be another 3
bits.
If device-local properties are included, we might need a couple more
bits for caching.
We may also need to express locality information, which may take at
least another 2 or 3 bits.
If we want to share array textures too, we also need to pass the array
pitch. Is it supposed to be encoded in a modifier too? That's 64 bits on
its own.
So yes, as James mentioned, with some effort, we could technically fit
our current allocation parameters in a modifier, but I'm still not
convinced this is as future-proof as it could be as our hardware grows
in capabilities.
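
To make the bit budget concrete, here is a sketch of what packing the
fields above into the 56 vendor-defined bits of a modifier could look
like. The field names, widths, and positions are illustrative only,
not our actual encoding:

#include <stdint.h>

#define MOD_FIELD(val, shift, bits) \
    (((uint64_t)(val) & ((1ULL << (bits)) - 1)) << (shift))

static inline uint64_t
nv_modifier_pack(uint32_t swizzle,    /* 12 bits: Fermi+ swizzling/tiling  */
                 uint32_t mw_bits,    /*  3 bits: Maxwell additions        */
                 uint32_t pre_fermi,  /*  1 bit                            */
                 uint32_t tegra,      /*  1 bit: Tegra vs. desktop         */
                 uint32_t linear,     /*  1 bit                            */
                 uint32_t comp,       /*  1 bit: compression enabled       */
                 uint32_t comp_type,  /*  3 bits: "creative" encoding      */
                 uint32_t zcull_zbc,  /*  3 bits: Z-cull / zero-bw clears  */
                 uint32_t caching,    /*  2 bits: device-local caching     */
                 uint32_t locality)   /*  3 bits                           */
{
    /* 30 bits total, leaving 26 of the 56. Tracking arch + page kind
     * instead of the 3-bit compression-type trick would eat ~32 bits
     * by itself, and a 64-bit array pitch doesn't fit at all. */
    return MOD_FIELD(swizzle,    0, 12) |
           MOD_FIELD(mw_bits,   12,  3) |
           MOD_FIELD(pre_fermi, 15,  1) |
           MOD_FIELD(tegra,     16,  1) |
           MOD_FIELD(linear,    17,  1) |
           MOD_FIELD(comp,      18,  1) |
           MOD_FIELD(comp_type, 19,  3) |
           MOD_FIELD(zcull_zbc, 22,  3) |
           MOD_FIELD(caching,   25,  2) |
           MOD_FIELD(locality,  27,  3);
}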
Daniel Stone wrote:
So I reflexively
get a bit itchy when I see the kernel being used to transit magic
blobs of data which are supplied by userspace, and only interpreted by
different userspace. Having tiling formats hidden away means that
we've had real-world bugs in AMD hardware, where we end up displaying
garbage because we cannot generically reason about the buffer
attributes.
I'm a bit confused. Can't modifiers be specified by vendors and only
interpreted by drivers? My understanding was that modifiers could
actually be treated as opaque 64-bit data, in which case they would
qualify as "magic blobs of data". Otherwise, it seems this wouldn't be
scalable. What am I missing?
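
For reference, drm_fourcc.h itself already encodes them that way: an
8-bit vendor ID in the top byte, with the remaining 56 bits defined by
(and effectively opaque to everyone but) that vendor:

/* From drm_fourcc.h (abbreviated) */
#define DRM_FORMAT_MOD_VENDOR_NONE    0
#define DRM_FORMAT_MOD_VENDOR_INTEL   0x01
#define DRM_FORMAT_MOD_VENDOR_AMD     0x02
#define DRM_FORMAT_MOD_VENDOR_NV      0x03

#define fourcc_mod_code(vendor, val) \
        ((((__u64)DRM_FORMAT_MOD_VENDOR_ ## vendor) << 56) | \
         ((val) & 0x00ffffffffffffffULL))

Generic code can compare modifiers for equality and match them against
driver-advertised lists, but only the owning vendor's drivers interpret
the low 56 bits.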
Daniel Vetter wrote:
I think in the interim figuring out how to expose kms capabilities
better (and necessarily standardizing at least some of them which
matter at the compositor level, like size limits of framebuffers)
feels like the place to push the ecosystem forward. In some way
Miguel's proposal looks a bit backwards, since it adds the pitch
capabilities to addfb, but at addfb time you've allocated everything
already, so way too late to fix things up. With modifiers we've added
a very simple per-plane property to list which modifiers can be
combined with which pixel formats. Tiny start, but obviously very far
from all that we'll need.
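
(For anyone who hasn't dug into that property yet: it's the per-plane
IN_FORMATS blob, whose layout drm_fourcc.h defines as a header followed
by a format array and a modifier array, each modifier entry carrying a
bitmask of the format indices it can pair with:)

struct drm_format_modifier_blob {
        __u32 version;          /* == FORMAT_BLOB_CURRENT (1)          */
        __u32 flags;
        __u32 count_formats;    /* number of __u32 fourcc codes        */
        __u32 formats_offset;   /* offset, in bytes, from blob start   */
        __u32 count_modifiers;
        __u32 modifiers_offset;
};

struct drm_format_modifier {
        /* Bitmask of the formats (by index, starting at 'offset' into
         * the blob's format array) this modifier applies to. */
        __u64 formats;
        __u32 offset;
        __u32 pad;
        __u64 modifier;
};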
Not sure whether I might be misunderstanding your statement, but one of
the allocator's main features is negotiation of nearly optimal
allocation parameters, given a set of uses on different devices/engines,
via the capability merge operation. A client is expected to query what
every device/engine is capable of for the given uses, find the optimal
set of capabilities, and use it to allocate a buffer. By the time these
parameters are given to KMS, they are expected to be good. If they
aren't, the client didn't do things right.
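
In (heavily simplified) code, the flow looks like the following. Types
and function names approximate the prototype's API rather than quoting
its literal interface, and error handling is omitted:

capability_set_t *render_caps, *display_caps, *common;
allocation_t *alloc;

/* 1. Each device reports what it can do for the requested uses. */
render_caps  = device_get_capabilities(render_dev,  &assertion,
                                       n_render_uses,  render_uses);
display_caps = device_get_capabilities(display_dev, &assertion,
                                       n_display_uses, display_uses);

/* 2. The merge: intersect the sets, keeping only mutually supported
 *    (and ideally optimal) allocation parameters. */
common = derive_capabilities(render_caps, display_caps);

/* 3. Allocate with the merged set. Anything handed to KMS later has,
 *    by construction, already been validated against its capabilities. */
alloc = device_create_allocation(render_dev, &assertion, common);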
Rob Clark wrote:
It does seem like, if possible, starting out with modifiers for now at
the kernel interface would make life easier, vs trying to reinvent
both kernel and userspace APIs at the same time. Userspace APIs are
easier to change or throw away. Presumably by the time we get to the
point of changing kernel uabi, we are already using, and pretty happy
with, serialized liballoc data over the wire in userspace so it is
only a matter of changing the kernel interface.
I guess we can indeed start with modifiers for now, if that's what it
takes to get the allocator mechanisms rolling. However, it seems to me
that we won't be able to encode with modifiers, in all cases, the same
type of information included in capability sets. For instance, if we
end up encoding usage transition information in capability sets, how
would that translate to modifiers?
I assume display doesn't really care about a lot of the data capability
sets may encode, but is it correct to think of modifiers as things only
display needs? If we are to treat modifiers as a first-class citizen, I
would expect to use them beyond that.
Right, this becomes a lot more interesting when modifiers or capability
sets start getting used to share things from Vulkan<->Vulkan, for
example. Of course, we don't need to change kernel ABIs for that, but
wayland protocols, Vulkan extensions, etc. might need modification.
Regardless, I agree with Miguel's sentiment. Let's at least defer this
debate a bit until we know more about what capability sets look like.
If modifiers alone still seem sufficient, so be it.
Kristian Kristensen wrote:
I agree and let me elaborate a bit. The problem we're seeing isn't that we
need more than 2^56 modifiers for a future GPU. The problem is that flags
like USE_SCANOUT (which your allocator proposal essentially keeps) are
inadequate. The available tiling and compression formats vary with which
(in KMS terms) CRTC you want to use, which plane you're on, whether you
want rotation or not, how much you want to scale, etc. It's not realistic to
think that we could model this in a centralized allocator library that's
detached from the display driver. To be fair, this is not a point about
blobs vs modifiers, it's saying that the use flags don't belong in the
allocator, they belong in the APIs that will be using the buffer - and not
as literal use flags, but as a way to discover supported modifiers for a
given use case.
Why detached from the display driver? I don't see why there couldn't be
an allocator driver with access to display capabilities that can be
used in the negotiation step to find the optimal set of allocation
parameters.
In addition, speaking to some other portions of your response, most of
the usage in the prototype is placeholder stuff for testing.
USE_SCANOUT is partially expanded to include orientation as well, which
helps in some cases on our hardware. If there's more complex stuff for
other display hardware, it needs to be expanded further, or that HW is
free to expose a vendor-specific usage, since usage is extensible. It's
easy to mirror in all the relevant usage flags from other APIs or
engines too. That's a rather small amount of duplication.
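
To make "extensible" a bit more concrete, the rough shape is a
vendor-namespaced header plus a usage-specific payload. Field names and
layout here are illustrative, not the prototype's literal structs:

#include <stdint.h>

typedef struct {
    uint32_t vendor;   /* 0 == common/cross-vendor usages            */
    uint16_t usage;    /* e.g., USAGE_DISPLAY, USAGE_TEXTURE, ...    */
    uint16_t length;   /* bytes of usage-specific data following     */
} usage_header_t;

/* The expanded scanout usage mentioned above, carrying orientation */
typedef struct {
    usage_header_t header;
    uint32_t rotation_types;   /* bitmask of 0/90/180/270            */
} usage_display_t;

A driver that doesn't recognize a given (vendor, usage) pair can simply
skip 'length' bytes, so vendor-specific usages coexist with common ones.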
The important part is the logic that selects optimal usage. I don't
think it's possible to select optimal usage with the queries spread
around all the APIs. Vulkan isn't going to know about video encode
usage. In many situations it won't know about display usage. It just
knows optimal texture/render usage. Therefore it can't optimize
parameters for usage it doesn't know about. A centralized allocator
can, especially when all the usage ends up delegated to a single
device/GPU. It will have all the same information available to it on
the back end because it can access DRM devices, v4l devices, etc. to
query their capabilities via allocator backends, but it can have more
information available on the front end from the app, and a more complete
solution returned from a driver that is able to parse and consider that
additional information.
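
As a sketch of that merge step (all names hypothetical), it is
essentially a set intersection followed by a policy-driven sort:

/* Keep only capabilities every consumer supports, then order the
 * survivors so callers can just take the first entry. */
capability_set_t *
merge_capability_sets(const capability_set_t *a, const capability_set_t *b)
{
    capability_set_t *out = capability_set_new();

    for (unsigned i = 0; i < a->num_caps; i++)
        for (unsigned j = 0; j < b->num_caps; j++)
            if (caps_compatible(&a->caps[i], &b->caps[j]))
                capability_set_add(out,
                                   caps_intersect(&a->caps[i],
                                                  &b->caps[j]));

    /* Policy hook: e.g., prefer compressed/tiled layouts over linear
     * when both ends support them. */
    capability_set_sort(out, policy_prefer_performance);
    return out;
}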
Additionally, I again offer the goal of an optimal gralloc
implementation built on top of the allocator mechanism. I find it
difficult to imagine building gralloc on top of Vulkan or EGL and DRM.
Does such a solution seem feasible to you? I've not researched this
significantly myself, but Google Android engineers shared that concern
when we had the initial discussions at XDC 2016.
Kristian Kristensen wrote:
I understand that you may have N knobs with a total of more than
56 bits that configure your tiling/swizzling for color buffers. What I don't
buy is that you need all those combinations when passing buffers around
between codecs, cameras and display controllers. Even if you're sharing
between the same 3D drivers in different processes, I expect just locking
down, say, 64 different combinations (you can add more over time) and
assigning each a modifier would be sufficient. I doubt you'd extract
meaningful performance gains from going all the way to a blob.
If someone has N knobs available, I don't understand why there
shouldn't be a mechanism that allows making use of them all, regardless
of performance numbers.
Daniel Vetter wrote:
Yeah, that part was all clear. I'd want more details of what exact
kind of metadata. fast-clear colors? tiling layouts? aux data for the
compressor? hiz (or whatever you folks call it) tree?
As you say, we've discussed massive amounts of different variants on
this, and there's different answers for different questions. Consensus
seems to be that bigger stuff (compression data, hiz, clear colors,
...) should be stored in aux planes, while the exact layout and what
kind of aux planes you have are encoded in the modifier.
My understanding is that capability sets may include all the metadata you
mentioned. Besides tiling/swizzling layout and compression parameters,
things like zero-bandwidth-clears (I guess the same as or similar to
fast-clear colors?), hiz-like data, device-local properties such as
caching, or locality information could/will also be included in a
capability set. We are even considering encoding some sort of usage
transition information in the capability set itself.
I think there's some nuance here. The format of compression metadata
would clearly be a capability set thing. The compression data itself
would indeed be in some auxiliary surface on most/all hardware. Things
like fast clears are harder to nail down because implementations seem
more varied there. It might be very awkward on some hardware to put the
necessary metadata in a DRM FB plane, while that might be the only
reasonable way to accomplish it on other hardware. I think we'll have
to work through some corner cases across lots of hardware before this
bottoms out.
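
For the hardware where it does work, the aux-plane path already exists
in today's API; the metadata just rides along as an extra FB plane.
Illustrative values, error handling omitted:

#include <stdint.h>
#include <drm_fourcc.h>
#include <drm_mode.h>
#include <xf86drmMode.h>

/* Color data in plane 0, compression/fast-clear metadata in plane 1 of
 * the same BO, with the modifier describing both. */
static int
add_fb_with_aux(int fd, uint32_t width, uint32_t height,
                uint32_t bo, uint32_t color_pitch,
                uint32_t aux_pitch, uint32_t aux_offset,
                uint64_t modifier, uint32_t *fb_id)
{
    uint32_t handles[4]   = { bo, bo, 0, 0 };
    uint32_t pitches[4]   = { color_pitch, aux_pitch, 0, 0 };
    uint32_t offsets[4]   = { 0, aux_offset, 0, 0 };
    uint64_t modifiers[4] = { modifier, modifier, 0, 0 };

    return drmModeAddFB2WithModifiers(fd, width, height,
                                      DRM_FORMAT_XRGB8888,
                                      handles, pitches, offsets,
                                      modifiers, fb_id,
                                      DRM_MODE_FB_MODIFIERS);
}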
Thanks,
-James
Thanks,
Miguel.