On 14.09.21 at 10:27, Thomas Hellström wrote:
On Tue, 2021-09-14 at 09:40 +0200, Christian König wrote:
On 13.09.21 at 14:41, Thomas Hellström wrote:
[SNIP]
Let's say you have a struct ttm_object_vram and a struct ttm_object_gtt, both subclassing drm_gem_object. Then I'd say a driver would want to subclass those to attach identical data, extend functionality and provide a single i915_gem_object to the rest of the driver, which couldn't care less whether it's vram or gtt? Wouldn't you say having a separate struct ttm_object_vram and a struct ttm_object_gtt in this case would be awkward? We *want* to allow common handling.
Yeah, but that's a bad idea. This is like diamond inheritance in C++. When you need the same functionality in different backends you implement that as a separate object and then add a parent class.

It's the exact same situation here. With struct ttm_resource you let *different* implementation flavours subclass it, which makes it awkward for the driver to extend the functionality in a common way by subclassing, unless the driver only uses a single implementation.
Well, the driver should use separate implementations for its different domains as much as possible.
Hmm, now you lost me a bit. Are you saying that the way we do dynamic backends in the struct ttm_buffer_object to facilitate driver subclassing is a bad idea, or that the RFC with the backpointer is a bad idea? Or if you mean diamond inheritance is bad, yes, that's basically my point.
That diamond inheritance is a bad idea. What I don't understand is why you need that in the first place.

Information that you attach to a resource is specific to the domain the resource is allocated from. So why do you want to attach the same information to resources from different domains?
Again, for the same reason we do it with struct i915_gem_objects and struct ttm_tts: to extend the functionality. I mean, information that we attach when we subclass a struct ttm_buffer_object doesn't necessarily care about whether it's a VRAM or a GTT object. In exactly the same way, information that we want to attach to a struct ttm_resource doesn't necessarily care whether it's a system or a VRAM resource, and need not be specific to either of those.
In this particular case, as memory management becomes asynchronous, you can't attach things like sg-tables and GPU binding information to the GEM object anymore, because the object may have a number of migrations in the pipeline. Such things need to be attached to the structure that abstracts the memory allocation, and which may have a completely different lifetime than the object itself.
In our particular case we want to attach information for cached page lookup and an sg-table, and moving forward probably also the GPU binding (vma) information, and that is the same information for any ttm_resource regardless of where it's allocated from.
Typical example: a pipelined GPU operation happening before an async eviction goes wrong. We need to error capture and reset. But if we look at the object for error capturing, it has already been updated to point at an after-eviction resource, and the resource sits on a ghost object (or, in the future when ghost objects go away, perhaps in limbo somewhere).
We need to capture the memory pointed to by the struct ttm_resource the GPU was referencing, and to be able to do that we need to cache driver-specific info on the resource: typically an sg-list and GPU binding information.
Anyway, that cached information needs to be destroyed together with the resource, and thus we need to be able to access it from the resource in some way, regardless of whether it's a pointer or whether we embed the struct resource.
I think it's pretty important here that we (using the inheritance
diagram below) recognize the need for D to inherit from A, just like we
do for objects or ttm_tts.
Looking at
https://en.wikipedia.org/wiki/Multiple_inheritance#/media/File:Diamond_inheritance.svg

A would be the struct ttm_resource itself,
D would be struct i915_resource,
B would be struct ttm_range_mgr_node,
C would be struct i915_ttm_buddy_resource,

and we need to resolve the ambiguity using the awkward union construct iff we need to derive from both B and C.
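To make that concrete, a rough sketch of the union construct (struct i915_resource and its members are hypothetical, purely for illustration):

struct i915_resource {
	union {
		/* B: subclasses struct ttm_resource */
		struct ttm_range_mgr_node range;
		/* C: also subclasses struct ttm_resource */
		struct i915_ttm_buddy_resource buddy;
	};
	/* Common driver data we want on every resource: */
	struct sg_table *st;	/* cached sg-table */
	/* ... cached page lookup, GPU binding (vma) info ... */
};

Any code touching the allocator-specific parts must know which union member is live, which is exactly the ambiguity that needs resolving.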
Struct ttm_buffer_object and struct ttm_tt instead have B) and C) being dynamic backends of A), or a single type derived from A). Hence the problem doesn't exist for those types.
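For comparison, a sketch of how that looks for the bo case (using the i915_gem_object name from the example above):

/* A embeds a *pointer* to its dynamic backend... */
struct ttm_buffer_object {
	/* ...existing members... */
	struct ttm_tt *ttm;	/* NULL, or a driver-subclassed ttm_tt */
};

/* ...so D can derive from A alone, independent of the backend: */
struct i915_gem_object {
	struct ttm_buffer_object base;
	/* common driver data */
};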
So the question from the last email remains: if ditching this RFC, can we have B) and C) implemented by helpers that can be used from D) and that don't derive from A?
Well, we already have that in the form of drm_mm. I mean, the ttm_range_manager is just relatively small glue code which implements TTM's resource interface using the drm_mm object and a spinlock. IIRC that's less than 200 lines of code.

So you should already have the necessary helpers and just need to implement the resource manager as far as I can see.
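Condensed, that glue is roughly the following (simplified from ttm_range_manager.c from memory, so details may be off):

static int ttm_range_man_alloc(struct ttm_resource_manager *man,
			       struct ttm_buffer_object *bo,
			       const struct ttm_place *place,
			       struct ttm_resource **res)
{
	/* to_range_manager() is the usual container_of() helper */
	struct ttm_range_manager *rman = to_range_manager(man);
	struct ttm_range_mgr_node *node;
	unsigned long lpfn = place->lpfn ? place->lpfn : man->size;
	int ret;

	/* The manager allocates its own subclassed node... */
	node = kzalloc(struct_size(node, mm_nodes, 1), GFP_KERNEL);
	if (!node)
		return -ENOMEM;

	ttm_resource_init(bo, place, &node->base);

	/* ...and drm_mm does the actual range allocation. */
	spin_lock(&rman->lock);
	ret = drm_mm_insert_node_in_range(&rman->mm, &node->mm_nodes[0],
					  node->base.num_pages,
					  bo->page_alignment, 0,
					  place->fpfn, lpfn,
					  DRM_MM_INSERT_BEST);
	spin_unlock(&rman->lock);
	if (ret) {
		kfree(node);
		return ret;
	}

	*res = &node->base;
	return 0;
}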
I mean, I reused the ttm_range_mgr_node for the amdgpu_gtt_mgr and could potentially reuse a bit more of the ttm_range_manager code. But I don't see that as much of an issue; the extra functionality there is just minimal.
Sure, but that would give up the prereq of having reusable resource manager implementations. What happens if someone would like to reuse the buddy manager? And to complicate things even more, the information we attach to VRAM resources also needs to be attached to system resources. Sure, we could probably re-implement a combined system-buddy-range manager, but that seems overly complex.
The other object examples resolve the diamond inheritance with a pointer to the specialization (B, C) and let D derive from A. TTM resources do it backwards. If we can just recognize that and ponder the easiest way to resolve it given the current design, I actually think we'd arrive at a backpointer to allow downcasting from A to D.
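In code, the difference boils down to something like this (sketch only; the member and helper names are illustrative, not necessarily what the RFC literally adds):

/* A grows one driver-opaque pointer, set at alloc time... */
struct ttm_resource {
	/* ...existing members... */
	void *priv;
};

/* ...so the driver can downcast any resource, no matter whether it
 * is B- or C-backed, to its common data D:
 */
static inline struct i915_resource *
to_i915_resource(struct ttm_resource *res)
{
	return res->priv;
}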
Yeah, but I think you are approaching that from the wrong side.
For use cases like this I think you should probably have the following
objects and inheritances:
1. Driver-specific objects like i915_sg and i915_vma which don't inherit anything from TTM.
2. i915_vram_node which inherits from ttm_resource or a potential ttm_buddy_allocator.
3. i915_gtt_node which inherits from ttm_range_mgr_node.
4. Maybe i915_sys_node which inherits from ttm_resource as well.
The managers for the individual domains then provide the glue code to implement both the TTM resource interface and a driver-specific interface to access the driver objects.
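In code, that layering would look something like this (just a sketch, the i915 names are made up):

/* 1. plain driver objects, no TTM inheritance */
struct i915_sg;
struct i915_vma;

/* 2. VRAM node subclassing ttm_resource (or a buddy allocator node) */
struct i915_vram_node {
	struct ttm_resource base;
	struct i915_sg *sg;
	struct i915_vma *vma;
};

/* 3. GTT node subclassing ttm_range_mgr_node */
struct i915_gtt_node {
	struct ttm_range_mgr_node base;
	struct i915_sg *sg;
	struct i915_vma *vma;
};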
Amdgpu just uses a switch/case for now, but you could just as well extend the ttm_resource_manager_func table and upcast the manager inside the driver.
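The switch/case variant would be something like this (again a sketch, building on the hypothetical nodes above):

static struct i915_sg *i915_resource_to_sg(struct ttm_resource *res)
{
	switch (res->mem_type) {
	case TTM_PL_VRAM:
		return container_of(res, struct i915_vram_node, base)->sg;
	case TTM_PL_TT:
		return container_of(res, struct i915_gtt_node,
				    base.base)->sg;
	default:
		return NULL;
	}
}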
Regards,
Christian.