Re: [PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> · Wed, 23 Oct 2024 15:27:05 +0100

On 23/10/2024 14:31, Li, Yunxiang (Teddy) wrote:
[AMD Official Use Only - AMD Internal Distribution Only]

From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
Sent: Wednesday, October 23, 2024 8:25
On 23/10/2024 13:12, Christian König wrote:
Am 23.10.24 um 13:37 schrieb Tvrtko Ursulin:

On 23/10/2024 10:14, Christian König wrote:
Am 23.10.24 um 09:38 schrieb Tvrtko Ursulin:

On 22/10/2024 17:24, Christian König wrote:
Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy):
[Public]

+static uint32_t fold_memtype(uint32_t memtype) {
In general please add prefixes to even static functions, e.g.
amdgpu_vm_ or
amdgpu_bo_.

+   /* Squash private placements into 'cpu' to keep the legacy
userspace view.
*/
+   switch (mem_type) {
+   case TTM_PL_VRAM:
+   case TTM_PL_TT:
+           return memtype
+   default:
+           return TTM_PL_SYSTEM;
+   }
+}
+
+static uint32_t bo_get_memtype(struct amdgpu_bo *bo) {
That whole function belongs into amdgpu_bo.c
Do you mean bo_get_memtype or fold_memtype? I debated whether
bo_get_memtype should go into amdgpu_vm.c or amdgpu_bo.c, and
since it's using fold_memtype and only useful for memory stats
because of folding the private placements I just left them here
together with the other mem stats code.

I can move it to amdgpu_bo.c make it return the memtype verbatim
and just fold it when I do the accounting.

I think that folding GDS, GWS and OA into system is also a bug. We
should really not doing that.

Just wanted to point out for this round that the code to query the
current placement from a BO should probably go into amdgpu_bo.c
and not amdgpu_vm.c

+   struct ttm_resource *res = bo->tbo.resource;
+   const uint32_t domain_to_pl[] = {
+           [ilog2(AMDGPU_GEM_DOMAIN_CPU)]      =
+TTM_PL_SYSTEM,
+           [ilog2(AMDGPU_GEM_DOMAIN_GTT)]      = TTM_PL_TT,
+           [ilog2(AMDGPU_GEM_DOMAIN_VRAM)]     =
TTM_PL_VRAM,
+           [ilog2(AMDGPU_GEM_DOMAIN_GDS)]      =
+AMDGPU_PL_GDS,
+           [ilog2(AMDGPU_GEM_DOMAIN_GWS)]      =
+AMDGPU_PL_GWS,
+           [ilog2(AMDGPU_GEM_DOMAIN_OA)]       =
AMDGPU_PL_OA,
+           [ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] =
AMDGPU_PL_DOORBELL,
+   };
+   uint32_t domain;
+
+   if (res)
+           return fold_memtype(res->mem_type);
+
+   /*
+    * If no backing store use one of the preferred domain for
basic
+    * stats. We take the MSB since that should give a
+reasonable
+    * view.
+    */
+   BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT ||
TTM_PL_VRAM <
TTM_PL_SYSTEM);
+   domain = fls(bo->preferred_domains &
+AMDGPU_GEM_DOMAIN_MASK);
+   if (drm_WARN_ON_ONCE(&adev->ddev,
+                        domain == 0 || --domain >=
ARRAY_SIZE(domain_to_pl)))
It's perfectly legal to create a BO without a placement. That
one just won't have a backing store.

This is lifted from the previous change I'm rebasing onto. I
think what it’s trying to do is if the BO doesn't have a
placement, use the "biggest" (VRAM > TT > SYSTEM) preferred
placement for the purpose of accounting. Previously we just
ignore BOs that doesn't have a placement. I guess there's
argument for going with either approaches.

I was not arguing, I'm simply pointing out a bug. It's perfectly
valid for bo->preferred_domains to be 0.

So the following WARN_ON() that no bit is set is incorrect.

+           return 0;
+   return fold_memtype(domain_to_pl[domain])
That would need specular execution mitigation if I'm not
completely mistaken.

Better use a switch/case statement.

Do you mean change the array indexing to a switch statement?

Yes.

Did you mean array_index_nospec?

Yes.

Domain is not a direct userspace input and is calculated from the
mask which sanitized to allowed values prior to this call. So I
*think* switch is an overkill but don't mind it either. Just
commenting FWIW.

I missed that the mask is applied.

Thinking more about it I'm not sure if we should do this conversion
in the first place. IIRC Tvrtko you once suggested a patch which
switched a bunch of code to use the TTM placement instead of the
UAPI flags.

Maybe 8fb0efb10184 ("drm/amdgpu: Reduce mem_type to domain double
indirection") is what are you thinking of?

Yes, exactly that one.

Going more into this direction I think when we want to look at the
current placement we should probably also use the TTM PL enumeration
directly.

It does this already. The placement flags are just to "invent" a TTM
PL enum when bo->tbo.resource == NULL.

Ah, good point! I though we would do the mapping the other way around.

In this case that is even more something we should probably not do at all.

When bo->tbo.resource is NULL then this BO isn't resident at all, so
it should not account to resident memory.

It doesn't, only for total. I should have pasted more context..:

       struct ttm_resource *res = bo->tbo.resource; ...
          /* DRM stats common fields: */

          stats[type].total += size;
          if (drm_gem_object_is_shared_for_memory_stats(obj))
                  stats[type].drm.shared += size;
          else
                  stats[type].drm.private += size;

          if (res) {
                  stats[type].drm.resident += size

So if no current placement it does not count towards drm-resident-, only
drm-total- (which is drm.private + drm.resident). Total and resident intend to be
analogue to for instance VIRT and RES in top(1), or VZS and RSS in ps(1).

Again, based of the same enum. Not sure if you have something other
in mind or you are happy with that?

I think that for drm-total-* we should use the GEM flags and for
drm-resident-* we should use the TTM placement.

Agreed! :)

Oof I missed the distinction between resident and total as well. Just want to double confirm the drm-total- semantics.

Does drm-total- track the BOs that prefer the placement (derived from the preferred domain) and drm-resident- track the actual placement, or does drm-total- track drm-resident- plus BOs that don't have a placement but prefers here?

Preferred domain is only used as an aid when there is no backing store. 
Putting that aside, it is supposed to be simple:

Total - BO exists, whether or not it currently has a backing store.

Resident - BO has a backing store.

Active and purgeable are a sub-variant of resident.

Again, preferred placement only comes into the picture (in the current 
implementation) when there is not backing store. Because we must account 
the total and looking at the preferred tells us to where.

Yeah it may be that one second you have:

total-vram: 4 KiB
resident-vram: 0
total-system: 0
resident-system: 0

And the second:

total-vram: 0
resident-vram: 0
total-system: 4 KiB
resident-system: 4 KiB

All with a single 4k BO, which happened to instantiate its backing store 
in it's 2nd preferred placement.

But IMO it is better than not accounting it under total anywhere in the 
first case.

Have I managed to explain it good enough?

Regards,

Tvrtko

Then what Teddy does is IMO only tangential, he just changes when
stats are collected and not this aspect.

Yeah, right but we should probably fix it up in the right way while on it.

Okay, we just need to align on is there a problem and how to fix it.

To fold or not the special placements (GWS, GDS & co) is also
tangential. In my patch I just preserved the legacy behaviour so it
can easily be tweaked on top.

Yeah, but again the original behavior is completely broken.

GWS, GDS and OA are counted in blocks of HW units (multiplied by
PAGE_SIZE IIRC to avoid some GEM&TTM warnings).

When you accumulate that anywhere in the memory stats then that is
just completely off.

Ooops. :) Are they backed by some memory though, be it system or VRAM?

Regards,

Tvrtko