Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using drm buddy

Felix Kuehling <felix.kuehling@xxxxxxx> · Wed, 15 Feb 2023 17:18:35 -0500

Am 2023-02-15 um 05:44 schrieb Christian König:
Am 15.02.23 um 03:51 schrieb Xiao, Shane:
For public review
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Wednesday, February 15, 2023 3:02 AM
To: Xiao, Shane <shane.xiao@xxxxxxx>; Paneer Selvam, Arunpravin
<Arunpravin.PaneerSelvam@xxxxxxx>
Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when 
using
drm buddy

Am 14.02.23 um 15:53 schrieb Xiao, Shane:
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Tuesday, February 14, 2023 8:41 PM
To: Xiao, Shane <shane.xiao@xxxxxxx>; brahma_sw_dev
<brahma_sw_dev@xxxxxxx>
Cc: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
using drm buddy

Am 14.02.23 um 12:18 schrieb Shane Xiao:
Since the VRAM manager changed from drm mm to drm buddy. It's not
necessary to allocate 2MB aligned VRAM for more than 2MB unaligned
size, and then do trim. This method improves the allocation
efficiency and reduces memory fragmentation.
Well that is a trade off.

Allocating the BO as one contiguous chunk and then trimming is
beneficial because if we then later need it contiguous we don't need
to re-allocate and copy. This can be needed to display something for
example.
Hi Christian,

This case means that you allocate BO that is unnecessary to be 
continuous at first time,
and latter the BO should be continuous. I'm not familiar with 
display. Could you give me
a few more specific examples ?

On most generations DCE/DCN hardware needs the buffer contiguous to be 
able to scanout from it.

Only newer APUs can use S/G to scanout from system memory pages.

Yes, I agree that one contiguous chunk may get beneficial sometimes.
But as far as I know, you cannot guarantee that amdgpu_vram_mgr_new
can get one contiguous chunk  if you don't set TTM_PL_FLAG_CONTIGUOUS
flags.
For example, if you want to allocate 4M+4K BO, it will allocate one 
4M block
+ one 2M block which is unnecessary to be continuous, then 2M block 
will be
trimmed.

Oh, that's indeed not something which should happen. Sounds more like a
bug fix then.
Yes, I think this case should not be happened.
Actually, I'm not sure that why the allocated BO should be aligned 
with pages_per_block, which is set to 2MB by default.
Does this help improve performance when allocating 2M or above BO?
 From my point of view, the TLB may be one of reason of this. But I'm 
not sure about this.

Yes, we try to use allocations which are as contiguous as much as 
possible for better TLB usage.

Especially for some compute use cases this can make a >20% performance 
difference.
We actually found that >2MB virtual address alignment was hurting 
performance due to cache line aliasing. So we can't take advantage of 
>2MB pages in our page tables.

Regards,
  Felix



Regards,
Christian.


Best Regards,
Shane

On the other hand I completely agree allocating big and then trimming
creates more fragmentation than necessary.

Do you have some test case which can show the difference?
I have use rocrtst to show the difference.
The attachment is shown that after applying this patch, the order < 
9 total
vram size decrease from 99MB to 43MB.
And the latter has more higher order block memory.
Arun can you take a look? That problem here sounds important.

Thanks,
Christian.

BTW: No need to discuss that on the internal mailing list, please use
the public one instead.

I will send it to public. Thank you for your remind.

Best Regards,
Shane

Regards,
Christian.

Signed-off-by: Shane Xiao <shane.xiao@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
    1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 75c80c557b6e..3fea58f9427c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct
ttm_resource_manager *man,
            /* Limit maximum size to 2GiB due to SG table 
limitations */
            size = min(remaining_size, 2ULL << 30);

-        if (size >= (u64)pages_per_block << PAGE_SHIFT)
+        if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
                min_block_size = (u64)pages_per_block <<
PAGE_SHIFT;
            cur_size = size;