Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using drm buddy

Christian König <christian.koenig@xxxxxxx> · Thu, 16 Feb 2023 08:09:15 +0100

Am 16.02.23 um 07:48 schrieb Xiao, Shane:
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, February 16, 2023 6:19 AM
To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Xiao, Shane
<shane.xiao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>;
Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using
drm buddy


Am 2023-02-15 um 05:44 schrieb Christian König:
Am 15.02.23 um 03:51 schrieb Xiao, Shane:
For public review
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Wednesday, February 15, 2023 3:02 AM
To: Xiao, Shane <shane.xiao@xxxxxxx>; Paneer Selvam, Arunpravin
<Arunpravin.PaneerSelvam@xxxxxxx>
Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
using drm buddy

Am 14.02.23 um 15:53 schrieb Xiao, Shane:
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Tuesday, February 14, 2023 8:41 PM
To: Xiao, Shane <shane.xiao@xxxxxxx>; brahma_sw_dev
<brahma_sw_dev@xxxxxxx>
Cc: Paneer Selvam, Arunpravin
<Arunpravin.PaneerSelvam@xxxxxxx>
Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
when
using drm buddy

Am 14.02.23 um 12:18 schrieb Shane Xiao:
Since the VRAM manager changed from drm mm to drm buddy. It's
not
necessary to allocate 2MB aligned VRAM for more than 2MB
unaligned size, and then do trim. This method improves the
allocation efficiency and reduces memory fragmentation.
Well that is a trade off.

Allocating the BO as one contiguous chunk and then trimming is
beneficial because if we then later need it contiguous we don't
need to re-allocate and copy. This can be needed to display
something for
example.
Hi Christian,

This case means that you allocate BO that is unnecessary to be
continuous at first time, and latter the BO should be continuous. I'm
not familiar with display. Could you give me a few more specific
examples ?
On most generations DCE/DCN hardware needs the buffer contiguous to
be
able to scanout from it.

Only newer APUs can use S/G to scanout from system memory pages.

Yes, I agree that one contiguous chunk may get beneficial sometimes.
But as far as I know, you cannot guarantee that
amdgpu_vram_mgr_new
can get one contiguous chunk  if you don't set
TTM_PL_FLAG_CONTIGUOUS flags.
For example, if you want to allocate 4M+4K BO, it will allocate one
4M block
+ one 2M block which is unnecessary to be continuous, then 2M block
will be
trimmed.

Oh, that's indeed not something which should happen. Sounds more
like a bug fix then.
Yes, I think this case should not be happened.
Actually, I'm not sure that why the allocated BO should be aligned
with pages_per_block, which is set to 2MB by default.
Does this help improve performance when allocating 2M or above BO?
  From my point of view, the TLB may be one of reason of this. But I'm
not sure about this.
Yes, we try to use allocations which are as contiguous as much as
possible for better TLB usage.

Especially for some compute use cases this can make a >20% performance
difference.
We actually found that >2MB virtual address alignment was hurting
performance due to cache line aliasing. So we can't take advantage of  >2MB
pages in our page tables.

Regards,
    Felix
Yes, if we want to take advantage of 2M TLB usage, we should keep virtual address aligned.

As you have mentioned that cache line aliasing issue, I'm confused about this.
If 2MB aligned VA get the right PA from TLB or page table and the cache line addressing mode is not changed,
the cache line aliasing issue should not happen here.
Is there something wrong with my understanding? Or maybe there are some backgrounds that I didn't know.

The problem is with virtual address alignments > 2MiB (or whatever the 
big cache line size is).

Let's assume an example where you have a lot of buffer each 66MiB in 
size. When you align those to 2MiB in the virtual address space you end 
up with

64MiB..2MiB..62MiB..4MiB..60MiB... etc...

In your address space. In this configuration each 2MiB cache line is 
equally used.

But if you align the buffers to say the next power of two (128MiB) you 
end up like this:

64MiB..2MiB..62MiB hole..64MiB..2MiB..62MiB hole... etc....

In this case the first 2MiB cache line of each buffer is used twice as 
much as all the other cache lines. This can hurt performance very badly.

Regards,
Christian.

Best Regards,
Shane
  

Regards,
Christian.

Best Regards,
Shane

On the other hand I completely agree allocating big and then
trimming creates more fragmentation than necessary.

Do you have some test case which can show the difference?
I have use rocrtst to show the difference.
The attachment is shown that after applying this patch, the order <
9 total
vram size decrease from 99MB to 43MB.
And the latter has more higher order block memory.
Arun can you take a look? That problem here sounds important.

Thanks,
Christian.

BTW: No need to discuss that on the internal mailing list, please
use the public one instead.

I will send it to public. Thank you for your remind.

Best Regards,
Shane

Regards,
Christian.

Signed-off-by: Shane Xiao <shane.xiao@xxxxxxx>
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 75c80c557b6e..3fea58f9427c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct
ttm_resource_manager *man,
             /* Limit maximum size to 2GiB due to SG table
limitations */
             size = min(remaining_size, 2ULL << 30);

-        if (size >= (u64)pages_per_block << PAGE_SHIFT)
+        if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
                 min_block_size = (u64)pages_per_block <<
PAGE_SHIFT;
             cur_size = size;