> -----Original Message-----
> From: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
> Sent: Thursday, February 16, 2023 3:24 PM
> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Xiao, Shane
> <shane.xiao@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>;
> Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using
> drm buddy
>
> This patch seems to pass the rocm memory stress test case.
> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@xxxxxxx>

Hi Christian,

Do you think we should upstream this patch?

Best Regards,
Shane

> On 2/16/2023 12:39 PM, Christian König wrote:
> > Am 16.02.23 um 07:48 schrieb Xiao, Shane:
> >>> -----Original Message-----
> >>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> >>> Sent: Thursday, February 16, 2023 6:19 AM
> >>> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Xiao, Shane
> >>> <shane.xiao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>;
> >>> Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
> >>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
> >>> using drm buddy
> >>>
> >>> Am 2023-02-15 um 05:44 schrieb Christian König:
> >>>> Am 15.02.23 um 03:51 schrieb Xiao, Shane:
> >>>>> For public review
> >>>>>> -----Original Message-----
> >>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> >>>>>> Sent: Wednesday, February 15, 2023 3:02 AM
> >>>>>> To: Xiao, Shane <shane.xiao@xxxxxxx>; Paneer Selvam, Arunpravin
> >>>>>> <Arunpravin.PaneerSelvam@xxxxxxx>
> >>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
> >>>>>> when using drm buddy
> >>>>>>
> >>>>>> Am 14.02.23 um 15:53 schrieb Xiao, Shane:
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> >>>>>>>> Sent: Tuesday, February 14, 2023 8:41 PM
> >>>>>>>> To: Xiao, Shane
> >>>>>>>> <shane.xiao@xxxxxxx>; brahma_sw_dev
> >>>>>>>> <brahma_sw_dev@xxxxxxx>
> >>>>>>>> Cc: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
> >>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
> >>>>>>>> when using drm buddy
> >>>>>>>>
> >>>>>>>> Am 14.02.23 um 12:18 schrieb Shane Xiao:
> >>>>>>>>> Since the VRAM manager changed from drm mm to drm buddy, it's
> >>>>>>>>> no longer necessary to allocate 2MB-aligned VRAM for a size
> >>>>>>>>> larger than 2MB but not 2MB-aligned and then trim it. This
> >>>>>>>>> method improves allocation efficiency and reduces memory
> >>>>>>>>> fragmentation.
> >>>>>>>> Well, that is a trade-off.
> >>>>>>>>
> >>>>>>>> Allocating the BO as one contiguous chunk and then trimming is
> >>>>>>>> beneficial because if we later need it contiguous, we don't
> >>>>>>>> need to re-allocate and copy. This can be needed to display
> >>>>>>>> something, for example.
> >>>>> Hi Christian,
> >>>>>
> >>>>> This case means that you allocate a BO that does not need to be
> >>>>> contiguous at first, and later the BO needs to be contiguous.
> >>>>> I'm not familiar with display. Could you give me a few more
> >>>>> specific examples?
> >>>> On most generations, DCE/DCN hardware needs the buffer to be
> >>>> contiguous to be able to scan out from it.
> >>>>
> >>>> Only newer APUs can use S/G to scan out from system memory pages.
> >>>>
> >>>>>>> Yes, I agree that one contiguous chunk may be beneficial
> >>>>>>> sometimes. But as far as I know, you cannot guarantee that
> >>>>>>> amdgpu_vram_mgr_new gets one contiguous chunk if you don't set
> >>>>>>> the TTM_PL_FLAG_CONTIGUOUS flag.
> >>>>>>> For example, if you want to allocate a 4M+4K BO, it will
> >>>>>>> allocate one 4M block + one 2M block, which do not need to be
> >>>>>>> contiguous, and then the 2M block will be trimmed.
> >>>>>>
> >>>>>> Oh, that's indeed not something which should happen.
> >>>>>> Sounds more like a bug fix then.
> >>>>> Yes, I think this case should not happen.
> >>>>> Actually, I'm not sure why the allocated BO should be aligned to
> >>>>> pages_per_block, which is set to 2MB by default.
> >>>>> Does this help improve performance when allocating a 2M or larger BO?
> >>>>> From my point of view, the TLB may be one reason for this, but
> >>>>> I'm not sure about it.
> >>>> Yes, we try to make allocations as contiguous as possible for
> >>>> better TLB usage.
> >>>>
> >>>> Especially for some compute use cases this can make a >20%
> >>>> performance difference.
> >>> We actually found that >2MB virtual address alignment was hurting
> >>> performance due to cache line aliasing. So we can't take advantage
> >>> of >2MB pages in our page tables.
> >>>
> >>> Regards,
> >>>   Felix
> >> Yes, if we want to take advantage of 2M TLB entries, we should keep
> >> the virtual address aligned.
> >>
> >> As for the cache line aliasing issue you mentioned, I'm confused
> >> about this: if a 2MB-aligned VA gets the right PA from the TLB or
> >> page table, and the cache line addressing mode is not changed, the
> >> cache line aliasing issue should not happen here.
> >> Is there something wrong with my understanding? Or maybe there is
> >> some background that I don't know.
> >
> > The problem is with virtual address alignments > 2MiB (or whatever
> > the big cache line size is).
> >
> > Let's assume an example where you have a lot of buffers, each 66MiB
> > in size. When you align those to 2MiB in the virtual address space
> > you end up with
> >
> > 64MiB..2MiB..62MiB..4MiB..60MiB... etc...
> >
> > in your address space. In this configuration each 2MiB cache line is
> > equally used.
> >
> > But if you align the buffers to, say, the next power of two (128MiB)
> > you end up like this:
> >
> > 64MiB..2MiB..62MiB hole..64MiB..2MiB..62MiB hole... etc....
> >
> > In this case the first 2MiB cache line of each buffer is used twice
> > as much as all the other cache lines. This can hurt performance very
> > badly.
> >
> > Regards,
> > Christian.
> >
> >> Best Regards,
> >> Shane
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Best Regards,
> >>>>> Shane
> >>>>>
> >>>>>>>> On the other hand, I completely agree that allocating big and
> >>>>>>>> then trimming creates more fragmentation than necessary.
> >>>>>>>>
> >>>>>>>> Do you have some test case which can show the difference?
> >>>>>>> I have used rocrtst to show the difference.
> >>>>>>> The attachment shows that after applying this patch, the total
> >>>>>>> VRAM size in blocks of order < 9 decreases from 99MB to 43MB,
> >>>>>>> and there are more higher-order blocks.
> >>>>>> Arun, can you take a look? The problem here sounds important.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Christian.
> >>>>>>
> >>>>>>>> BTW: No need to discuss this on the internal mailing list;
> >>>>>>>> please use the public one instead.
> >>>>>>>>
> >>>>>>> I will send it to the public list. Thank you for the reminder.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Shane
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>>> Signed-off-by: Shane Xiao <shane.xiao@xxxxxxx>
> >>>>>>>>> ---
> >>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
> >>>>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> index 75c80c557b6e..3fea58f9427c 100644
> >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> @@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
> >>>>>>>>>   		/* Limit maximum size to 2GiB due to SG table limitations */
> >>>>>>>>>   		size = min(remaining_size, 2ULL << 30);
> >>>>>>>>>
> >>>>>>>>> -		if (size >= (u64)pages_per_block << PAGE_SHIFT)
> >>>>>>>>> +		if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
> >>>>>>>>>   			min_block_size = (u64)pages_per_block << PAGE_SHIFT;
> >>>>>>>>>
> >>>>>>>>>   		cur_size = size;
> >