> -----Original Message-----
> From: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
> Sent: Thursday, February 16, 2023 3:24 PM
> To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Xiao, Shane
> <shane.xiao@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>;
> Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when using
> drm buddy
>
> This patch seems to pass the rocm memory stress test case.
> Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@xxxxxxx>

Hi Christian,

Do you think we should upstream this patch?

Best Regards,
Shane

> On 2/16/2023 12:39 PM, Christian König wrote:
> > Am 16.02.23 um 07:48 schrieb Xiao, Shane:
> >>> -----Original Message-----
> >>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> >>> Sent: Thursday, February 16, 2023 6:19 AM
> >>> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Xiao, Shane
> >>> <shane.xiao@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>;
> >>> Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
> >>> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation when
> >>> using drm buddy
> >>>
> >>> Am 2023-02-15 um 05:44 schrieb Christian König:
> >>>> Am 15.02.23 um 03:51 schrieb Xiao, Shane:
> >>>>> For public review
> >>>>>> -----Original Message-----
> >>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> >>>>>> Sent: Wednesday, February 15, 2023 3:02 AM
> >>>>>> To: Xiao, Shane <shane.xiao@xxxxxxx>; Paneer Selvam, Arunpravin
> >>>>>> <Arunpravin.PaneerSelvam@xxxxxxx>
> >>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
> >>>>>> when using drm buddy
> >>>>>>
> >>>>>> Am 14.02.23 um 15:53 schrieb Xiao, Shane:
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> >>>>>>>> Sent: Tuesday, February 14, 2023 8:41 PM
> >>>>>>>> To: Xiao, Shane
> >>>>>>>> <shane.xiao@xxxxxxx>; brahma_sw_dev
> >>>>>>>> <brahma_sw_dev@xxxxxxx>
> >>>>>>>> Cc: Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam@xxxxxxx>
> >>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: optimize VRAM allocation
> >>>>>>>> when using drm buddy
> >>>>>>>>
> >>>>>>>> Am 14.02.23 um 12:18 schrieb Shane Xiao:
> >>>>>>>>> Since the VRAM manager changed from drm mm to drm buddy, it's
> >>>>>>>>> no longer necessary to allocate 2MB-aligned VRAM for a size
> >>>>>>>>> larger than 2MB but not 2MB-aligned and then trim it. This
> >>>>>>>>> method improves allocation efficiency and reduces memory
> >>>>>>>>> fragmentation.
> >>>>>>>> Well, that is a trade-off.
> >>>>>>>>
> >>>>>>>> Allocating the BO as one contiguous chunk and then trimming is
> >>>>>>>> beneficial because if we later need it contiguous, we don't
> >>>>>>>> need to re-allocate and copy. This can be needed to display
> >>>>>>>> something, for example.
> >>>>> Hi Christian,
> >>>>>
> >>>>> This case means that you allocate a BO that does not need to be
> >>>>> contiguous at first, and later the BO needs to be contiguous.
> >>>>> I'm not familiar with display. Could you give me a few more
> >>>>> specific examples?
> >>>> On most generations, DCE/DCN hardware needs the buffer to be
> >>>> contiguous to be able to scan out from it.
> >>>>
> >>>> Only newer APUs can use S/G to scan out from system memory pages.
> >>>>
> >>>>>>> Yes, I agree that one contiguous chunk may be beneficial
> >>>>>>> sometimes. But as far as I know, you cannot guarantee that
> >>>>>>> amdgpu_vram_mgr_new gets one contiguous chunk if you don't set
> >>>>>>> the TTM_PL_FLAG_CONTIGUOUS flag.
> >>>>>>> For example, if you want to allocate a 4M+4K BO, it will
> >>>>>>> allocate one 4M block + one 2M block, which do not need to be
> >>>>>>> contiguous, and then the 2M block will be trimmed.
> >>>>>>
> >>>>>> Oh, that's indeed not something which should happen.
> >>>>>> Sounds more like a bug fix then.
> >>>>> Yes, I think this case should not happen.
> >>>>> Actually, I'm not sure why the allocated BO should be aligned to
> >>>>> pages_per_block, which is set to 2MB by default.
> >>>>> Does this help improve performance when allocating a 2M or larger BO?
> >>>>> From my point of view, the TLB may be one reason for this, but
> >>>>> I'm not sure about it.
> >>>> Yes, we try to make allocations as contiguous as possible for
> >>>> better TLB usage.
> >>>>
> >>>> Especially for some compute use cases this can make a >20%
> >>>> performance difference.
> >>> We actually found that >2MB virtual address alignment was hurting
> >>> performance due to cache line aliasing. So we can't take advantage
> >>> of >2MB pages in our page tables.
> >>>
> >>> Regards,
> >>>   Felix
> >> Yes, if we want to take advantage of 2M TLB entries, we should keep
> >> the virtual address aligned.
> >>
> >> As for the cache line aliasing issue you mentioned, I'm confused
> >> about this: if a 2MB-aligned VA gets the right PA from the TLB or
> >> page table, and the cache line addressing mode is not changed, the
> >> cache line aliasing issue should not happen here.
> >> Is there something wrong with my understanding? Or maybe there is
> >> some background that I don't know.
> >
> > The problem is with virtual address alignments > 2MiB (or whatever
> > the big cache line size is).
> >
> > Let's assume an example where you have a lot of buffers, each 66MiB
> > in size. When you align those to 2MiB in the virtual address space
> > you end up with
> >
> > 64MiB..2MiB..62MiB..4MiB..60MiB... etc...
> >
> > in your address space. In this configuration each 2MiB cache line is
> > equally used.
> >
> > But if you align the buffers to, say, the next power of two (128MiB)
> > you end up like this:
> >
> > 64MiB..2MiB..62MiB hole..64MiB..2MiB..62MiB hole... etc....
> >
> > In this case the first 2MiB cache line of each buffer is used twice
> > as much as all the other cache lines. This can hurt performance very
> > badly.
> >
> > Regards,
> > Christian.
> >
> >> Best Regards,
> >> Shane
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Best Regards,
> >>>>> Shane
> >>>>>
> >>>>>>>> On the other hand, I completely agree that allocating big and
> >>>>>>>> then trimming creates more fragmentation than necessary.
> >>>>>>>>
> >>>>>>>> Do you have some test case which can show the difference?
> >>>>>>> I have used rocrtst to show the difference.
> >>>>>>> The attachment shows that after applying this patch, the total
> >>>>>>> VRAM size in blocks of order < 9 decreases from 99MB to 43MB,
> >>>>>>> and there are more higher-order blocks.
> >>>>>> Arun, can you take a look? The problem here sounds important.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Christian.
> >>>>>>
> >>>>>>>> BTW: No need to discuss this on the internal mailing list;
> >>>>>>>> please use the public one instead.
> >>>>>>>>
> >>>>>>> I will send it to the public list. Thank you for the reminder.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Shane
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>>> Signed-off-by: Shane Xiao <shane.xiao@xxxxxxx>
> >>>>>>>>> ---
> >>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 2 +-
> >>>>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> index 75c80c557b6e..3fea58f9427c 100644
> >>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> >>>>>>>>> @@ -453,7 +453,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
> >>>>>>>>>   		/* Limit maximum size to 2GiB due to SG table limitations */
> >>>>>>>>>   		size = min(remaining_size, 2ULL << 30);
> >>>>>>>>>
> >>>>>>>>> -		if (size >= (u64)pages_per_block << PAGE_SHIFT)
> >>>>>>>>> +		if (!(size % ((u64)pages_per_block << PAGE_SHIFT)))
> >>>>>>>>>   			min_block_size = (u64)pages_per_block << PAGE_SHIFT;
> >>>>>>>>>
> >>>>>>>>>   		cur_size = size;
> >