[AMD Official Use Only] > -----Original Message----- > From: Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx> > Sent: Wednesday, January 19, 2022 10:01 AM > To: Russell, Kent <Kent.Russell@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on Arcturus > > > > On 2022-01-19 09:50, Russell, Kent wrote: > > [AMD Official Use Only] > > > >> -----Original Message----- > >> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > >> Sent: Tuesday, January 18, 2022 7:16 PM > >> To: Russell, Kent <Kent.Russell@xxxxxxx>; Huang, JinHuiEric > >> <JinHuiEric.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >> Subject: Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on Arcturus > >> > >> Am 2022-01-18 um 7:08 p.m. schrieb Russell, Kent: > >>> One question inline > >>> > >>> > >>> *KENT RUSSELL*** > >>> > >>> Sr. Software Engineer | Linux Compute Kernel > >>> > >>> 1 Commerce Valley Drive East > >>> > >>> Markham, ON L3T 7X6 > >>> > >>> *O*+(1) 289-695-2122**| Ext 72122 > >>> > >>> > >>> ------------------------------------------------------------------------ > >>> *From:* amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of > >>> Felix Kuehling <felix.kuehling@xxxxxxx> > >>> *Sent:* Tuesday, January 18, 2022 6:36 PM > >>> *To:* Huang, JinHuiEric <JinHuiEric.Huang@xxxxxxx>; > >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx> > >>> *Subject:* Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on > >>> Arcturus > >>> > >>> Am 2022-01-18 um 5:45 p.m. schrieb Eric Huang: > >>>> SDMA FW fixes the hang issue for adding heavy-weight TLB > >>>> flush on Arcturus, so we can enable it. > >>>> > >>>> Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx> > >>> Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > >>> > >>> > >>>> --- > >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ------ > >>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 10 ++++++++-- > >>>> 2 files changed, 8 insertions(+), 8 deletions(-) > >>>> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>>> index a64cbbd943ba..acb4fd973e60 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > >>>> @@ -1892,12 +1892,6 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu( > >>>> true); > >>>> ret = unreserve_bo_and_vms(&ctx, false, false); > >>>> > >>>> - /* Only apply no TLB flush on Aldebaran to > >>>> - * workaround regressions on other Asics. > >>>> - */ > >>>> - if (table_freed && (adev->asic_type != CHIP_ALDEBARAN)) > >>>> - *table_freed = true; > >>>> - > >>>> goto out; > >>>> > >>>> out_unreserve: > >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > >>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > >>>> index b570c0454ce9..485d4c52c7de 100644 > >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > >>>> @@ -1596,6 +1596,12 @@ static int > >>> kfd_ioctl_free_memory_of_gpu(struct file *filep, > >>>> return ret; > >>>> } > >>>> > >>>> +static bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev) { > >>>> + return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) > >>> Do we need to add a check for sdma ver >=8 here? > >> What's the significance of version 8 for Aldebaran? This code was > >> working on Aldebaran without a version check before. Did we ever > >> publicly release an SDMA firmware older than version 8 that for Aldebaran? > > We released v5 for Aldebaran SDMA in ROCm 4.5 . If I remember the ticket correctly, the > same fix for Arcturus was required for Aldebaran and was part of SDMA v8. But Eric is > obviously watching the ticket more closely than I, so I'll defer to him there. > Yes. Aldebaran has the same bug as Arcturus in SDMA, but the bug doesn't > cause GPU hang on Aldebaran. As Felix said heavy-weight TLB flush have > been working on Aldebaran since it was enabled, so we don't need to > check the version for it. Ah perfect, thanks for the clarification! Kent > > Regards, > Eric > > > > Kent > > > >> Regards, > >> Felix > >> > >> > >>> || > >>>> + (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && > >>>> + dev->adev->sdma.instance[0].fw_version >= 18); > >>>> +} > >>>> + > >>>> static int kfd_ioctl_map_memory_to_gpu(struct file *filep, > >>>> struct kfd_process *p, void > >>> *data) > >>>> { > >>>> @@ -1692,7 +1698,7 @@ static int kfd_ioctl_map_memory_to_gpu(struct > >>> file *filep, > >>>> } > >>>> > >>>> /* Flush TLBs after waiting for the page table updates to > >>> complete */ > >>>> - if (table_freed) { > >>>> + if (table_freed || !kfd_flush_tlb_after_unmap(dev)) { > >>>> for (i = 0; i < args->n_devices; i++) { > >>>> peer = kfd_device_by_id(devices_arr[i]); > >>>> if (WARN_ON_ONCE(!peer)) > >>>> @@ -1806,7 +1812,7 @@ static int > >>> kfd_ioctl_unmap_memory_from_gpu(struct file *filep, > >>>> } > >>>> mutex_unlock(&p->mutex); > >>>> > >>>> - if (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2)) { > >>>> + if (kfd_flush_tlb_after_unmap(dev)) { > >>>> err = amdgpu_amdkfd_gpuvm_sync_memory(dev->adev, > >>>> (struct kgd_mem *) mem, true); > >>>> if (err) {