Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on Arcturus

Alex Deucher <alexdeucher@xxxxxxxxx> · Tue, 18 Jan 2022 15:15:09 -0500

On Tue, Jan 18, 2022 at 3:13 PM Eric Huang <jinhuieric.huang@xxxxxxx> wrote:
>
> I understand Alex's concern. I think usually we only check the version
> when a feature is only available in a specific version, and other
> version or newer version doesn't have.
>
> In case of FW fix, we assume the driver and FWs have to be synchronous.
> If we have driver backward compatibility for FWs, there must be a lot of
> redundant codes for FW version check. So this patch and SDMA fix will be
> pushed into ROCm 5.1 release branch at the same time.

We need to check.  The firmwares are not distributed in lock step with
the driver.

Alex

>
> Regards,
> Eric
>
> On 2022-01-18 14:35, Alex Deucher wrote:
> > On Tue, Jan 18, 2022 at 2:27 PM Russell, Kent <Kent.Russell@xxxxxxx> wrote:
> >> [AMD Official Use Only]
> >>
> >> I think what he means is that if we are using SDMA v17, this will cause issues, won't it? Should we check that SDMA version is >=18 before enabling it? Or am I misunderstanding the fix?
> > Yes, that was my concern.
> >
> > Alex
> >
> >>   Kent
> >>
> >>> -----Original Message-----
> >>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Eric Huang
> >>> Sent: Tuesday, January 18, 2022 2:09 PM
> >>> To: Alex Deucher <alexdeucher@xxxxxxxxx>
> >>> Cc: amd-gfx list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
> >>> Subject: Re: [PATCH] drm/amdkfd: enable heavy-weight TLB flush on Arcturus
> >>>
> >>> The SDMA fix is generic and not in a specific version of FW, so we don't
> >>> have to check.
> >>>
> >>> Thanks,
> >>> Eric
> >>>
> >>> On 2022-01-18 11:35, Alex Deucher wrote:
> >>>> On Tue, Jan 18, 2022 at 11:16 AM Eric Huang <jinhuieric.huang@xxxxxxx> wrote:
> >>>>> SDMA FW fixes the hang issue for adding heavy-weight TLB
> >>>>> flush on Arcturus, so we can enable it.
> >>>> Do we need to check for a specific fw version?
> >>>>
> >>>> Alex
> >>>>
> >>>>> Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx>
> >>>>> ---
> >>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 8 +++++---
> >>>>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c         | 3 ++-
> >>>>>    2 files changed, 7 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>>>> index a64cbbd943ba..7b24a920c12e 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>>>> @@ -1892,10 +1892,12 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
> >>>>>                                   true);
> >>>>>           ret = unreserve_bo_and_vms(&ctx, false, false);
> >>>>>
> >>>>> -       /* Only apply no TLB flush on Aldebaran to
> >>>>> -        * workaround regressions on other Asics.
> >>>>> +       /* Only apply no TLB flush on Aldebaran and Arcturus
> >>>>> +        * to workaround regressions on other Asics.
> >>>>>            */
> >>>>> -       if (table_freed && (adev->asic_type != CHIP_ALDEBARAN))
> >>>>> +       if (table_freed &&
> >>>>> +           (adev->asic_type != CHIP_ALDEBARAN) &&
> >>>>> +           (adev->asic_type != CHIP_ARCTURUS))
> >>>>>                   *table_freed = true;
> >>>>>
> >>>>>           goto out;
> >>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> >>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> >>>>> index b570c0454ce9..ef4d676cc71c 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> >>>>> @@ -1806,7 +1806,8 @@ static int kfd_ioctl_unmap_memory_from_gpu(struct file
> >>> *filep,
> >>>>>           }
> >>>>>           mutex_unlock(&p->mutex);
> >>>>>
> >>>>> -       if (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2)) {
> >>>>> +       if (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> >>>>> +           KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1)) {
> >>>>>                   err = amdgpu_amdkfd_gpuvm_sync_memory(dev->adev,
> >>>>>                                   (struct kgd_mem *) mem, true);
> >>>>>                   if (err) {
> >>>>> --
> >>>>> 2.25.1
> >>>>>
>