RE: [PATCH] amdgpu: Don't print L2 status if there's nothing to print

"Russell, Kent" <Kent.Russell@xxxxxxx> · Fri, 18 Oct 2024 18:45:32 +0000



[Public]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Sent: Friday, October 18, 2024 2:43 PM
> To: Russell, Kent <Kent.Russell@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Cornwall, Jay <Jay.Cornwall@xxxxxxx>
> Subject: Re: [PATCH] amdgpu: Don't print L2 status if there's nothing to print
>
>
> On 2024-10-18 11:12, Kent Russell wrote:
> > If a 2nd fault comes in before the 1st is handled, the 1st fault will
> > clear out the FAULT STATUS registers before the 2nd fault is handled.
> > Thus we get a lot of zeroes. If status=0, just skip the L2 fault status
> > information, to avoid confusion of why some VM fault status prints in
> > dmesg are all zeroes.
> >
> > Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 10 ++++++++++
> >   drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 10 ++++++++++
> >   drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 10 ++++++++++
> >   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 10 ++++++++++
> >   4 files changed, 40 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> > index 5cf2002fcba8..14a52c33bffa 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> > @@ -175,6 +175,16 @@ static int gmc_v10_0_process_interrupt(struct
> amdgpu_device *adev,
> >                     addr, entry->client_id,
> >                     soc15_ih_clientid_name[entry->client_id]);
> >
> > +   /* If status=0, we don't have anything to process. Most often, this is
> > +    * caused by a 2nd fault coming in before the 1st one is handled.
> > +    * Once the 1st fault is handled, the fault registers are cleared, so
> > +    * we have nothing to print for fault #2. In that case, don't try to
> > +    * print the fault status information. The information above is
> > +    * sufficient to note that another fault occurred.
> > +    */
> > +   if (!status)
> > +           return 0;
>
> I'm not sure if this lengthy comment needs to be duplicated in so many
> places. If you check for status == 0, you also don't need the
> amdgpu_sriov_vf condition below any more, because it doesn't read the
> status register under SRIOV and status is 0-initialized. So you could
> just do this:
>
>       /* Only print L2 fault status if the status register could be read and
>        * contains useful information
>        */
>       if (status != 0)
>               hub->vmhub_funcs->print_l2_protection_fault_status(adev,
>                                                                  status);
Sounds good. I can leave the verbose description in the commit and cut it down in these spots. Thanks!

 Kent
>
> Regards,
>    Felix
>
>
> > +
> >     if (!amdgpu_sriov_vf(adev))
> >             hub->vmhub_funcs->print_l2_protection_fault_status(adev,
> >                                                                status);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > index 4df4d73038f8..c5fe31070cf3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > @@ -144,6 +144,16 @@ static int gmc_v11_0_process_interrupt(struct
> amdgpu_device *adev,
> >             dev_err(adev->dev, "  in page starting at address 0x%016llx from
> client %d\n",
> >                             addr, entry->client_id);
> >
> > +           /* If status=0, we don't have anything to process. Most often, this is
> > +            * caused by a 2nd fault coming in before the 1st one is handled.
> > +            * Once the 1st fault is handled, the fault registers are cleared, so
> > +            * we have nothing to print for fault #2. In that case, don't try to
> > +            * print the fault status information. The information above is
> > +            * sufficient to note that another fault occurred.
> > +            */
> > +           if (!status)
> > +                   return 0;
> > +
> >             if (!amdgpu_sriov_vf(adev))
> >                     hub->vmhub_funcs->print_l2_protection_fault_status(adev,
> status);
> >     }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> > index e33f9e9058cc..fcfe512f1271 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> > @@ -137,6 +137,16 @@ static int gmc_v12_0_process_interrupt(struct
> amdgpu_device *adev,
> >             dev_err(adev->dev, "  in page starting at address 0x%016llx from
> client %d\n",
> >                             addr, entry->client_id);
> >
> > +           /* If status=0, we don't have anything to process. Most often, this is
> > +            * caused by a 2nd fault coming in before the 1st one is handled.
> > +            * Once the 1st fault is handled, the fault registers are cleared, so
> > +            * we have nothing to print for fault #2. In that case, don't try to
> > +            * print the fault status information. The information above is
> > +            * sufficient to note that another fault occurred.
> > +            */
> > +           if (!status)
> > +                   return 0;
> > +
> >             if (!amdgpu_sriov_vf(adev))
> >                     hub->vmhub_funcs->print_l2_protection_fault_status(adev,
> status);
> >     }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > index 010db0e58650..6da2ca28375e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> > @@ -675,6 +675,16 @@ static int gmc_v9_0_process_interrupt(struct
> amdgpu_device *adev,
> >     if (!amdgpu_sriov_vf(adev))
> >             WREG32_P(hub->vm_l2_pro_fault_cntl, 1, ~1);
> >
> > +   /* If status=0, we don't have anything to process. Most often, this is
> > +    * caused by a 2nd fault coming in before the 1st one is handled.
> > +    * Once the 1st fault is handled, the fault registers are cleared, so
> > +    * we have nothing to print for fault #2. In that case, don't try to
> > +    * print the fault status information. The information above is
> > +    * sufficient to note that another fault occurred.
> > +    */
> > +   if (!status)
> > +           return 0;
> > +
> >     amdgpu_vm_update_fault_cache(adev, entry->pasid, addr, status, vmhub);
> >
> >     dev_err(adev->dev,