RE: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

"Liang, Prike" <Prike.Liang@xxxxxxx> · Fri, 26 Jan 2024 08:04:04 +0000

[AMD Official Use Only - General]

> From: Alex Deucher <alexdeucher@xxxxxxxxx>
> Sent: Friday, January 26, 2024 6:57 AM
> To: Liang, Prike <Prike.Liang@xxxxxxx>
> Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; Sharma, Deepak
> <Deepak.Sharma@xxxxxxx>
> Subject: Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case
>
> On Wed, Jan 24, 2024 at 11:11 PM Prike Liang <Prike.Liang@xxxxxxx> wrote:
> >
> > In the pm abort case the gfx power rail not turn off from FCH side and
> > this will lead to the gfx reinitialized failed base on the unknown gfx
> > HW status, so let's reset the gpu to a known good power state.
> >
> > Signed-off-by: Prike Liang <Prike.Liang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
> >  drivers/gpu/drm/amd/amdgpu/soc15.c         | 8 +++++++-
> >  2 files changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 56d9dfa61290..4c40ffaaa5c2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4627,6 +4627,11 @@ int amdgpu_device_resume(struct drm_device
> *dev, bool fbcon)
> >                         return r;
> >         }
> >
> > +       if(amdgpu_asic_need_reset_on_init(adev)) {
>
> space between if and (.
>
> Also, I think we need to check that we are not in S0ix as well otherwise I think
> we'll always reset in S0ix.  We could probably do away with the GPU reset in
> the suspend_noirq callback with this change, but maybe make that a separate
> follow up patch.
>
> Alex
>
[Liang, Prike] Yes, there needn't reset GPU for the s0ix suspend abort case and s0ix suspension was already
filtered out in the need_reset_on_init() at the end of this patch. Regards to reset GPU at suspend_noirq callback,
there already sort out some cases that require to reset at amdgpu_acpi_should_gpu_reset() and if do the reset
at the no_irq suspend phase then will miss the first suspend abort case. IMO, we need filer out the following two cases and update the
patch to be a more generic solution accordingly.

1) The PM suspend abort case which is exit PM suspension process before runs into no_irq suspend phase.
2)  The PM suspend completely but the GPU doesn't power off and PSP TOS is alive at the amdgpu resume early beginning and this case
happed at passthrough device suspension at Xen guest side.

> > +               DRM_INFO("PM abort case and let's reset asic \n");
> > +               amdgpu_asic_reset(adev);
> > +       }
> > +
> >         if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> >                 return 0;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c
> > b/drivers/gpu/drm/amd/amdgpu/soc15.c
> > index 15033efec2ba..9329a00b6abc 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> > @@ -804,9 +804,16 @@ static bool soc15_need_reset_on_init(struct
> amdgpu_device *adev)
> >         if (adev->asic_type == CHIP_RENOIR)
> >                 return true;
> >
> > +       sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> > +
> >         /* Just return false for soc15 GPUs.  Reset does not seem to
> >          * be necessary.
> >          */
> > +       if (adev->in_suspend && !adev->in_s0ix &&
> > +                       !adev->pm_complete &&
> > +                       sol_reg)
> > +               return true;
> > +
> >         if (!amdgpu_passthrough(adev))
> >                 return false;
> >
> > @@ -816,7 +823,6 @@ static bool soc15_need_reset_on_init(struct
> amdgpu_device *adev)
> >         /* Check sOS sign of life register to confirm sys driver and sOS
> >          * are already been loaded.
> >          */
> > -       sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> >         if (sol_reg)
> >                 return true;
> >
> > --
> > 2.34.1
> >