RE: [REGRESSION] Laptop with Ryzen 4600H fails to resume video since 5.17.4 (works 5.17.3)

"Limonciello, Mario" <Mario.Limonciello@xxxxxxx> · Mon, 23 May 2022 14:00:46 +0000

[Public]

> -----Original Message-----
> From: Christian Casteyde <casteyde.christian@xxxxxxx>
> Sent: Monday, May 23, 2022 08:03
> To: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>; Thorsten Leemhuis
> <regressions@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx; regressions@xxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>; gregkh@xxxxxxxxxxxxxxxxxxx; Limonciello,
> Mario <Mario.Limonciello@xxxxxxx>
> Subject: Re: [REGRESSION] Laptop with Ryzen 4600H fails to resume video since
> 5.17.4 (works 5.17.3)
> 
> Hello
> 
> I've checked with 5.18 the problem is still there.
> Interestingly, I tried to revert the commit but it was rejected because of the
> change in the test from:
>         if (!adev->in_s0ix)
> to:
>       if (amdgpu_acpi_should_gpu_reset(adev))
> 
> in amdgpu_pmops_suspend.
> 
> I fixed the rejection, keeping shoud_gpu_reset, but it still fails.
> Then I changed to restore test of in_s0ix as it was in 5.17, and it works.
> I tried with a call to amd_gpu_asic_reset without testing at all in_s0ix, it
> works.
> 
> Therefore, my APU wants a reset in amdgpu_pmops_suspend.
> 
> By curiosity, I tried to do the reset in amdgpu_pmops_suspend_noirq as was
> intended in 5.18 original code, commenting out the test of
> amdgpu_acpi_should_gpu_reset(adev) (since this APU wants a reset).
> This does not work, I got the Fence timeout errors or freezes.
> 
> If I leave  noirq function unchanged (original 5.18 code), and just add a
> reset in suspend() as was done in 5.17, it works.
> 
> Therefore, my GPU does NOT want to be reset in noirq, the reset must be in
> suspend.
> 
> In other words, I modified amdgpu_pmops_suspend (partial revert) like this and
> this works on my laptop:
> 
> static int amdgpu_pmops_suspend(struct device *dev)
> {
> 	struct drm_device *drm_dev = dev_get_drvdata(dev);
> 	struct amdgpu_device *adev = drm_to_adev(drm_dev);
> +	int r;
> 
> 	if (amdgpu_acpi_is_s0ix_active(adev))
> 		adev->in_s0ix = true;
> 	else
> 		adev->in_s3 = true;
> -	return amdgpu_device_suspend(drm_dev, true);
> +	r = amdgpu_device_suspend(drm_dev, true);
> +	if (r)
> +		return r;
> +	if (!adev->in_s0ix)
> +		return amdgpu_asic_reset(adev);
> 	return 0;
> }
> 
> static int amdgpu_pmops_suspend_noirq(struct device *dev)
> {
> 	struct drm_device *drm_dev = dev_get_drvdata(dev);
> 	struct amdgpu_device *adev = drm_to_adev(drm_dev);
> 
> 	if (amdgpu_acpi_should_gpu_reset(adev))
> 		return amdgpu_asic_reset(adev);
> 
> 	return 0;
> }
> 
> I don't know if other APU want a reset, in the same context, and how to
> differentiate all the cases, so I cannot go further, but I can test patches if
> needed.

The core of this problem is that your first suspend fails and the GPU is in a bad
state for the next suspend.

I'm not sure why you're ignoring my other emails, but I did suggest a different
approach in this thread here:

https://patchwork.freedesktop.org/patch/486836/

Thanks,

> 
> CC
> 
> Le mercredi 18 mai 2022, 08:37:27 CEST Thorsten Leemhuis a écrit :
> > On 18.05.22 07:54, Kai-Heng Feng wrote:
> > > On Wed, May 18, 2022 at 1:52 PM Thorsten Leemhuis
> > >
> > > <regressions@xxxxxxxxxxxxx> wrote:
> > >> On 17.05.22 19:37, casteyde.christian@xxxxxxx wrote:
> > >>> I've tryied to revert the offending commit on 5.18-rc7 (887f75cfd0da
> > >>> ("drm/amdgpu: Ensure HDA function is suspended before ASIC reset"), and
> > >>> the problem disappears so it's really this commit that breaks.
> > >>
> > >> In that case I'll update the regzbot status to make sure it's visible as
> > >> regression introduced in the 5.18 cycle:
> > >>
> > >> #regzbot introduced: 887f75cfd0da
> > >>
> > >> BTW: obviously would be nice to get this fixed before 5.18 is released
> > >> (which might already happen on Sunday), especially as the culprit
> > >> apparently was already backported to stable, but I guess that won't be
> > >> easy...
> > >>
> > >> Which made me wondering: is reverting the culprit temporarily in
> > >> mainline (and reapplying it later with a fix) a option here?
> > >
> > > It's too soon to call it's the culprit.
> >
> > Well, sure, the root-cause might be somewhere else. But from the point
> > of kernel regressions (and tracking them) it's the culprit, as that's
> > the change that triggers the misbehavior. And that's how Linus
> > approaches these things as well when it comes to reverting to fix
> > regressions -- and he even might...
> >
> > > The suspend on the system
> > > doesn't work properly at the first place.
> >
> > ...ignore things like this, as long as a revert is unlikely to cause
> > more damage than good.
> >
> > Ciao. Thorsten
> >
> > >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > >>
> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> > >> reports and sometimes miss something important when writing mails like
> > >> this. If that's the case here, don't hesitate to tell me in a public
> > >> reply, it's in everyone's interest to set the public record straight.
> 
> 
>