Re: [REGRESSION] Laptop with Ryzen 4600H fails to resume video since 5.17.4 (works 5.17.3)

Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> · Wed, 25 May 2022 09:29:19 +0200

On 24.05.22 22:54, Christian Casteyde wrote:
> Conclusion from gitlab discussion

Christian, Mario, thx for the updates. I'll remove the issue from the
list of tracked regressions then:

#regzbot invalid: tricky situation with other problems; the issue thus
not really qualifies as regression

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

> There are three different problems at stake here:
> 
> 1. First suspend failure after boot is due to TAD driver.
> When it is loaded, first suspend deep fails.
> This is the root cause of the following sequence.
> TAD seems to be not completely functionnal on my laptop, sysfs returns IO 
> errors (missing handlers in dmesg).
> After the first failure, every "deep" attempt work.
> 
> 2.  When suspend deep fails, elogind (used by Slackware 15) falls back to 
> s2idle.
> This behaviour is documented in logind.conf man page, and normaly can be 
> configured (side note: I didn't manage to do so, it ignores my configuration 
> file).
> 
> 3. When suspend to s2idle, the GPU fails to suspend and need a reset.
> This reset must be done in pm_suspend (not totally ok if reset at resume).
> However, the laptop indeed goes into s2idle.
> In this state, the power button awakes it: this part is handled by the BIOS 
> and not the distribution (which shut downs if not suspended).
> This is what doesn't work anymore un 5.17.4 and 5.18.
> 
> The root cause is therefore the ACPI TAD preventing the first deep suspend to 
> complete, then elogind asking for a s2idle in fallback, then s2idle leaving 
> the APU in inconsistent state, that can only be fixed by a reset in 
> pm_suspend, and not pm_suspend_noirq or pm_suspend_late.
> 
> I will open a separate bug for the ACPI TAD problem. For now I will run 
> without this driver, as deep suspend works fine in this case and s2idle is 
> therefore useless for me.
> 
> CC
> 
> 
> Le lundi 23 mai 2022, 19:03:27 CEST Christian Casteyde a écrit :
>> I've opened the gitlab entry for this discussion:
>> https://gitlab.freedesktop.org/drm/amd/-/issues/2023
>>
>> I confirm I 'm not receiving mails anymore from the mailing list but I'll
>> follow gitlab.
>>
>> In reply to the patch proposed by Mario:
>> https://patchwork.freedesktop.org/patch/486836/
>> With this patch applied on vanilla 5.18 kernel:
>> - suspend still fails;
>> - after suspend attempt, the screen comes back with only the cursor;
>> - switching to a console let me get the following dmesg file.
>>
>> CC
>>
>> Le lundi 23 mai 2022, 15:02:53 CEST Christian Casteyde a écrit :
>>> Hello
>>>
>>> I've checked with 5.18 the problem is still there.
>>> Interestingly, I tried to revert the commit but it was rejected because of
>>>
>>> the change in the test from:
>>>         if (!adev->in_s0ix)
>>>
>>> to:
>>>       if (amdgpu_acpi_should_gpu_reset(adev))
>>>
>>> in amdgpu_pmops_suspend.
>>>
>>> I fixed the rejection, keeping shoud_gpu_reset, but it still fails.
>>> Then I changed to restore test of in_s0ix as it was in 5.17, and it works.
>>> I tried with a call to amd_gpu_asic_reset without testing at all in_s0ix,
>>> it works.
>>>
>>> Therefore, my APU wants a reset in amdgpu_pmops_suspend.
>>>
>>> By curiosity, I tried to do the reset in amdgpu_pmops_suspend_noirq as was
>>> intended in 5.18 original code, commenting out the test of
>>> amdgpu_acpi_should_gpu_reset(adev) (since this APU wants a reset).
>>> This does not work, I got the Fence timeout errors or freezes.
>>>
>>> If I leave  noirq function unchanged (original 5.18 code), and just add a
>>> reset in suspend() as was done in 5.17, it works.
>>>
>>> Therefore, my GPU does NOT want to be reset in noirq, the reset must be in
>>> suspend.
>>>
>>> In other words, I modified amdgpu_pmops_suspend (partial revert) like this
>>> and this works on my laptop:
>>>
>>> static int amdgpu_pmops_suspend(struct device *dev)
>>> {
>>>
>>> 	struct drm_device *drm_dev = dev_get_drvdata(dev);
>>> 	struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>
>>> +	int r;
>>>
>>> 	if (amdgpu_acpi_is_s0ix_active(adev))
>>> 	
>>> 		adev->in_s0ix = true;
>>> 	
>>> 	else
>>> 	
>>> 		adev->in_s3 = true;
>>>
>>> -	return amdgpu_device_suspend(drm_dev, true);
>>> +	r = amdgpu_device_suspend(drm_dev, true);
>>> +	if (r)
>>> +		return r;
>>> +	if (!adev->in_s0ix)
>>> +		return amdgpu_asic_reset(adev);
>>>
>>> 	return 0;
>>>
>>> }
>>>
>>> static int amdgpu_pmops_suspend_noirq(struct device *dev)
>>> {
>>>
>>> 	struct drm_device *drm_dev = dev_get_drvdata(dev);
>>> 	struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>> 	
>>> 	if (amdgpu_acpi_should_gpu_reset(adev))
>>> 	
>>> 		return amdgpu_asic_reset(adev);
>>> 	
>>> 	return 0;
>>>
>>> }
>>>
>>> I don't know if other APU want a reset, in the same context, and how to
>>> differentiate all the cases, so I cannot go further, but I can test
>>> patches
>>> if needed.
>>>
>>> CC
>>>
>>> Le mercredi 18 mai 2022, 08:37:27 CEST Thorsten Leemhuis a écrit :
>>>> On 18.05.22 07:54, Kai-Heng Feng wrote:
>>>>> On Wed, May 18, 2022 at 1:52 PM Thorsten Leemhuis
>>>>>
>>>>> <regressions@xxxxxxxxxxxxx> wrote:
>>>>>> On 17.05.22 19:37, casteyde.christian@xxxxxxx wrote:
>>>>>>> I've tryied to revert the offending commit on 5.18-rc7 (887f75cfd0da
>>>>>>> ("drm/amdgpu: Ensure HDA function is suspended before ASIC reset"),
>>>>>>> and
>>>>>>> the problem disappears so it's really this commit that breaks.
>>>>>>
>>>>>> In that case I'll update the regzbot status to make sure it's visible
>>>>>> as
>>>>>> regression introduced in the 5.18 cycle:
>>>>>>
>>>>>> #regzbot introduced: 887f75cfd0da
>>>>>>
>>>>>> BTW: obviously would be nice to get this fixed before 5.18 is
>>>>>> released
>>>>>> (which might already happen on Sunday), especially as the culprit
>>>>>> apparently was already backported to stable, but I guess that won't
>>>>>> be
>>>>>> easy...
>>>>>>
>>>>>> Which made me wondering: is reverting the culprit temporarily in
>>>>>> mainline (and reapplying it later with a fix) a option here?
>>>>>
>>>>> It's too soon to call it's the culprit.
>>>>
>>>> Well, sure, the root-cause might be somewhere else. But from the point
>>>> of kernel regressions (and tracking them) it's the culprit, as that's
>>>> the change that triggers the misbehavior. And that's how Linus
>>>> approaches these things as well when it comes to reverting to fix
>>>> regressions -- and he even might...
>>>>
>>>>> The suspend on the system
>>>>> doesn't work properly at the first place.
>>>>
>>>> ...ignore things like this, as long as a revert is unlikely to cause
>>>> more damage than good.
>>>>
>>>> Ciao. Thorsten
>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker'
>>>>>> hat)
>>>>>>
>>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of
>>>>>> reports and sometimes miss something important when writing mails
>>>>>> like
>>>>>> this. If that's the case here, don't hesitate to tell me in a public
>>>>>> reply, it's in everyone's interest to set the public record straight.
> 
> 
> 
> 
>