On 24.05.22 22:54, Christian Casteyde wrote: > Conclusion from gitlab discussion Christian, Mario, thx for the updates. I'll remove the issue from the list of tracked regressions then: #regzbot invalid: tricky situation with other problems; the issue thus not really qualifies as regression Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight. > There are three different problems at stake here: > > 1. First suspend failure after boot is due to TAD driver. > When it is loaded, first suspend deep fails. > This is the root cause of the following sequence. > TAD seems to be not completely functionnal on my laptop, sysfs returns IO > errors (missing handlers in dmesg). > After the first failure, every "deep" attempt work. > > 2. When suspend deep fails, elogind (used by Slackware 15) falls back to > s2idle. > This behaviour is documented in logind.conf man page, and normaly can be > configured (side note: I didn't manage to do so, it ignores my configuration > file). > > 3. When suspend to s2idle, the GPU fails to suspend and need a reset. > This reset must be done in pm_suspend (not totally ok if reset at resume). > However, the laptop indeed goes into s2idle. > In this state, the power button awakes it: this part is handled by the BIOS > and not the distribution (which shut downs if not suspended). > This is what doesn't work anymore un 5.17.4 and 5.18. > > The root cause is therefore the ACPI TAD preventing the first deep suspend to > complete, then elogind asking for a s2idle in fallback, then s2idle leaving > the APU in inconsistent state, that can only be fixed by a reset in > pm_suspend, and not pm_suspend_noirq or pm_suspend_late. > > I will open a separate bug for the ACPI TAD problem. For now I will run > without this driver, as deep suspend works fine in this case and s2idle is > therefore useless for me. > > CC > > > Le lundi 23 mai 2022, 19:03:27 CEST Christian Casteyde a écrit : >> I've opened the gitlab entry for this discussion: >> https://gitlab.freedesktop.org/drm/amd/-/issues/2023 >> >> I confirm I 'm not receiving mails anymore from the mailing list but I'll >> follow gitlab. >> >> In reply to the patch proposed by Mario: >> https://patchwork.freedesktop.org/patch/486836/ >> With this patch applied on vanilla 5.18 kernel: >> - suspend still fails; >> - after suspend attempt, the screen comes back with only the cursor; >> - switching to a console let me get the following dmesg file. >> >> CC >> >> Le lundi 23 mai 2022, 15:02:53 CEST Christian Casteyde a écrit : >>> Hello >>> >>> I've checked with 5.18 the problem is still there. >>> Interestingly, I tried to revert the commit but it was rejected because of >>> >>> the change in the test from: >>> if (!adev->in_s0ix) >>> >>> to: >>> if (amdgpu_acpi_should_gpu_reset(adev)) >>> >>> in amdgpu_pmops_suspend. >>> >>> I fixed the rejection, keeping shoud_gpu_reset, but it still fails. >>> Then I changed to restore test of in_s0ix as it was in 5.17, and it works. >>> I tried with a call to amd_gpu_asic_reset without testing at all in_s0ix, >>> it works. >>> >>> Therefore, my APU wants a reset in amdgpu_pmops_suspend. >>> >>> By curiosity, I tried to do the reset in amdgpu_pmops_suspend_noirq as was >>> intended in 5.18 original code, commenting out the test of >>> amdgpu_acpi_should_gpu_reset(adev) (since this APU wants a reset). >>> This does not work, I got the Fence timeout errors or freezes. >>> >>> If I leave noirq function unchanged (original 5.18 code), and just add a >>> reset in suspend() as was done in 5.17, it works. >>> >>> Therefore, my GPU does NOT want to be reset in noirq, the reset must be in >>> suspend. >>> >>> In other words, I modified amdgpu_pmops_suspend (partial revert) like this >>> and this works on my laptop: >>> >>> static int amdgpu_pmops_suspend(struct device *dev) >>> { >>> >>> struct drm_device *drm_dev = dev_get_drvdata(dev); >>> struct amdgpu_device *adev = drm_to_adev(drm_dev); >>> >>> + int r; >>> >>> if (amdgpu_acpi_is_s0ix_active(adev)) >>> >>> adev->in_s0ix = true; >>> >>> else >>> >>> adev->in_s3 = true; >>> >>> - return amdgpu_device_suspend(drm_dev, true); >>> + r = amdgpu_device_suspend(drm_dev, true); >>> + if (r) >>> + return r; >>> + if (!adev->in_s0ix) >>> + return amdgpu_asic_reset(adev); >>> >>> return 0; >>> >>> } >>> >>> static int amdgpu_pmops_suspend_noirq(struct device *dev) >>> { >>> >>> struct drm_device *drm_dev = dev_get_drvdata(dev); >>> struct amdgpu_device *adev = drm_to_adev(drm_dev); >>> >>> if (amdgpu_acpi_should_gpu_reset(adev)) >>> >>> return amdgpu_asic_reset(adev); >>> >>> return 0; >>> >>> } >>> >>> I don't know if other APU want a reset, in the same context, and how to >>> differentiate all the cases, so I cannot go further, but I can test >>> patches >>> if needed. >>> >>> CC >>> >>> Le mercredi 18 mai 2022, 08:37:27 CEST Thorsten Leemhuis a écrit : >>>> On 18.05.22 07:54, Kai-Heng Feng wrote: >>>>> On Wed, May 18, 2022 at 1:52 PM Thorsten Leemhuis >>>>> >>>>> <regressions@xxxxxxxxxxxxx> wrote: >>>>>> On 17.05.22 19:37, casteyde.christian@xxxxxxx wrote: >>>>>>> I've tryied to revert the offending commit on 5.18-rc7 (887f75cfd0da >>>>>>> ("drm/amdgpu: Ensure HDA function is suspended before ASIC reset"), >>>>>>> and >>>>>>> the problem disappears so it's really this commit that breaks. >>>>>> >>>>>> In that case I'll update the regzbot status to make sure it's visible >>>>>> as >>>>>> regression introduced in the 5.18 cycle: >>>>>> >>>>>> #regzbot introduced: 887f75cfd0da >>>>>> >>>>>> BTW: obviously would be nice to get this fixed before 5.18 is >>>>>> released >>>>>> (which might already happen on Sunday), especially as the culprit >>>>>> apparently was already backported to stable, but I guess that won't >>>>>> be >>>>>> easy... >>>>>> >>>>>> Which made me wondering: is reverting the culprit temporarily in >>>>>> mainline (and reapplying it later with a fix) a option here? >>>>> >>>>> It's too soon to call it's the culprit. >>>> >>>> Well, sure, the root-cause might be somewhere else. But from the point >>>> of kernel regressions (and tracking them) it's the culprit, as that's >>>> the change that triggers the misbehavior. And that's how Linus >>>> approaches these things as well when it comes to reverting to fix >>>> regressions -- and he even might... >>>> >>>>> The suspend on the system >>>>> doesn't work properly at the first place. >>>> >>>> ...ignore things like this, as long as a revert is unlikely to cause >>>> more damage than good. >>>> >>>> Ciao. Thorsten >>>> >>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' >>>>>> hat) >>>>>> >>>>>> P.S.: As the Linux kernel's regression tracker I deal with a lot of >>>>>> reports and sometimes miss something important when writing mails >>>>>> like >>>>>> this. If that's the case here, don't hesitate to tell me in a public >>>>>> reply, it's in everyone's interest to set the public record straight. > > > > >