Hello I've checked with 5.18 the problem is still there. Interestingly, I tried to revert the commit but it was rejected because of the change in the test from: if (!adev->in_s0ix) to: if (amdgpu_acpi_should_gpu_reset(adev)) in amdgpu_pmops_suspend. I fixed the rejection, keeping shoud_gpu_reset, but it still fails. Then I changed to restore test of in_s0ix as it was in 5.17, and it works. I tried with a call to amd_gpu_asic_reset without testing at all in_s0ix, it works. Therefore, my APU wants a reset in amdgpu_pmops_suspend. By curiosity, I tried to do the reset in amdgpu_pmops_suspend_noirq as was intended in 5.18 original code, commenting out the test of amdgpu_acpi_should_gpu_reset(adev) (since this APU wants a reset). This does not work, I got the Fence timeout errors or freezes. If I leave noirq function unchanged (original 5.18 code), and just add a reset in suspend() as was done in 5.17, it works. Therefore, my GPU does NOT want to be reset in noirq, the reset must be in suspend. In other words, I modified amdgpu_pmops_suspend (partial revert) like this and this works on my laptop: static int amdgpu_pmops_suspend(struct device *dev) { struct drm_device *drm_dev = dev_get_drvdata(dev); struct amdgpu_device *adev = drm_to_adev(drm_dev); + int r; if (amdgpu_acpi_is_s0ix_active(adev)) adev->in_s0ix = true; else adev->in_s3 = true; - return amdgpu_device_suspend(drm_dev, true); + r = amdgpu_device_suspend(drm_dev, true); + if (r) + return r; + if (!adev->in_s0ix) + return amdgpu_asic_reset(adev); return 0; } static int amdgpu_pmops_suspend_noirq(struct device *dev) { struct drm_device *drm_dev = dev_get_drvdata(dev); struct amdgpu_device *adev = drm_to_adev(drm_dev); if (amdgpu_acpi_should_gpu_reset(adev)) return amdgpu_asic_reset(adev); return 0; } I don't know if other APU want a reset, in the same context, and how to differentiate all the cases, so I cannot go further, but I can test patches if needed. CC Le mercredi 18 mai 2022, 08:37:27 CEST Thorsten Leemhuis a écrit : > On 18.05.22 07:54, Kai-Heng Feng wrote: > > On Wed, May 18, 2022 at 1:52 PM Thorsten Leemhuis > > > > <regressions@xxxxxxxxxxxxx> wrote: > >> On 17.05.22 19:37, casteyde.christian@xxxxxxx wrote: > >>> I've tryied to revert the offending commit on 5.18-rc7 (887f75cfd0da > >>> ("drm/amdgpu: Ensure HDA function is suspended before ASIC reset"), and > >>> the problem disappears so it's really this commit that breaks. > >> > >> In that case I'll update the regzbot status to make sure it's visible as > >> regression introduced in the 5.18 cycle: > >> > >> #regzbot introduced: 887f75cfd0da > >> > >> BTW: obviously would be nice to get this fixed before 5.18 is released > >> (which might already happen on Sunday), especially as the culprit > >> apparently was already backported to stable, but I guess that won't be > >> easy... > >> > >> Which made me wondering: is reverting the culprit temporarily in > >> mainline (and reapplying it later with a fix) a option here? > > > > It's too soon to call it's the culprit. > > Well, sure, the root-cause might be somewhere else. But from the point > of kernel regressions (and tracking them) it's the culprit, as that's > the change that triggers the misbehavior. And that's how Linus > approaches these things as well when it comes to reverting to fix > regressions -- and he even might... > > > The suspend on the system > > doesn't work properly at the first place. > > ...ignore things like this, as long as a revert is unlikely to cause > more damage than good. > > Ciao. Thorsten > > >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > >> > >> P.S.: As the Linux kernel's regression tracker I deal with a lot of > >> reports and sometimes miss something important when writing mails like > >> this. If that's the case here, don't hesitate to tell me in a public > >> reply, it's in everyone's interest to set the public record straight.