Re: [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

On 14/11/2023 17:37, Teres Alexis, Alan Previn wrote:
On Tue, 2023-11-14 at 17:27 +0000, Tvrtko Ursulin wrote:
On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
alan: snip

alan: I won't say it's NOT fixing anything - I am saying it's not fixing
this specific bug where we have the outstanding G2H from a context destruction
operation that got dropped. I am okay with dropping this patch in the next rev
but shall post a separate standalone version of Patch3 - because I believe
all other i915 subsystems that take runtime-pm references DO NOT wait forever when
entering suspend - but GT is doing this. This means that if a bug were ever introduced,
it would require serial port or ramoops collection to debug. So I think such a
patch, despite not fixing this specific bug, will be very helpful for debuggability
of future issues. After all, it's better to fail our suspend when we have a bug
rather than to hang the kernel forever.

The issue I have is that I am not seeing how it fails the suspend when
nothing is passed out from *void* wait_for_suspend(gt). To me it looks like the
suspend just carries on. And if so, it sounds dangerous to allow it to
do that with any future/unknown bugs in the suspend sequence. The existing
timeout wedges the GPU (and all that entails). The new code just says "meh,
I'll just carry on regardless".
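
To make the concern above concrete, here is a minimal user-space sketch. It is not i915 code: struct toy_gt, toy_wait_idle(), toy_suspend_void() and toy_suspend_checked() are all hypothetical names invented for illustration. It only contrasts a void wait, whose timeout the caller never sees, with a variant that returns an error the caller could use to abort.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a GT; not the real struct intel_gt. */
struct toy_gt {
	int pending_work;	/* items that must drain before suspend */
	bool suspended;
};

/* Poll up to max_polls times; each poll retires at most 'rate' items. */
static int toy_wait_idle(struct toy_gt *gt, int max_polls, int rate)
{
	for (int i = 0; i < max_polls && gt->pending_work > 0; i++) {
		gt->pending_work -= rate;
		if (gt->pending_work < 0)
			gt->pending_work = 0;
	}
	return gt->pending_work ? -ETIMEDOUT : 0;
}

/* Void variant: the timeout is swallowed and suspend carries on. */
static void toy_suspend_void(struct toy_gt *gt, int max_polls, int rate)
{
	(void)toy_wait_idle(gt, max_polls, rate);
	gt->suspended = true;
}

/* Checked variant: the timeout propagates and suspend is not marked done. */
static int toy_suspend_checked(struct toy_gt *gt, int max_polls, int rate)
{
	int err = toy_wait_idle(gt, max_polls, rate);

	if (err)
		return err;
	gt->suspended = true;
	return 0;
}
```

With a rate of 0 (work never drains), toy_suspend_void() still marks the toy GT as suspended, while toy_suspend_checked() returns -ETIMEDOUT and leaves it untouched. Whether the real code needs an explicit return value depends on whether some other mechanism already reports the failure.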

alan: So I did trace back the gt->wakeref before I posted these patches and
see that within these runtime get/put calls, I believe the first 'get' leads
to __intel_wakeref_get_first, which calls intel_runtime_pm_get via the rpm_get
helper and eventually executes a pm_runtime_get_sync(rpm->kdev); (hanging off
i915_device). (Naturally there is a corresponding mirror for the '_put_last'.)
So this means the first-get and last-put let the kernel know. That's why, when
I tested this patch, it did actually cause the suspend to abort from the kernel side
and the kernel would print a message indicating i915 was the one that didn't
release all refs.
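
The first-get/last-put relationship described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: struct toy_rpm, struct toy_wakeref and the toy_* functions are hypothetical and do not reproduce intel_wakeref or the runtime-pm core, and returning -EBUSY from toy_try_suspend() is a modeling choice rather than the kernel's exact behaviour.

```c
#include <errno.h>

/* Hypothetical model of a device-level runtime-pm usage counter. */
struct toy_rpm {
	int usage_count;
};

/* Hypothetical wakeref: holds one device-level ref while count is non-zero. */
struct toy_wakeref {
	struct toy_rpm *rpm;
	int count;
};

static void toy_wakeref_get(struct toy_wakeref *wf)
{
	/* Only the 0 -> 1 transition takes the device-level reference. */
	if (wf->count++ == 0)
		wf->rpm->usage_count++;
}

static void toy_wakeref_put(struct toy_wakeref *wf)
{
	/* Only the 1 -> 0 transition releases the device-level reference. */
	if (--wf->count == 0)
		wf->rpm->usage_count--;
}

/* Suspend is refused while any device-level reference is still held. */
static int toy_try_suspend(const struct toy_rpm *rpm)
{
	return rpm->usage_count ? -EBUSY : 0;
}
```

In this model a wakeref that is never fully released keeps usage_count at 1, so toy_try_suspend() keeps failing instead of hanging, which matches the abort seen in testing.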

Ah that would be much better then.

Do you know if everything gets resumed/restored correctly in that case, or would we need some additional work to maybe exit early from callers of wait_for_suspend()?

What I would also ask is to see if something like injecting a probing failure is feasible, so we can have this new timeout exit path constantly/regularly tested in CI.
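
As a rough illustration of that idea, the sketch below forces the timeout path with a test-only knob. Everything here is hypothetical (toy_inject_idle_timeout, toy_idle_wait(), toy_suspend_prepare()); it is not an existing i915 parameter or the kernel's fault-injection framework, just the shape such a hook could take.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical test knob; when set, the idle wait always times out. */
static bool toy_inject_idle_timeout;

/* Returns 0 if 'pending' items drain within max_polls, else -ETIMEDOUT. */
static int toy_idle_wait(int pending, int max_polls)
{
	if (toy_inject_idle_timeout)
		return -ETIMEDOUT;	/* forced failure for testing */
	return pending <= max_polls ? 0 : -ETIMEDOUT;
}

/* Caller that has to unwind cleanly when the wait fails. */
static int toy_suspend_prepare(int pending, int max_polls, bool *parked)
{
	int err = toy_idle_wait(pending, max_polls);

	*parked = (err == 0);
	return err;
}
```

A CI test could set the knob, attempt a suspend, check that it fails with the timeout error, then clear the knob and check that a normal suspend/resume cycle still works.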

Regards,

Tvrtko

alan: Anyway, I have pulled this patch out of rev6 of this series and created a
separate standalone patch for this patch #3 that we can review independently.



