Re: [PATCH v4 3/3] drm/i915/gt: Timeout when waiting for idle in suspending

On Tue, Nov 14, 2023 at 05:27:18PM +0000, Tvrtko Ursulin wrote:
> 
> On 13/11/2023 17:57, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-10-25 at 13:58 +0100, Tvrtko Ursulin wrote:
> > > On 04/10/2023 18:59, Teres Alexis, Alan Previn wrote:
> > > > On Thu, 2023-09-28 at 13:46 +0100, Tvrtko Ursulin wrote:
> > > > > On 27/09/2023 17:36, Teres Alexis, Alan Previn wrote:
> > alan:snip
> > > > > It is not possible to wait for lost G2H in something like
> > > > > intel_uc_suspend() and simply declare "bad things happened" if it times
> > > > > out there, and forcibly clean it all up? (Which would include releasing
> > > > > all the abandoned pm refs, so this patch wouldn't be needed.)
> > > > > 
> > > > alan: I'm not sure intel_uc_suspend should be held up by a gt-level wakeref
> > > > check, unless huc/guc/gsc-uc are the only ones ever taking a gt wakeref.
> > > > 
> > > > What we do know from a uc-perspective:
> > > > - ensure the outstanding guc-related workers are flushed, which we didn't do
> > > > before (addressed by patch #1).
> > > > - any further late H2G-SchedDisable no longer leaks wakerefs when calling H2G
> > > > without realizing it failed (addressed by patch #2).
> > > > - we already "forcibly clean it all up" at the end of intel_uc_suspend
> > > > when we do the guc reset and clean up all guc-ids. (pre-existing upstream code)
> > > > - we previously didn't have a coherent guarantee that "this is the end", i.e. no
> > > > more new requests after intel_uc_suspend. By code logic, we thought we did
> > > > (that's why intel_uc_suspend ends with a guc reset), but we now know otherwise.
> > > > So we fixed that by adding the additional rcu_barrier (also part of patch #2).
> > > 
> > > It is not clear to me from the above if that includes cleaning up the
> > > outstanding CT replies or no. But anyway..
> > alan: Apologies, I should have made it clearer by citing the actual function
> > names and calling out the steps of interest: if you look at the function
> > "intel_gt_suspend_late", it calls "intel_uc_suspend", which in turn calls
> > "intel_guc_suspend", which does a soft reset onto the guc firmware - so after that
> > we can assume all outstanding G2Hs are done. Going back up the stack,
> > intel_gt_suspend_late finally calls gt_sanitize, which calls "intel_uc_reset_prepare",
> > which calls "intel_guc_submission_reset_prepare", which calls
> > "scrub_guc_desc_for_outstanding_g2h", which does what we are looking for, for all
> > types of outstanding G2H. As per prior review comments, besides closing the race
> > condition, we were missing an rcu_barrier (which allowed new requests to come in
> > very late). So we have resolved both in Patch #2.
> 
> Cool, I read this as all known bugs being fixed by the first two patches.
> 
> > > > That said, patch-3 is NOT fixing a bug in guc - it's about "if we ever have
> > > > a future racy gt-wakeref late-leak somewhere - no matter which subsystem
> > > > took it (guc is not the only subsystem taking gt wakerefs), we at
> > > > least don't hang forever in this code". Ofc, based on that, even without
> > > > patch-3 I am confident the issue is resolved anyway.
> > > > So we could just drop patch-3 if you prefer?
> > > 
> > > .. given this it does sound to me that if you are confident patch 3
> > > isn't fixing anything today that it should be dropped.
> > alan: I won't say it's NOT fixing anything - I am saying it's not fixing
> > this specific bug, where we have an outstanding G2H from a context destruction
> > operation that got dropped. I am okay with dropping this patch in the next rev
> > but shall post a separate stand-alone version of Patch3 - because I believe
> > all other i915 subsystems that take runtime-pm's DO NOT wait forever when entering
> > suspend - but GT does. This means that if a bug were ever introduced there,
> > it would require serial port or ramoops collection to debug. So I think such a
> > patch, despite not fixing this specific bug, will be very helpful for the
> > debuggability of future issues. After all, it's better to fail our suspend when
> > we have a bug rather than to hang the kernel forever.
> 
> The issue I have is that I am not seeing how it fails the suspend when nothing
> is passed out from *void* wait_suspend(gt). To me it looks like the suspend just
> carries on. And if so, it sounds dangerous to allow it to do that with any
> future/unknown bugs in the suspend sequence. The existing timeout wedges the GPU
> (and all that entails). The new code just says "meh, I'll just carry on
> regardless".

That's a good point.
Well, the current code is already bad and buggy on suspend-resume. We could get
suspend stuck forever without any clue of what happened.
At least the proposed patch adds a gt_warn. But yes, the right thing
to do would be to abort the suspend entirely in case of timeout, besides the
gt_warn.

> 
> Regards,
> 
> Tvrtko
> 


