On Tue, 07 Jun 2022 16:15:19 -0700, John Harrison wrote:
>
> On 6/7/2022 15:29, Dixit, Ashutosh wrote:
> > On Sat, 14 May 2022 23:05:06 -0700, Vinay Belgaumkar wrote:
> >> SLPC min/max frequency updates require H2G calls. We are seeing
> >> timeouts when the GuC channel is backed up and it is unable to
> >> respond in a timely fashion, causing warnings and affecting CI.
> >>
> >> This is seen when waitboosting happens during a stress test.
> >> This patch updates the waitboost path to use a non-blocking
> >> H2G call instead, which returns as soon as the message is
> >> successfully transmitted.
> > Overall I think this patch is trying to paper over problems in the
> > blocking H2G CT interface (specifically the 1 second timeout in
> > wait_for_ct_request_update()). So I think we should address that
> > problem in the interface directly rather than having each client (SLPC
> > and any future client) work around the problem. Following points:
> >
> > 1. This patch seems to assume that it is 'ok' to ignore the return code
> >    from FW for a waitboost request (arguing waitboost is best effort so
> >    it's ok to 'fire and forget'). But the return code is still useful,
> >    e.g. in cases where we see performance issues and want to go back and
> >    investigate if FW rejected any waitboost requests.
> You still get errors reported in the GuC log. Indeed, some errors (or at
> least error reasons) are only visible in the log, not in the return code.

OK, so we at least have this method available for debugging.

> > 2. We are already seeing that a 1 second timeout is not sufficient. So
> >    why not simply increase that timeout?
> >
> > 3. In fact, if we are saying that the CT interface is a "reliable"
> >    interface (implying no message loss), then to ensure reliability that
> >    timeout should not simply be increased, it should be made "infinite"
> >    (in quotes).
> >
> > 4. Maybe it would have been best not to have a "blocking" H2G interface
> >    at all (with the wait in wait_for_ct_request_update()). Just have an
> >    asynchronous interface (which mirrors the actual interface between FW
> >    and i915) in which clients register callbacks that are invoked when
> >    FW responds. If this is too big a change we can probably continue
> >    with the current blocking interface after increasing the timeout as
> >    mentioned above.
> >
> > 5. Finally, the waitboost request is just the most likely to get stuck
> >    at the back of a full CT queue since it happens during normal
> >    operation. Actually any request, say one initiated from sysfs, can
> >    also get similarly stuck at the back of a full queue. So any solution
> >    should also address that situation (where the return code is needed,
> >    and similarly for a future client of the "blocking"
> >    (REQUEST/RESPONSE) interface).
> The blocking interface is only intended for init time operations, not
> runtime.

In that case we should probably have code to enforce this in i915.

> Stuff where the operation is meant to be synchronous and the KMD should
> not proceed until it has an ack back from the GuC that the update has
> taken place. All runtime operations are expected to be asynchronous. If a
> response is required, then it should be sent via an async callback.
> E.g. context de-registration is a 'fire and forget' H2G call but gets a
> 'deregistration complete' G2H notification when it is safe for the KMD to
> free up the associated storage.

At present all GuC interactions in intel_guc_slpc.c (in i915) do *not*
follow this. They use the REQUEST/RESPONSE FW interface which is pushed
through the blocking H2G CT interface in i915. If we are serious about
this, it needs a GuC FW change to use the bi-directional EVENTs of the
asynchronous interface (with corresponding changes in intel_guc_slpc.c).
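For concreteness, as I read it, the patch essentially just swaps the
blocking send for a non-blocking one on the waitboost path, along these
lines (a rough sketch based on my reading of intel_guc_slpc.c and
intel_guc.h, so treat the helper name and exact details as illustrative
rather than as what the patch finally merges):

static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
{
        u32 request[] = {
                GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
                SLPC_EVENT(SLPC_PARAM_OP_SET, 2),
                id,
                value,
        };
        int ret;

        /*
         * Non-blocking send: returns once the request has been queued in
         * the CT buffer, without waiting for the GuC to process it, so a
         * backed up channel cannot stall the waitboost path for up to a
         * second.
         */
        ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);

        return ret > 0 ? -EPROTO : ret;
}

Which is exactly point 1 above: the return code here only tells us whether
the message could be queued in the CT buffer; any rejection of the request
by FW is then only visible in the GuC log.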
> There is an 'errors only' H2G mechanism. That will not send an ack back
> in the case of a successful H2G but will send back an error notification
> in the case of a failure. All async H2Gs should really be using that
> mechanism. I think Michal W did post a patch for it and I was meant to be
> reviewing it but it dropped off my radar due to other higher priorities.

I believe these are referred to as FAST_REQUESTs in GuC FW.

The fact that success is not communicated back to the KMD might be an
issue in cases where the KMD needs to know whether a particular operation
was successful (such as for operations initiated via sysfs).

Thanks.
--
Ashutosh