On Tue, 07 Jun 2022 16:15:19 -0700, John Harrison wrote:
>
> On 6/7/2022 15:29, Dixit, Ashutosh wrote:
> > On Sat, 14 May 2022 23:05:06 -0700, Vinay Belgaumkar wrote:
> >> SLPC min/max frequency updates require H2G calls. We are seeing
> >> timeouts when the GuC channel is backed up and it is unable to
> >> respond in a timely fashion, causing warnings and affecting CI.
> >>
> >> This is seen when waitboosting happens during a stress test.
> >> This patch updates the waitboost path to use a non-blocking
> >> H2G call instead, which returns as soon as the message is
> >> successfully transmitted.
> > Overall I think this patch is trying to paper over problems in the
> > blocking H2G CT interface (specifically the 1 second timeout in
> > wait_for_ct_request_update()). So I think we should address that
> > problem in the interface directly rather than having each client (SLPC
> > and any future client) work around the problem. Following points:
> >
> > 1. This patch seems to assume that it is 'ok' to ignore the return code
> >    from FW for a waitboost request (arguing waitboost is best effort so
> >    it's ok to 'fire and forget'). But the return code is still useful,
> >    e.g. in cases where we see performance issues and want to go back and
> >    investigate if FW rejected any waitboost requests.
> You still get errors reported in the GuC log. Indeed, some errors (or at
> least error reasons) are only visible in the log, not in the return code.

OK, so we at least have this method available for debugging.

> > 2. We are already seeing that a 1 second timeout is not sufficient. So
> >    why not simply increase that timeout?
> >
> > 3. In fact, if we are saying that the CT interface is a "reliable"
> >    interface (implying no message loss), then to ensure reliability that
> >    timeout should not simply be increased, it should be made "infinite"
> >    (in quotes).
> >
> > 4. Maybe it would have been best not to have a "blocking" H2G interface
> >    at all (with the wait in wait_for_ct_request_update()). Just have an
> >    asynchronous interface (which mirrors the actual interface between FW
> >    and i915) in which clients register callbacks that are invoked when
> >    FW responds. If this is too big a change we can probably continue
> >    with the current blocking interface after increasing the timeout as
> >    mentioned above.
> >
> > 5. Finally, the waitboost request is just the most likely to get stuck
> >    at the back of a full CT queue since it happens during normal
> >    operation. Actually any request, say one initiated from sysfs, can
> >    also get similarly stuck at the back of a full queue. So any solution
> >    should also address that situation (where the return code is needed,
> >    and similarly for a future client of the "blocking"
> >    (REQUEST/RESPONSE) interface).
> The blocking interface is only intended for init time operations, not
> runtime.

In that case we should probably have code to enforce this in i915.

> Stuff where the operation is meant to be synchronous and the KMD should
> not proceed until it has an ack back from the GuC that the update has
> taken place. All runtime operations are expected to be asynchronous. If a
> response is required, then it should be sent via an async callback.
> E.g. context de-registration is a 'fire and forget' H2G call but gets a
> 'deregistration complete' G2H notification when it is safe for the KMD to
> free up the associated storage.

At present all GuC interactions in intel_guc_slpc.c (in i915) do *not*
follow this. They use the REQUEST/RESPONSE FW interface which is pushed
through the blocking H2G CT interface in i915. If we are serious about
this, it needs a GuC FW change to use the bi-directional EVENTs of the
asynchronous interface (with corresponding changes in intel_guc_slpc.c).
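For concreteness, as I read it, the patch essentially just swaps the
blocking send for a non-blocking one on the waitboost path, along these
lines (a rough sketch based on my reading of intel_guc_slpc.c and
intel_guc.h, so treat the helper name and exact details as illustrative
rather than as what the patch finally merges):

static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
{
        u32 request[] = {
                GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
                SLPC_EVENT(SLPC_PARAM_OP_SET, 2),
                id,
                value,
        };
        int ret;

        /*
         * Non-blocking send: returns once the request has been queued in
         * the CT buffer, without waiting for the GuC to process it, so a
         * backed up channel cannot stall the waitboost path for up to a
         * second.
         */
        ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);

        return ret > 0 ? -EPROTO : ret;
}

Which is exactly point 1 above: the return code here only tells us whether
the message could be queued in the CT buffer; any rejection of the request
by FW is then only visible in the GuC log.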
> There is an 'errors only' H2G mechanism. That will not send an ack back
> in the case of a successful H2G but will send back an error notification
> in the case of a failure. All async H2Gs should really be using that
> mechanism. I think Michal W did post a patch for it and I was meant to be
> reviewing it but it dropped off my radar due to other higher priorities.

I believe these are referred to as FAST_REQUESTs in GuC FW.

The fact that success is not communicated back to the KMD might be an
issue in cases where the KMD needs to know whether a particular operation
was successful (such as for operations initiated via sysfs).

Thanks.
--
Ashutosh