Re: [Intel-gfx] [PATCH 0/3] Improve anti-pre-emption w/a for compute workloads

John Harrison <john.c.harrison@xxxxxxxxx> · Thu, 24 Feb 2022 12:02:54 -0800

On 2/23/2022 04:00, Tvrtko Ursulin wrote:
On 23/02/2022 02:22, John Harrison wrote:
On 2/22/2022 01:53, Tvrtko Ursulin wrote:
On 18/02/2022 21:33, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout

Can we have a feature request to allow asking GuC for an engine reset?
For what purpose?

To allow "stopped heartbeat" to reset the engine, however..

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might be 
executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given engine 
at any given time.

There is a reason why hang detection should be left to the entity 
that is doing the scheduling. Any other entity is second guessing at 
best.

The reason for keeping the heartbeat around even when GuC submission 
is enabled is for the case where the KMD/GuC have got out of sync 
with either other somehow or GuC itself has just crashed. I.e. when 
no submission at all is working and we need to reset the GuC itself 
and start over.

.. I wasn't really up to speed to know/remember heartbeats are nerfed 
already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by GuC 
because GuC handles the scheduling. You can't do the former if you 
aren't doing the latter. However, the heartbeat is still present and is 
still the watchdog by which engine resets are triggered. As per the rest 
of the submission process, the hang detection and recovery is split 
between i915 and GuC.

I am not sure it was the best way since full reset penalizes everyone 
for one hanging engine. Better question would be why leave heartbeats 
around to start with with GuC? If you want to use it to health check 
GuC, as you say, maybe just send a CT message and expect replies? Then 
full reset would make sense. It also achieves the goal of not 
seconding guessing the submission backend you raise.
Adding yet another hang detection mechanism is yet more complication and 
is unnecessary when we already have one that works well enough. As 
above, the heartbeat is still required for sending the pulses that cause 
pre-emptions and so let GuC detect hangs. It also provides a fallback 
against a dead GuC by default. So why invent yet another wheel?

Like it is now, and the need for this series demonstrates it, the 
whole thing has a pretty poor "impedance" match. Not even sure what 
intel_guc_find_hung_context is doing in intel_engine_hearbeat.c - why 
is that not in intel_gt_handle_error at least? Why is hearbeat code 
special and other callers of intel_gt_handle_error don't need it?
There is no guilty context if the reset was triggered via debugfs or 
similar. And as stated ad nauseam, i915 is no longer handling the 
scheduling and so cannot make assumptions about what is or is not 
running on the hardware at any given time. And obviously, if the reset 
initiated by GuC itself then i915 should not be second guessing the 
guilty context as the GuC notification has already told us who was 
responsible.

And to be clear, the 'poor impedance match' is purely because we don't 
have mid-thread pre-emption and so need a stupidly huge timeout on 
compute capable engines. Whereas, we don't want a total heatbeat timeout 
of a minute or more. That is the impedance mis-match. If the 640ms was 
acceptable for RCS then none of this hacky timeout algorithm mush would 
be needed.

John.

Regards,

Tvrtko