On 2/25/2022 08:36, Tvrtko Ursulin wrote:
On 24/02/2022 20:02, John Harrison wrote:
On 2/23/2022 04:00, Tvrtko Ursulin wrote:
On 23/02/2022 02:22, John Harrison wrote:
On 2/22/2022 01:53, Tvrtko Ursulin wrote:
On 18/02/2022 21:33, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>
Compute workloads are inherently not pre-emptible on current
hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the
heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset
and so
is much more destructive. Instead, just bump the pre-emption timeout
Can we have a feature request to allow asking GuC for an engine
reset?
For what purpose?
To allow "stopped heartbeat" to reset the engine, however..
GuC manages the scheduling of contexts across engines. With virtual
engines, the KMD has no knowledge of which engine a context might
be executing on. Even without virtual engines, the KMD still has no
knowledge of which context is currently executing on any given
engine at any given time.
There is a reason why hang detection should be left to the entity
that is doing the scheduling. Any other entity is second guessing
at best.
The reason for keeping the heartbeat around even when GuC
submission is enabled is for the case where the KMD/GuC have got
out of sync with either other somehow or GuC itself has just
crashed. I.e. when no submission at all is working and we need to
reset the GuC itself and start over.
.. I wasn't really up to speed to know/remember heartbeats are
nerfed already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by
GuC because GuC handles the scheduling. You can't do the former if
you aren't doing the latter. However, the heartbeat is still present
and is still the watchdog by which engine resets are triggered. As
per the rest of the submission process, the hang detection and
recovery is split between i915 and GuC.
I meant that "stopped heartbeat on engine XXX" can only do a full GPU
reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 is
not handling the recovery part of the process.
intel_gt_handle_error(engine->gt, engine->mask,
I915_ERROR_CAPTURE,
"stopped heartbeat on %s",
engine->name);
intel_gt_handle_error:
/*
* Try engine reset when available. We fall back to full reset if
* single reset fails.
*/
if (!intel_uc_uses_guc_submission(>->uc) &&
intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
local_bh_disable();
for_each_engine_masked(engine, gt, engine_mask, tmp) {
You said "However, the heartbeat is still present and is still the
watchdog by which engine resets are triggered", now I don't know what
you meant by this. It actually triggers a single engine reset in GuC
mode? Where in code does that happen if this block above shows it not
taking the engine reset path?
i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine
The fundamental process is exactly the same as in execlist mode. It's
just that the above blocks of code (calls to intel_gt_handle_error and
such) are now inside the GuC not i915.
Without the heartbeat going ping, there would be no context switching
and thus no pre-emption, no pre-emption timeout and so no hang and reset
recovery. And GuC cannot sent pulses by itself - it has no ability to
generate context workloads. So we need i915 to send the pings and to
gradually raise their priority. But the back half of the heartbeat code
is now inside the GuC. It will simply never reach the i915 side timeout
if GuC is doing the recovery (unless the heartbeat's final period is too
short). We should only reach the i915 side timeout if GuC itself is
toast. At which point we need the full GT reset to recover the GuC.
John.
I am not sure it was the best way since full reset penalizes
everyone for one hanging engine. Better question would be why leave
heartbeats around to start with with GuC? If you want to use it to
health check GuC, as you say, maybe just send a CT message and
expect replies? Then full reset would make sense. It also achieves
the goal of not seconding guessing the submission backend you raise.
Adding yet another hang detection mechanism is yet more complication
and is unnecessary when we already have one that works well enough.
As above, the heartbeat is still required for sending the pulses that
cause pre-emptions and so let GuC detect hangs. It also provides a
fallback against a dead GuC by default. So why invent yet another wheel?
Lets first clarify the previous block to make sure there aren't any
misunderstandings there.
Regards,
Tvrtko
Like it is now, and the need for this series demonstrates it, the
whole thing has a pretty poor "impedance" match. Not even sure what
intel_guc_find_hung_context is doing in intel_engine_hearbeat.c -
why is that not in intel_gt_handle_error at least? Why is hearbeat
code special and other callers of intel_gt_handle_error don't need it?
There is no guilty context if the reset was triggered via debugfs or
similar. And as stated ad nauseam, i915 is no longer handling the
scheduling and so cannot make assumptions about what is or is not
running on the hardware at any given time. And obviously, if the
reset initiated by GuC itself then i915 should not be second guessing
the guilty context as the GuC notification has already told us who
was responsible.
And to be clear, the 'poor impedance match' is purely because we
don't have mid-thread pre-emption and so need a stupidly huge timeout
on compute capable engines. Whereas, we don't want a total heatbeat
timeout of a minute or more. That is the impedance mis-match. If the
640ms was acceptable for RCS then none of this hacky timeout
algorithm mush would be needed.
John.
Regards,
Tvrtko