On 23/02/2022 02:22, John Harrison wrote:
On 2/22/2022 01:53, Tvrtko Ursulin wrote:
On 18/02/2022 21:33, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>
Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout
Can we have a feature request to allow asking GuC for an engine reset?
For what purpose?
To allow "stopped heartbeat" to reset the engine, however..
GuC manages the scheduling of contexts across engines. With virtual
engines, the KMD has no knowledge of which engine a context might be
executing on. Even without virtual engines, the KMD still has no
knowledge of which context is currently executing on any given engine at
any given time.
There is a reason why hang detection should be left to the entity that
is doing the scheduling. Any other entity is second guessing at best.
The reason for keeping the heartbeat around even when GuC submission is
enabled is for the case where the KMD/GuC have got out of sync with
either other somehow or GuC itself has just crashed. I.e. when no
submission at all is working and we need to reset the GuC itself and
start over.
.. I wasn't really up to speed to know/remember heartbeats are nerfed
already in GuC mode.
I am not sure it was the best way since full reset penalizes everyone
for one hanging engine. Better question would be why leave heartbeats
around to start with with GuC? If you want to use it to health check
GuC, as you say, maybe just send a CT message and expect replies? Then
full reset would make sense. It also achieves the goal of not seconding
guessing the submission backend you raise.
Like it is now, and the need for this series demonstrates it, the whole
thing has a pretty poor "impedance" match. Not even sure what
intel_guc_find_hung_context is doing in intel_engine_hearbeat.c - why is
that not in intel_gt_handle_error at least? Why is hearbeat code special
and other callers of intel_gt_handle_error don't need it?
Regards,
Tvrtko