On 11/29/2022 00:43, Tvrtko Ursulin wrote:
On 28/11/2022 16:52, Andrzej Hajda wrote:
In case context is exiting preempt_timeout_ms is used for timeout,
but since introduction of DRM_I915_PREEMPT_TIMEOUT_COMPUTE it increases
to 7.5 seconds. Heartbeat occurs earlier but it is still 2.5s.
Fixes: d7a8680ec9fb21 ("drm/i915: Improve long running compute w/a
for GuC submission")
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2410
Signed-off-by: Andrzej Hajda <andrzej.hajda@xxxxxxxxx>
---
Hi all,
I am not sure what is expected solution here, and if my patch does not
actually reverts intentions of patch d7a8680ec9fb21. Feel free to
propose
something better.
Other alternative would be to increase t/o in IGT tests, but I am not
sure
if this is good direction.
Is it the hack with the FIXME marker from 47daf84a8bfb ("drm/i915:
Make the heartbeat play nice with long pre-emption timeouts") that
actually breaks things? (If IGT modifies the preempt timeout the
heartbeat extension will not work as intended.)
If so, I think we agreed during review that was a weakness which needs
to be addressed, but I would need to re-read the old threads to
remember what was the plan. Regardless what it was it may be time is
now to continue with those improvements.
What is the actual issue? Just that closing contexts are taking forever
to actually close? That would be the whole point of the
'context_is_exiting' patch. Which I still totally disagree with.
If the context is being closed 'gracefully' and it is intended that it
should be allowed time to pre-empt without being killed via an engine
reset then the 7.5s delay is required. That is the officially agreed
upon timeout to allow compute capable contexts to reach a pre-emption
point before they should be killed. If an IGT is failing because it
enforces a shorter timeout then the IGT needs to be updated to account
for the fact that i915 has to support slow compute workloads.
If the context is being closed 'forcefully' and should be killed
immediately then you should be using the 'BANNED_PREEMPT_TIMEOUT' value
not the sysfs/config value.
Regarding heartbeats...
The heartbeat period is 2.5s. But there are up to five heartbeat periods
between the heartbeat starting and it declaring a hang. The patch you
mention also introduced a check on the pre-emption timeout when the last
period starts. If the pre-emption timeout is longer than the heartbeat
period then the last period is extended to guarantee that a full
pre-emption time is granted before declaring the hang.
Are you saying that a heartbeat timeout is occurring and killing the
system? Or are you just worried that something doesn't align correctly?
John.
Regards,
Tvrtko
Regards
Andrzej
---
drivers/gpu/drm/i915/gt/intel_execlists_submission.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 49a8f10d76c77b..bbbbcd9e00f947 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -1248,6 +1248,10 @@ static unsigned long
active_preempt_timeout(struct intel_engine_cs *engine,
/* Force a fast reset for terminated contexts (ignoring sysfs!) */
if (unlikely(intel_context_is_banned(rq->context) ||
bad_request(rq)))
return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
+ else if (unlikely(intel_context_is_exiting(rq->context)))
+ return min_t(typeof(unsigned long),
+ READ_ONCE(engine->props.preempt_timeout_ms),
+ CONFIG_DRM_I915_PREEMPT_TIMEOUT);
return READ_ONCE(engine->props.preempt_timeout_ms);
}