On 03/10/2022 08:53, Tvrtko Ursulin wrote:
On 30/09/2022 18:44, John Harrison wrote:
On 9/30/2022 02:22, Tvrtko Ursulin wrote:
On 29/09/2022 17:21, John Harrison wrote:
On 9/29/2022 00:42, Tvrtko Ursulin wrote:
On 29/09/2022 03:18, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>
Compute workloads are inherently not pre-emptible for long periods on
current hardware. As a workaround for this, the pre-emption timeout
for compute capable engines was disabled. This is undesirable with
GuC
submission as it prevents per engine reset of hung contexts. Hence
the
next patch will re-enable the timeout but bumped up by an order of
magnitude.
However, the heartbeat might not respect that. Depending upon current
activity, a pre-emption to the heartbeat pulse might not even be
attempted until the last heartbeat period. Which means that only one
period is granted for the pre-emption to occur. With the aforesaid
bump, the pre-emption timeout could be significantly larger than this
heartbeat period.
So adjust the heartbeat code to take the pre-emption timeout into
account. When it reaches the final (high priority) period, it now
ensures the delay before hitting reset is bigger than the pre-emption
timeout.
v2: Fix for selftests which adjust the heartbeat period manually.
Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx>
---
.../gpu/drm/i915/gt/intel_engine_heartbeat.c | 19
+++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index a3698f611f457..823a790a0e2ae 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -22,9 +22,28 @@
static bool next_heartbeat(struct intel_engine_cs *engine)
{
+ struct i915_request *rq;
long delay;
delay = READ_ONCE(engine->props.heartbeat_interval_ms);
+
+ rq = engine->heartbeat.systole;
+
+ if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER &&
+ delay == engine->defaults.heartbeat_interval_ms) {
Maybe I forgot but what is the reason for the check against the
default heartbeat interval?
That's the 'v2: fix for selftests that manually adjust the
heartbeat'. If something (or someone) has explicitly set an override
of the heartbeat then it has to be assumed that they know what they
are doing, and if things don't work any more that's their problem.
But if we don't respect their override then they won't get the
timings they expect and the selftest will fail.
Isn't this a bit too strict for the non-selftest case? If the new
concept is extending the last pulse to guarantee preemption, then I
think we could allow tweaking of the heartbeat period. Like what if
user wants 1s, or 10s instead of 2.5s - why would that need to break
the improvement from this patch?
Then the user is back to where they were before this patch.
In what ways selftests fail? Are they trying to guess time to reset
based on the hearbeat period set? If so perhaps add a helper to query
it based on the last pulse extension.
I don't recall. It was six months ago when I was actually working on
this. And right now I do not have the time to go back and re-run all
the testing and re-write a bunch of self tests with whole new helpers
and algorithms and whatever else might be necessary to polish this to
perfection. And in the meantime, all the existing issues are still
present - there is no range checking on any of this stuff, it is very
possible for a driver with default settings to break a legal workload
because the heartbeat and pre-emption are fighting with each other, we
don't even have per engine resets enabled, etc.
Maybe it could be even better with a follow up patch. Feel free to do
that. But as it stands, this patch set significantly improves the
situation without making anything worse.
As we seem to be in agreement that the check against default heartbeat
is a hack with only purpose to work around assumptions made by
selftests, then please file a Jira about removing it (this hack). Then
work can be assigned to someone to clean it up. With that done I would
agree the series is indeed an improvement and it would have my ack.
One more thing - put a comment in the code along the lines of
"FIXME/HACK: Work around selftests assumptions by only extending the
last heartbeat if the period is at default value". The the Jira can
associate to that comment.
Until that is resolve it may also be worth emitting a drm_notice if
heartbeat is changed via sysfs? Informing users the things will not work
as expected if they fiddle with it. Whether as a blanket warning or
checking first the 3-4x heartbeat vs preempt timeout value. That message
should then go away once the follow up work to fix the selftests is
done. See what the other reviewers will think.
Regards,
Tvrtko