On 3/1/2022 04:09, Tvrtko Ursulin wrote:
I'll trim it a bit again..
On 28/02/2022 18:55, John Harrison wrote:
On 2/28/2022 09:12, Tvrtko Ursulin wrote:
On 25/02/2022 18:48, John Harrison wrote:
On 2/25/2022 10:14, Tvrtko Ursulin wrote:
[snip]
Your only objection is that it ends up with too long a total time
before reset? Or something else as well?
An unnecessarily long total heartbeat timeout is the main
objection. (2.5 + 12) * 5 = 72.5 seconds. That is a massive change
from the current 12.5s.
If we are happy with that huge increase then fine. But I'm pretty
sure you are going to get a lot more bug reports about hung systems
not recovering. 10-20s is just about long enough for someone to
wait before leaning on the power button of their machine. Over a
minute is not. That kind of delay is going to cause support issues.
Sorry I wrote 12s, while you actually said tP * 12, so 7.68s, chosen
just so it is longer than tH * 3?
And how do you keep coming up with a factor of five? Isn't it four
periods before "heartbeat stopped"? (Prio normal, heartbeat, barrier
and then reset.)
Prio starts at low not normal.
Right, slipped my mind since I only keep seeing that one priority
ladder block in intel_engine_heartbeat.c/heartbeat()..
From the point of view of user experience I agree reasonable
responsiveness is needed before the user "reaches for the power button".
In your proposal we are talking about 3 * 2.5s + 2 * 7.5s, so 22.5s.
Question of workloads.. what is the actual preempt timeout that compute
is happy with? And I don't mean compute setups with disabled
hangcheck, which you say they want anyway, but if we target defaults
for end users. Do we have some numbers on what they are likely to run?
Not that I have ever seen. This is all just finger in the air stuff.
I don't recall if we invented the number and the compute people
agreed with it or if they proposed the number to us.
Yeah me neither. And found nothing in my email archives. :(
Thinking about it today I don't see that disabled timeout is a
practical default.
With it, if users have something un-preemptable to run (assuming prio
normal), it would get killed after ~13s (5 * 2.5).
If we go for my scheme it gets killed in ~17.5s (3 * (2.5 + 2.5) + 2.5
(third pulse triggers preempt timeout)).
And if we go for your scheme it gets killed in ~22.5s (3 * 2.5 + 2 * 3
* 2.5).
Erm, that is not an apples to apples comparison. Your 17.5s is for an
engine reset tripped by the pre-emption timeout, but your 22.5s is for a
GT reset tripped by the heartbeat reaching the end and nuking the universe.
If you are saying that the first pulse at sufficient priority (the
third, being normal prio) is what causes the reset, because the system
is working as expected and the pre-emption timeout trips the reset,
then you have two periods to get to normal prio plus one pre-emption
timeout to trip the reset. I.e. (tH * 2) + tP.
Your scheme is then tH(actual) = tH(user) + tP, yes?
So pre-emption based reset is after ((tH(user) + tP) * 2) + tP => (3 *
tP) + (2 * tH)
And GT based reset is after (tH(user) + tP) * 5 => (5 * tP) + (5 * tH)
My scheme is tH(actual) = tH(user) for first four, then max(tH(user),
tP) for fifth.
So pre-emption based reset is after tH(user) * 2 + tP => tP + (2 * tH);
And GT based reset is after (tH(user) * 4) + (max(tH(user), tP) * 1) =>
greater of ((4 * tH) + tP) or (5 * tH)
Either way your scheme is longer. With tH(user) = 2.5s, tP(RCS) = 7.5s,
we get 27.5s for engine and 50s for GT versus my 12.5s for engine and
17.5s for GT. With tP(RCS) = 2.5s, yours is 12.5s for engine and 25s for
GT versus my 7.5s for engine and 12.5s for GT.
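
(For reference, that arithmetic in a quick stand-alone sketch. This is
purely illustrative, not driver code; "scheme_a"/"scheme_b" are just
labels for the two proposals compared above.)

/* Worst-case reset latencies for the two heartbeat schemes discussed
 * above. scheme_a: tH(actual) = tH(user) + tP for every period.
 * scheme_b: tH(user) for the first four periods, max(tH(user), tP)
 * for the fifth. */
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

static void compare(double tH, double tP)
{
	double a_engine = (3 * tP) + (2 * tH);	/* ((tH + tP) * 2) + tP */
	double a_gt     = 5 * (tH + tP);
	double b_engine = tP + (2 * tH);	/* (tH * 2) + tP */
	double b_gt     = max2((4 * tH) + tP, 5 * tH);

	printf("tH=%.1f tP=%.1f: A engine=%.1f GT=%.1f, B engine=%.1f GT=%.1f\n",
	       tH, tP, a_engine, a_gt, b_engine, b_gt);
}

int main(void)
{
	compare(2.5, 7.5);	/* A: 27.5 / 50.0, B: 12.5 / 17.5 */
	compare(2.5, 2.5);	/* A: 12.5 / 25.0, B:  7.5 / 12.5 */
	return 0;
}
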
Plus, not sure why your calculations above are using 2.5 for tP? Are you
still arguing that 7.5s is too long? That is a separate issue and not
related to the heartbeat algorithms. tP must be long enough to allow
'out of box OpenCL workloads to complete'. That doesn't just mean not
being killed by the heartbeat, it also means not being killed by running
two of them concurrently (or one plus desktop OpenGL rendering) and not
having it killed by basic time slicing between the two contexts. The
heartbeat is not involved in that process. That is purely the
pre-emption timeout. And that is the fundamental reason why tP needs to
be much larger on RCS/CCS.
If I did not confuse any calculation this time round, then the
differences for the default case at normal priority sound pretty
immaterial to me.
What if we gave them a default of 2.5s? That would be 4 * (2.5s +
2.5s) = 20s worst case until reset, comparable to your proposal. Are
there realistic workloads which are non-preemptable for 2.5s? I am
having a hard time imagining someone would run them on a general
purpose desktop since it would mean a frozen and unusable UI anyway.
The advantage, still, to my mind is that there would be no fudging of
timeouts during driver load and no heartbeat periods depending on
priority. To me it feels more plausible to account for the preempt
timeout in heartbeat pulses than to calculate the preempt timeout to be
longer than heartbeat pulses, just to avoid races between the two.
Except that when the user asks for a heartbeat period of 2.5s you are
actually setting it to 5s. How is that not a major fudge that is
totally disregarding the user's request?
This is indeed the core question. My thinking:
It is not defined in the heartbeat ABI that N pulses should do
anything, just that they are periodic pulses which check the health of
an engine.
If we view user priority as not under our control then we can say that
any heartbeat pulse can trigger preempt timeout and we should let it
do that.
From that it follows that it is justified to account for preempt
timeout in the decision when to schedule heartbeat pulses and that it
is legitimate to do it for all of them.
But it can be optimised to say that it doesn't matter if you bump the
priority of a pulse before waiting for the pre-emption period to expire.
If the pulse was already high enough prio then the pre-emption has
already been triggered and bumping the prio has no effect. If it was
too low then waiting for longer has no benefit at all.
All that matters is that you don't hit the end stop and trigger the GT
reset too early.
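
(Again only as an illustration, not the actual heartbeat code: a
stand-alone sketch of the pulse schedule being described, assuming a
simplified five-step priority ladder and the Gen12 RCS example numbers.
Only the wait in front of the end stop is stretched to cover tP.)

/* Simplified heartbeat timeline: a pulse goes out every tH at the next
 * priority level, and only the final wait before the end stop also
 * covers the pre-emption timeout, so the GT reset cannot fire before
 * an engine reset that an earlier pulse may already have set in
 * motion. */
#include <stdio.h>

int main(void)
{
	const double tH = 2.5, tP = 7.5;	/* example values */
	const int npulses = 5;			/* simplified ladder */
	double t = 0.0;
	int i;

	for (i = 0; i < npulses; i++) {
		printf("t=%5.1fs: pulse %d sent\n", t, i);
		/* only the last period also waits out tP */
		t += (i == npulses - 1) ? (tP > tH ? tP : tH) : tH;
	}
	printf("t=%5.1fs: end stop -> heartbeat declares the engine dead\n", t);
	return 0;
}

With those example numbers the end stop lands at 17.5s, matching the GT
figure above.
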
It also avoids the double reset problem, regardless of the backend and
regardless of how the user configured the timeouts, without the need
to fudge them either during driver load or during sysfs store.
The user has configured that heartbeat pulses should be sent every N
seconds, yes, but I think we are free to account for inherent hardware
and software latencies in our calculations. Especially since, other
than the flawed Gen12 RCS, other engines will be much closer to the
configured period.
It is just the same as the user asking for preempt timeout N and us
saying on driver load "oh no you won't get it". Same for heartbeats:
they said 2.5s, we said 2.5s + broken engine factor...
Why would you not get the pre-emption timeout? Because it is too large?
That is a physical limitation (of the GuC firmware) not an override
because we think we know better. If we can obey the user then we should
do so.
I don't see a problem there. Nothing timing sensitive relies on the
heartbeat interval, nor have we provided any guarantees.
That patch from Chris, for instance, AFAIR accounted for scheduling or
context switch latencies. Because what is the point of sending further
elevated-priority pulses if we did not give the engine enough time to
schedule them in, react with preemption, or signal completion?
Persistence itself can stay. There are valid UMD use cases. It
is just massively over-complicated and doesn't work in all
corner cases when not using execlist submission or on newer
platforms. The simplification that is planned is to allow
contexts to persist until the associated DRM master handle is
closed. At that point, all contexts associated with that DRM
handle are killed. That is what AMD and others apparently
implement.
Okay, that goes against one recent IGT patch which appeared to
work around something by moving the position of _context_ close.
No it does not. The context close is not the trigger. The trigger is
the DRM handle close.
Well the patch says:
"""
The spin all test relied on context persistence unecessarily by
trying
to destroy contexts while keeping spinners active.
The current implementation of context persistence in i915 can cause
failures with GuC enabled, and persistence is not needed for this
test.
Moving intel_ctx_destroy after igt_spin_end.
"""
Implying moving context close to after spin end fixes things for
GuC, not fd close.
That's because persistence is currently a big pile of poo and does
not work in all the corner cases. The correct solution is to leave
the IGT alone and just fix the implementation of persistence.
However, the IGT update to not use the broken feature is a trivial
test change (two lines?) whereas fixing the broken feature is a
significant KMD re-work. It needs to be done but no-one currently
has the time to do it. But trivially changing the test allows the
test to work and test the features it is meant to be testing (which
is not persistence).
Clear as mud. If the statement is that persistence cannot simply be
removed, then for sure it cannot be said that anything can be fixed
or unblocked by allowing some test to pass with GuC, by making them
avoid using persistence (and not even explicitly with a context
param). It implies persistence does not work with the GuC, which is
then in contradiction with the statement that we cannot just remove
persistence. I truly have no idea what the argument is here.
Persistence works in the right set of circumstances. Those
circumstances do not involve dynamically changing heartbeat settings,
platforms with dependent engines, etc. The correct fix is to leave
the IGT test alone and fix the persistence implementation. However,
that is not trivial and we have many other high priority holes still
to plug. Whereas changing the IGT to not use a feature it is not
intended to be testing anyway is a trivial change and gets us the
test coverage of what that IGT is meant to be for.
It may be acceptable if someone is reviewing overall coverage and
making sure it is not all removed, so that a missing ABI in the GuC
backend is not swept under the carpet. That's my main concern. If it
is acknowledged that persistence is a needed ABI, then how can we
upstream dependent engine support without making sure this ABI is
respected? Removing its use from random tests does not fill me with
confidence that we are on top of this topic.
Maybe if it didn't take three+ weeks to get a trivial change merged then
I might have time to work on the non-trivial tasks that are in the backlog.
John.
Regards,
Tvrtko