Re: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts

On 3/3/2022 01:55, Tvrtko Ursulin wrote:
On 02/03/2022 17:55, John Harrison wrote:

I was assuming 2.5s tP is enough and basing all calculations on that. Heartbeat or timeslicing regardless. I thought we had established that neither of us knows how long is enough.

Are you now saying 2.5s is definitely not enough? How is that usable for a default out of the box desktop?
Show me your proof that 2.5s is enough.

7.5s is what we have been using internally for a very long time. It has approval from all relevant parties. If you wish to pick a new random number then please provide data to back it up along with buy in from all UMD teams and project management.

And upstream disabled preemption has acks from compute. "Internally" is as far away as you can get from the out of the box desktop experience we have been arguing about. In fact you have argued compute disables the heartbeat anyway.

Let's jump to the end of this reply please for actions.

And I don't have a problem with extending the last pulse. It is fundamentally correct to do regardless of the backend. I just raised the question of whether to extend all heartbeats to account for preemption (and scheduling delays). (What is the point of bumping their priority and re-scheduling if we didn't give enough time to the engine to react? So opposite of the question you raise.)
The point is that we are giving enough time to react. Raising the priority of a pre-emption that has already been triggered will have no effect. So as long as the total time from when the pre-emption is triggered (prio becomes sufficiently high) to the point when the reset is decided is longer than the pre-emption timeout, then it works. Given that, it is unnecessary to increase the intermediate periods. It has no advantage and has the disadvantage of making the total time unreasonably long.

So again, what is the point of making every period longer? What benefit does it *actually* give?

Less special casing, and no pointless prio bumps before the engine has even had time to react. You wouldn't have to have the last pulse be 2 * tP but the normal tH + tP. So again, it is nicer for me to derive all heartbeat pulses from the same input parameters.

The whole "it is very long" argument is IMO moot because the now-proposed 7.5s preempt period is, I suspect, wholly impractical for desktop. Combined with the argument that real compute workloads disable heartbeats anyway, even more so.
The whole thing is totally fubar already. Right now pre-emption is totally disabled, so you are currently waiting for the entire heartbeat sequence to complete and then nuking the entire machine. Arguing that 7.5s is too long is therefore pointless. 7.5s is a big improvement over what is currently enabled.

And 'nice' would be having hardware that worked in a sensible manner. There is no nice here. There is only 'what is the least worst option'. And the least worst option for an end user is a long pre-emption timeout with a not massively long heartbeat. If that means a very slight complication in the heartbeat code, that is a trivial concern.



Fine. "tP(RCS) = 7500;" can I merge the patch now?
I could live with setting the preempt timeout to 7.5s. The downside is slower time to restoring frozen desktops. Worst case today 5 * 2.5s, with the changes 4 * 2.5s + 2 * 7.5s; so from 12.5s to 25s, doubling.
But that is worst case scenario (when something much more severe than an application hang has occurred). Regular case would be second heartbeat period + pre-emption timeout and an engine only reset not a full GT reset. So it's still better than what we have at present.


Actions:

1)
Get a number from compute/OpenCL people for what they say is minimum preempt timeout for default out of the box Linux desktop experience.
That would be the one that has been agreed upon by both Linux software arch and all the UMD teams, and has been in use for the past year or more in the internal tree.


This does not mean them running some tests where they can't be bothered to set up the machine for the extreme use cases, but workloads average users can realistically be expected to run.

Say for instance some image manipulation software which is OpenCL accelerated, or similar. How long are unpreemptable sections expected to be there? I am not familiar with all the OpenCL accelerated use cases there are on Linux.

And this number should be purely about the minimum preempt timeout, not considering heartbeats. This is because the preempt timeout may kick in sooner than a stopped heartbeat if the user workload is low priority.

And the driver is simply hosed in the intervening six months or more that it takes for the right people to find the time to do this.

Right now, it is broken. This patch set improves things. Actual numbers can be refined later as/when some random use case that we hadn't previously thought of pops up. But not fixing the basic problem at all until we have an absolutely perfect for all parties solution is pointless. Not least because there is no perfect solution. No matter what number you pick it is going to be wrong for someone.

2.5s, 7.5s, X.Ys, I really don't care. 2.5s is a number you seem to have picked out of the air totally at random, or maybe based on it being the heartbeat period (except that you keep arguing that basing tP on tH is wrong). 7.5s is a number that has been in active use for a lot of testing for quite some time - KMD CI, UMD CI, E2E, etc. But either way, the initial number is almost irrelevant as long as it is not zero. So can we please just get something merged now as a starting point?


2)
Commit message should explain the effect on the worst case time until engine reset.

3)
OpenCL/compute should ack the change publicly as well since they acked the disabling of preemption.
This patch set has already been publicly acked by the compute team. See the 'acked-by' tag.



4)
I really want overflows_type in the first patch.
In the final GuC assignment? Only if it is a BUG_ON. If we get a failure there it is an internal driver error and cannot be corrected for. It is too late for any plausible range check action.

And if you mean in the actual helper function with the rest of the clamping, then you are bleeding internal GuC API structure details into non-GuC code. Plus the test would be right next to the 'if (size < OFFICIAL_GUC_RANGE_LIMIT)' test, which just looks dumb as well as being redundant duplication - "if ((value < GUC_LIMIT) && (value < NO_WE_REALLY_MEAN_IT_GUC_LIMIT))". And putting it inside the GuC limit definition looks even worse: "#define LIMIT min(MAX_U32, 100*1000) /* because the developer doesn't know how big a u32 is */".

John.


My position is that with the above satisfied it is okay to merge.

Regards,

Tvrtko



