On 3/3/2022 01:55, Tvrtko Ursulin wrote:
On 02/03/2022 17:55, John Harrison wrote:
I was assuming 2.5s tP is enough and basing all calculations on that.
Heartbeat or timeslicing regardless. I thought we established
neither of us knows how long is enough.
Are you now saying 2.5s is definitely not enough? How is that usable
for a default out of the box desktop?
Show me your proof that 2.5s is enough.
7.5s is what we have been using internally for a very long time. It
has approval from all relevant parties. If you wish to pick a new
random number then please provide data to back it up along with buy
in from all UMD teams and project management.
And upstream disabled preemption has acks from compute. "Internally"
is as far away from the out of the box desktop experience we have been
arguing about as it gets. In fact, you have argued that compute
disables the heartbeat anyway.
Let's jump to the end of this reply for actions, please.
And I don't have a problem with extending the last pulse. It is
fundamentally correct to do regardless of the backend. I just raised
the question of whether to extend all heartbeats to account for
preemption (and scheduling delays). (What is the point of bumping
their priority and re-scheduling if we didn't give the engine enough
time to react? So it is the opposite of the question you raise.)
The point is that we are giving enough time to react. Raising the
priority of a pre-emption that has already been triggered will have
no effect. So as long as the total time from when the pre-emption is
triggered (prio becomes sufficiently high) to the point when the
reset is decided is longer than the pre-emption timeout then it
works. Given that, it is unnecessary to increase the intermediate
periods. It has no advantage and has the disadvantage of making the
total time unreasonably long.
So again, what is the point of making every period longer? What
benefit does it *actually* give?
Less special casing, and no pointless prio bumps before the engine has
even had time to react. You wouldn't have to make the last pulse 2 * tP
but the normal tH + tP. So again, it is nicer for me to derive all
heartbeat pulses from the same input parameters.
The whole "it is very long" argument is IMO moot because the now
proposed 7.5s preempt period is, I suspect, wholly impractical for
desktop. Combined with the argument that real compute disables
heartbeats anyway, even more so.
The whole thing is totally fubar already. Right now pre-emption is
totally disabled. So you are currently waiting for the entire heartbeat
sequence to complete and then nuking the entire machine. So arguing that
7.5s is too long is pointless. 7.5s is a big improvement over what is
currently enabled.
And 'nice' would be having hardware that worked in a sensible manner.
There is no nice here. There is only 'what is the least worst option'.
And the least worst option for an end user is a long pre-emption timeout
with a not massively long heartbeat. If that means a very slight
complication in the heartbeat code, that is a trivial concern.
Fine. "tP(RCS) = 7500;" can I merge the patch now?
I could live with setting the preempt timeout to 7.5s. The downside is
a slower time to restoring frozen desktops. Worst case today is 5 * 2.5s;
with the changes it is 4 * 2.5s + 2 * 7.5s; so from 12.5s to 25s, a doubling.
But that is the worst case scenario (when something much more severe
than an application hang has occurred). The regular case would be the
second heartbeat period plus the pre-emption timeout, and an engine-only
reset rather than a full GT reset. So it's still better than what we
have at present.
Actions:
1)
Get a number from compute/OpenCL people for what they say is minimum
preempt timeout for default out of the box Linux desktop experience.
That would be the one that has been agreed upon by both Linux software
arch and all UMD teams, and has been in use for the past year or more in
the internal tree.
This does not mean them running some tests where they can't be bothered
to set up the machine for the extreme use cases, but workloads average
users can realistically be expected to run.
Say for instance some image manipulation software which is OpenCL
accelerated, or similar. How long are un-preemptable sections expected
to be there? I am not familiar with all the OpenCL accelerated use
cases there are on Linux.
And this number should be purely about the minimum preempt timeout, not
considering heartbeats. This is because the preempt timeout may kick in
sooner than a stopped heartbeat if the user workload is low priority.
And the driver is simply hosed in the intervening six months or more
that it takes for the right people to find the time to do this.
Right now, it is broken. This patch set improves things. Actual numbers
can be refined later as/when some random use case that we hadn't
previously thought of pops up. But not fixing the basic problem at all
until we have an absolutely perfect for all parties solution is
pointless. Not least because there is no perfect solution. No matter
what number you pick it is going to be wrong for someone.
2.5s, 7.5s, X.Ys, I really don't care. 2.5s is a number you seem to have
picked out of the air totally at random, or maybe based on it being the
heartbeat period (except that you keep arguing that basing tP on tH is
wrong). 7.5s is a number that has been in active use for a lot of
testing for quite some time - KMD CI, UMD CI, E2E, etc. But either way,
the initial number is almost irrelevant as long as it is not zero. So
can we please just get something merged now as a starting point?
2)
Commit message should explain the effect on the worst case time until
engine reset.
3)
OpenCL/compute should ack the change publicly as well since they acked
the disabling of preemption.
This patch set has already been publicly acked by the compute team. See
the 'acked-by' tag.
4)
I really want overflows_type in the first patch.
In the final GuC assignment? Only if it is a BUG_ON. If we get a failure
there it is an internal driver error and cannot be corrected for. It is
too late for any plausible range check action.
And if you mean in the actual helper function with the rest of the
clamping then you are bleeding internal GuC API structure details into
non-GuC code. Plus the test would be right next to the 'if (size <
OFFICIAL_GUC_RANGE_LIMIT)' test which just looks dumb as well as being
redundant duplication - "if ((value < GUC_LIMIT) && (value <
NO_WE_REALLY_MEAN_IT_GUC_LIMIT))". And putting it inside the GuC limit
definition looks even worse "#define LIMIT min(MAX_U32, 100*1000) /*
because the developer doesn't know how big a u32 is */".
John.
My position is that with the above satisfied it is okay to merge.
Regards,
Tvrtko