On 03/03/2022 19:09, John Harrison wrote:
Actions:
1)
Get a number from compute/OpenCL people for what they say is minimum
preempt timeout for default out of the box Linux desktop experience.
That would be the one that has been agreed upon by both linux software
arch and all UMD teams and has been in use for the past year or more in
the internal tree.
What has been used in the internal tree is irrelevant when UMD ack is needed for changes which affect upstream shipping platforms like Tigerlake.
This does not mean them running some tests and can't be bothered to
set up the machine for the extreme use cases, but workloads average
users can realistically be expected to run.
Say for instance some image manipulation software which is OpenCL
accelerated or similar. How long are unpreemptable sections expected
to be there? Or similar. I am not familiar with all the OpenCL
accelerated use cases there are on Linux.
And this number should be purely about minimum preempt timeout, not
considering heartbeats. This is because preempt timeout may kick in
sooner than stopped heartbeat if the user workload is low priority.
And driver is simply hosed in the intervening six months or more that it
takes for the right people to find the time to do this.
What is hosed? The driver currently contains a patch, acked by the compute UMD, to disable preemption. If it takes six months for compute UMD to give us a different number which works for the open source stack and typical use cases, then what can we do?
Right now, it is broken. This patch set improves things. Actual numbers
can be refined later as/when some random use case that we hadn't
previously thought of pops up. But not fixing the basic problem at all
until we have an absolutely perfect for all parties solution is
pointless. Not least because there is no perfect solution. No matter
what number you pick it is going to be wrong for someone.
2.5s, 7.5s, X.Ys, I really don't care. 2.5s is a number you seem to have
picked out of the air totally at random, or maybe based on it being the
heartbeat period (except that you keep arguing that basing tP on tH is
wrong). 7.5s is a number that has been in active use for a lot of
testing for quite some time - KMD CI, UMD CI, E2E, etc. But either way,
the initial number is almost irrelevant as long as it is not zero. So
can we please just get something merged now as a starting point?
2)
Commit message should explain the effect on the worst case time until
engine reset.
3)
OpenCL/compute should ack the change publicly as well since they acked
the disabling of preemption.
This patch set has already been publicly acked by the compute team. See
the 'acked-by' tag.
I can't find the reply which contained the ack on the mailing list - do you have a msg-id or an archive link?
Also, the ack needs to be against the fixed timeout patch and not one dependent on the heartbeat interval.
4)
I really want overflows_type in the first patch.
In the final GuC assignment? Only if it is a BUG_ON. If we get a failure
there it is an internal driver error and cannot be corrected for. It is
too late for any plausible range check action.
If you can find a test which exercises setting insane values to the relevant timeouts, and so would hit the problem in our CI, then BUG_ON is fine. Otherwise I think BUG_ON is too anti-social and I prefer drm_warn or drm_WARN_ON. I don't think adding a test is strictly necessary if we don't already have one, given how unlikely this is to be hit, but if you insist on a BUG_ON instead of a flavour of a warn then I think we need one so we can catch it in CI 100% of the time.
And if you mean in the actual helper function with the rest of the
clamping then you are bleeding internal GuC API structure details into
non-GuC code. Plus the test would be right next to the 'if (size <
In my other reply I exactly described that would be a downside and that I prefer checks at the assignment sites.
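To make the "warn at the assignment site" idea concrete, here is a minimal userspace sketch. It is not the driver code: struct fw_policy, overflows_u32() and set_preempt_timeout() are hypothetical names, fprintf() stands in for drm_warn, and saturating to the maximum on overflow is my assumption about what a non-fatal fallback could look like.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a firmware-facing field that is only 32 bits wide. */
struct fw_policy {
	uint32_t preempt_timeout_us;
};

/* True if the value cannot be represented in a 32-bit field. */
static bool overflows_u32(uint64_t value)
{
	return value > UINT32_MAX;
}

/*
 * Check where the narrowing assignment happens. On overflow, warn and
 * saturate (an assumed fallback) instead of taking the machine down.
 * Returns false if the value had to be adjusted.
 */
static bool set_preempt_timeout(struct fw_policy *p, uint64_t timeout_us)
{
	if (overflows_u32(timeout_us)) {
		fprintf(stderr,
			"warn: preempt timeout %llu us does not fit in 32 bits, saturating\n",
			(unsigned long long)timeout_us);
		p->preempt_timeout_us = UINT32_MAX;
		return false;
	}

	p->preempt_timeout_us = (uint32_t)timeout_us;
	return true;
}
```

The point of the shape is that the 32-bit detail stays next to the 32-bit field, so the generic clamping helper does not need to know about it.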
Also regarding this comment in the relevant patch:
+ /*
+ * NB: The GuC API only supports 32bit values. However, the limit is further
+ * reduced due to internal calculations which would otherwise overflow.
+ */
I would suggest clarifying this as "The GuC API only supports timeouts up to U32_MAX microseconds. However, ...". Given that the function at hand deals in milliseconds, explicitly calling out that additional scaling factor makes sense, I think.
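For illustration only, the scaling factor works out roughly like this. The helper names are hypothetical, and the sketch deliberately ignores the further reduction the patch comment mentions for internal calculations, since that number is not given here.

```c
#include <stdint.h>

#define USEC_PER_MSEC 1000u

/* Largest millisecond value whose microsecond equivalent still fits in 32 bits. */
static uint32_t max_timeout_ms(void)
{
	return UINT32_MAX / USEC_PER_MSEC;
}

/* Clamp a millisecond timeout, then convert it to microseconds without overflow. */
static uint32_t timeout_ms_to_us(uint64_t ms)
{
	if (ms > max_timeout_ms())
		ms = max_timeout_ms();

	return (uint32_t)ms * USEC_PER_MSEC;
}
```

So a 32-bit microsecond field tops out at a bit over 4.29 million milliseconds, which is the unit the reader of that function actually sees.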
Big picture - it's really still very simple. Public ack for a fixed number and a warn on is not really a lot to ask.
Regards,
Tvrtko