On 11/01/19 00:22, Tvrtko Ursulin wrote:
On 11/01/2019 00:47, Antonio Argenziano wrote:
On 07/01/19 08:58, Tvrtko Ursulin wrote:
On 07/01/2019 13:57, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
On 07/01/2019 11:58, Tvrtko Ursulin wrote:
[snip]
Note about future interaction with preemption: Preemption could
happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption (e.g.
when
watchdog had been enabled in the low priority batch). The driver
will
need to explicitly disable the watchdog counter as part of the
preemption sequence.
Does the series take care of preemption?
I did not find that it does.
Oh. I hoped that the watchdog was saved as part of the context... Then
despite preemption, the timeout would resume from where we left off as
soon as it was back on the gpu.
If the timeout remaining was context saved it would be much simpler (at
least on first glance), please say it is.
I made my comments going only by the text from the commit message and
the absence of any preemption special handling.
Having read the spec, the situation seems like this:
* Watchdog control and threshold register are context saved and
restored.
* On a context switch watchdog counter is reset to zero and
automatically disabled until enabled by a context restore or explicitly.
So it sounds the commit message could be wrong that special handling
is needed from this direction. But read till the end on the
restriction listed.
* Watchdog counter is reset to zero and is not accumulated across
multiple submission of the same context (due preemption).
I read this as - after preemption contexts gets a new full timeout
allocation. Or in other words, if a context is preempted N times,
it's cumulative watchdog timeout will be N * set value.
This could be theoretically exploitable to bypass the timeout. If a
client sets up two contexts with prio -1 and -2, and keeps submitting
periodical no-op batches against prio -1 context, while prio -2 is
it's own hog, then prio -2 context defeats the watchdog timer. I
think.. would appreciate is someone challenged this conclusion.
I think you are right that is a possibility but, is that a problem?
The client can just not set the threshold to bypass the timeout. Also
because you need the hanging batch to be simply preemptible, you
cannot disrupt any work from another client that is higher priority.
This is
But I think higher priority client can have the same effect on the lower
priority purely by accident, no?
As a real world example, user kicks off an background transcoding job,
which happens to use prio -2, and uses the watchdog timer.
At the same time user watches a video from a player of normal priority.
This causes periodic, say 24Hz, preemption events, due frame decoding
activity on the same engine as the transcoding client.
Does this defeat the watchdog timer for the former is the question? Then
the questions of can we do something about it and whether it really
isn't a problem?
I guess it depends if you consider that timeout as the maximum lifespan
a workload can have or max contiguous active time.
Maybe it is not disrupting higher priority clients but it is causing an
time unbound power drain.
pretty much the same behavior of hangcheck IIRC so something we
already accept.
You mean today hangcheck wouldn't notice a hanging batch in the same
scenario as above? If so it sounds like a huge gap we need to try and fix.
My understanding of it is that we only keep a record of what was running
the last time hangcheck was run so it is possible to trick it into
resetting when a preemption occurs but I could be missing something.
And finally there is one programming restriction which says:
* SW must not preempt the workload which has watchdog enabled.
Either it must:
a) disable preemption for that workload completely, or
b) disable the watchdog via mmio write before any write to ELSP
This seems it contradiction with the statement that the counter gets
disabled on context switch and stays disabled.
I did not spot anything like this in the series. So it would seem the
commit message is correct after all.
It would be good if someone could re-read the bspec text on register
0x2178 to double check what I wrote.
The way I read it is that the restriction applies only to some
platforms where the 'normal' description doesn't apply.
You are right. Are the listed parts in the field so the series would
have to handle this or we can ignore it?
I think there is something we need to handle e.g. BXT.
Antonio
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx