On 1/11/2019 09:31, Antonio Argenziano wrote:
On 11/01/19 00:22, Tvrtko Ursulin wrote:
On 11/01/2019 00:47, Antonio Argenziano wrote:
On 07/01/19 08:58, Tvrtko Ursulin wrote:
On 07/01/2019 13:57, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
On 07/01/2019 11:58, Tvrtko Ursulin wrote:
[snip]
Note about future interaction with preemption: Preemption could happen in a command sequence prior to watchdog counter getting disabled, resulting in watchdog being triggered following preemption (e.g. when watchdog had been enabled in the low priority batch). The driver will need to explicitly disable the watchdog counter as part of the preemption sequence.
Does the series take care of preemption?
I did not find that it does.
Oh. I hoped that the watchdog was saved as part of the context... Then despite preemption, the timeout would resume from where we left off as soon as it was back on the gpu.
If the remaining timeout was context saved it would be much simpler (at least on first glance), so please say it is.
I made my comments going only by the text of the commit message and the absence of any special handling for preemption.
Having read the spec, the situation seems like this:
* Watchdog control and threshold registers are context saved and restored.
* On a context switch the watchdog counter is reset to zero and automatically disabled, until re-enabled either by a context restore or explicitly.
So it sounds like the commit message could be wrong that special handling is needed from this direction. But read to the end for the restriction listed below.
* Watchdog counter is reset to zero and is not accumulated across multiple submissions of the same context (due to preemption).
I read this as: after preemption a context gets a new full timeout allocation. Or in other words, if a context is preempted N times, its cumulative watchdog timeout will be N * the set value.
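If that reading is right, a toy model makes the effect easy to see. This is just a plain C sketch of the behaviour as described above, not driver code; the threshold, slice length and preemption count are made-up numbers:

#include <stdio.h>

/*
 * Toy model of the restriction quoted above: the hardware watchdog
 * counter is reset to zero on every context switch, so it only ever
 * measures the time since the last resubmission, never the cumulative
 * run time of the batch.  All numbers below are made up.
 */
int main(void)
{
	const unsigned int threshold_ms = 100;	/* hypothetical watchdog threshold */
	const unsigned int slice_ms = 90;	/* active time before each preemption */
	const unsigned int preemptions = 10;	/* N preempt/resubmit cycles */
	unsigned int counter_ms, total_ms = 0;

	for (unsigned int i = 0; i < preemptions; i++) {
		counter_ms = 0;			/* reset on context save/restore */
		counter_ms += slice_ms;
		total_ms += slice_ms;

		if (counter_ms >= threshold_ms)
			printf("watchdog fired on submission %u\n", i);
	}

	/* 10 x 90ms = 900ms of GPU time, yet the 100ms watchdog never fired. */
	printf("cumulative active time: %u ms, watchdog threshold: %u ms\n",
	       total_ms, threshold_ms);
	return 0;
}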
This could theoretically be exploitable to bypass the timeout. If a client sets up two contexts with prio -1 and -2, and keeps submitting periodic no-op batches against the prio -1 context, while the prio -2 context is its own hog, then the prio -2 context defeats the watchdog timer. I think... I would appreciate it if someone challenged this conclusion.
I think you are right that it is a possibility, but is that a problem? The client could just not set the threshold in the first place to bypass the timeout. Also, because the hanging batch has to remain preemptible, you cannot disrupt any work from another client that is higher priority. This is
But I think a higher priority client can have the same effect on the lower priority one purely by accident, no?
As a real world example, a user kicks off a background transcoding job, which happens to use prio -2 and uses the watchdog timer. At the same time the user watches a video in a player of normal priority. This causes periodic, say 24Hz, preemption events, due to frame decoding activity on the same engine as the transcoding client.
The question is whether this defeats the watchdog timer for the former. And then, whether we can do something about it, and whether it really isn't a problem?
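(To put rough numbers on that, purely as an illustration since no actual threshold value is specified here: 24Hz playback would preempt the transcoding context roughly every 1000 / 24 ≈ 42 ms, so the watchdog counter is reset before it can accumulate much more than ~42 ms. Any threshold set above ~42 ms would then never be reached, no matter how long the transcoding batch actually stays stuck.)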
I guess it depends on whether you consider that timeout to be the maximum lifespan a workload can have, or the maximum contiguous active time.
I believe the intended purpose of the watchdog is to prevent broken bitstreams from hanging the transcoder/player. That is, it is a form of error detection used by the media driver to handle bad user input. So if there is a way for the watchdog to be extended indefinitely under normal circumstances, that would be a problem. It means the transcoder will not detect the broken input data in a timely manner and will effectively hang rather than skip over to the next packet. And note that broken input data can be caused by something as innocent as a dropped packet due to high network contention. No need for any malicious activity at all.
John.