Re: [RFC 00/11] TDR/watchdog timeout support for gen8

On 10/07/2015 16:24, Tomas Elf wrote:
On 09/07/2015 19:47, Chris Wilson wrote:
On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
This patch series introduces the following features:

* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist
mode.
* Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
* Feature 3: Context Submission Status Consistency (CSSC) checking

The high level design is reasonable and conceptually extends the current
system in fairly obvious ways.

In terms of discussing the implementation, I think this can be phased
as:

0. Move to a per-engine hangcheck

1. Add fake-interrupt recovery for CSSC

   I think this can be done without changing the hangcheck heuristics at
   all - we just need to think carefully about recovery (which is a nice
   precursor to per-engine reset). I may be wrong, and so would like to
   be told so early on! If the fake interrupt recovery is done as part of
   the reset handler, we should have (one) fewer concurrency issues to
   worry about.

Some points about moving the CSSC out of the hang checker and into the
reset handler:

1. If we deal with consistency rectification in the reset handler the
turnaround time becomes REALLY long:

     a. First there is the time it takes to detect the hang and call
i915_handle_error(), which then raises the reset-in-progress flag,
preventing further submissions to the driver.

     b. Then we go all the way down the per-engine recovery path, only
to discover an inconsistency that has not been handled, fall back
immediately with -EAGAIN, lower the reset-in-progress flag, let the
system continue running and defer to the next hang check (another full
hang detection period).

     c. Once the hang has been detected AGAIN, we raise the
reset-in-progress flag AGAIN and go down the engine reset path a second
time.

     d. At the start of the engine reset path we do the second CSSC
detection and realise that we've got a stable inconsistency that we can
attempt to rectify. We can then try to rectify the inconsistency and go
through with the engine reset... AFTER we've checked that the
rectification was indeed effective! If it wasn't and the problem
remains, we have to fail per-engine recovery and fall back to full GPU
reset immediately... which we could have done from the hang checker in
the first place, simply by refusing to schedule hang recovery, letting
the context submission state inconsistency persist and letting the hang
score keep rising until it reached 2*HUNG, at which point the full GPU
reset fallback would have been triggered from the hang checker (see
patch 9/11 for all of this).

As you can see, dealing with context submission state inconsistency in
the reset path is a very long-winded way of doing it and does not make
it any more reliable. It is also more complicated to analyse from a
concurrency point of view, since we have to fall back several times and
repeatedly raise and lower the reset-in-progress flag, alternately
blocking and allowing driver submissions. It basically becomes very
difficult to know what is going on.

2. Secondly, and more importantly: if a watchdog timeout is detected,
we end up in the per-engine hang recovery path and have to fall back
because of an inconsistent context submission state at that point, and
the hang checker is turned off, then we're irrecoverably hung. Watchdog
timeout is supposed to work without the periodic hang checker, but it
won't if CSSC is not ensured at all times. That is why I chose to
override the i915.enable_hangcheck flag to make sure that the hang
checker always runs the consistency pre-check and reschedules itself
while there is work pending. That way, as long as work is pending we do
consistency checking asynchronously regardless of everything else, so
that if a watchdog timeout hits we have a consistent state by the time
the watchdog timeout ends up in per-engine recovery.

Granted, if a watchdog timeout hits after we've first detected the
inconsistency but before we've had time to rectify it, this doesn't
work if the hang checker is turned off, since we cannot rely on
periodic hang checking to schedule hang recovery in that case - so in
that case we're still irrecoverably stuck. We could make a change here
and do a one-time i915.enable_hangcheck override and schedule hang
recovery following this point, if you think it's worth it.

Bottom line: The consistency checking must happen at all times and
cannot be done as a consequence of a scheduled reset if hang checking is
turned off at any point.
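
To make the intended ordering concrete, here is a minimal standalone
model of the hang checker tick (illustrative only, not the code in the
patches; HUNG, cssc_consistent() and fake_lost_ctx_irq() are made-up
names standing in for the real checks):

#include <stdbool.h>

#define HUNG 8	/* illustrative promotion threshold, not the driver's value */

struct engine_model {
	int hang_score;
	bool consistent;	/* does EXECLIST_STATUS agree with the sw queue? */
};

static bool cssc_consistent(struct engine_model *e) { return e->consistent; }
static void fake_lost_ctx_irq(struct engine_model *e) { e->consistent = true; }

enum hang_action { NO_ACTION, ENGINE_RESET, FULL_GPU_RESET };

/* One tick of the periodic checker for one engine. */
static enum hang_action hangcheck_tick(struct engine_model *e, bool stuck)
{
	if (!stuck) {
		e->hang_score = 0;
		return NO_ACTION;
	}

	e->hang_score++;

	if (!cssc_consistent(e)) {
		/* Rectify, but refuse to schedule per-engine recovery;
		 * a persistent problem still escalates via the score. */
		fake_lost_ctx_irq(e);
		return e->hang_score >= 2 * HUNG ? FULL_GPU_RESET : NO_ACTION;
	}

	if (e->hang_score >= 2 * HUNG)
		return FULL_GPU_RESET;
	if (e->hang_score >= HUNG)
		return ENGINE_RESET;
	return NO_ACTION;
}

The point of the model is just the ordering: consistency is checked and
rectified on every tick, independently of whether recovery ends up
being scheduled.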

As far as concurrency issues in the face of CSSC are concerned,
disregarding the complication of handling CSSC in the recovery path and
relying on deferring to the next hang detection, with all of the
concurrency issues that entails: the question really is what kind of
concurrency issues we're worried about. If the hang checker determines
that we've got a hang, that's a stable state. If the hang checker
consistency pre-check determines that we've got a sustained CSSC
inconsistency, that's stable too. The states are not changing, so
whatever we do will not be based on detecting the state in the middle
of a transition, and the detection won't be subject to concurrency
effects. If the hang checker decides that the inconsistency needs to be
rectified and fakes the presumably lost interrupt, and the real,
presumed lost, interrupt happens to come in at the same time, then
that's fine: the CSB buffer check in the execlist interrupt handler is
made to cope with that. We can have any number of calls to the
interrupt handler, or just one, and the outcome is supposed to be the
same - the only thing that matters is the captured context state
changes in the CSB buffer that we act upon.
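
That last property is easiest to see in a small model of the CSB
handling (again purely illustrative, not the actual interrupt handler):
only the entries between the last read pointer and the current write
pointer are acted upon, so a fake interrupt racing with the real one
simply means that one of the two calls finds nothing new to process.

#include <stdint.h>

#define CSB_ENTRIES 6	/* context status buffer depth on gen8 */

struct csb_model {
	uint32_t status[CSB_ENTRIES];
	unsigned int read_ptr;	/* last entry already handled by sw */
	unsigned int write_ptr;	/* last entry written by the hw */
};

static void handle_ctx_status(uint32_t status)
{
	(void)status;	/* stand-in for acting on a context state change */
}

/* Safe to call any number of times per hardware event. */
static void csb_handler(struct csb_model *csb)
{
	while (csb->read_ptr != csb->write_ptr) {
		csb->read_ptr = (csb->read_ptr + 1) % CSB_ENTRIES;
		handle_ctx_status(csb->status[csb->read_ptr]);
	}
}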

So I'm not entirely sure what concurrency issues might be reason enough
to move the CSSC out into the hang recovery path. In fact, I'd be more
inclined to create a second async task for it to make sure it's being
run at all times. But in that case we might as well let it stay in the
hang checker.

(In a similar vein, I think we should move the missed
   interrupt handler for legacy out of hangcheck, partly to simplify some
   very confusing code and partly so that we have fewer deviations
   between legacy/execlists paths.) It also gets us thinking about the
   consistency detection and when it is viable to do a fake-interrupt and
   when we must do a full-reset (for example, we only want to
   fake-interrupt if the consistency check says the GPU is idle, and we
   definitely want to reset everything if the GPU is executing an alien
   context.)

   A test here would be to suspend the execlists irq and wait for the
   recovery. Cheekily we could punch the irq eir by root mmio and check
   hangcheck automagically recovers.

Whilst it would be nice to add the watchdog next, since it is
conceptually quite simple and basically just a new hangcheck source with
fancy setup - fast hangcheck without soft reset makes for an easy DoS.

2. TDR

   Given that we have a consistency check and begun to extend the reset
   path, we can implement a soft reset that only skips the hung request.
   (The devil is in the details, whilst the design here looked solid, I
   think the LRC recovery code could be simplified - I didn't feel
   another requeue path was required given that we need only pretend a
   request completed (fixing up the context image as required) and then
   use the normal unqueue.) There is also quite a bit of LRC cleanup on
   the lists which would be useful here.

As far as the new requeue (or resubmission) path is concerned, you might
have a point here. The reason it's as involved as it is is probably
mostly all the validation that takes place in the resubmission path:
once the resubmission happens at the end of the per-engine hang
recovery path, we want to make extra sure that the context that
actually gets resubmitted (the head element of the queue at that point
in time) is in fact the one that was passed down from the per-engine
hang recovery path (the context at the head of the queue at the start
of hang recovery), i.e. that the state of the queue didn't change
during hang recovery. Maybe we're too paranoid here.


Ah, yes, there is one crucial difference between the normal execlists_context_unqueue() function and execlists_TDR_context_unqueue() that means that we cannot fully reuse the former one for TDR-specific purposes.

When we resubmit the context via TDR it's important that we do not increment the elsp_submitted counter or otherwise treat the resubmission as a normal submission. The reason for this is that the hardware consistently refuses to send out any kind of interrupt acknowledging the context resubmission. If you just submit the context and increment elsp_submitted for that context, like you normally do, the interrupt handler will sit around forever waiting for an interrupt that will never come for the resubmission. It will wait for - and receive - the interrupt for the original submission that caused the original hang, and also the interrupt for the context completion following hang recovery, but it won't receive an interrupt for the resubmission itself.

Also, there are two cases here:

1. Only one context was in flight when the hang happened.

The head element has elsp_submitted > 0 and the second in line has elsp_submitted == 0. Normally, the _unqueue function would pick up any contexts that are pending and just submit them, but the TDR context resubmission function should resubmit exactly the context that was hung and nothing else, just to get it out of the way. If we were to submit both pending contexts to the hardware here, the operation would accidentally cause one resubmission (the hung context) and one first-time submission (the next context in line). Not incrementing elsp_submitted for the hung context is fine, but the hardware _will_ in fact send out an interrupt for the second context, which was never submitted before. The interrupt handler would then pick up one interrupt _too many_ compared to that context's elsp_submitted count. Therefore we need to avoid this case by not picking up the second context in line.

2. Both the head context and the second context in the execlist queue were in flight when the hang happened (both have elsp_submitted > 0).

In this case we need to resubmit both contexts and increment neither of their elsp_submitted counts. The reason is that if we only resubmit the head element and not the second one, the resubmission will clear the hung context and the interrupt for the original context submission (not the resubmission) will be received. The interrupt handler will then clear the hung context from the queue, see that there are contexts pending, and unqueue the next context in line (the one that was already submitted at the time of the hang), thus submitting that context _a second time_. Doing so drives the elsp_submitted count for that context up to 2, but we will never get more than one interrupt back for it, thus causing a hang in the interrupt handler.

Basically, when doing context resubmission following the per-engine hang recovery we need to restore the exact submission state that was in progress at the time of the hang and resubmit EXACTLY the contexts that were in flight at the time. And not touch the elsp_submitted values for any of them. That is absolutely not what the normal _unqueue function does and therefore we cannot reuse it as is.

Granted, we could've passed a parameter to the normal _unqueue function to separate normal context unqueueing from TDR-specific unqueueing, but there really is not much overlap between the two cases. Also, seeing as there is value in the context validation we do in intel_execlists_TDR_context_queue(), we might as well have a separate path for TDR-specific context resubmission.
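
To spell out the difference, here is a standalone sketch of the two
unqueue flavours (illustrative only, not the code from the patches; the
struct and helper names are made up). The normal path submits the next
pending work and bumps elsp_submitted because it expects an interrupt
per submission; the TDR path replays exactly the contexts that were
already in flight (elsp_submitted > 0) and leaves the counts alone,
since no interrupt will arrive for the resubmission:

#include <stddef.h>

struct req_model {
	int elsp_submitted;	/* submissions we still expect an irq for */
	struct req_model *next;
};

static void elsp_write(struct req_model *r0, struct req_model *r1)
{
	(void)r0; (void)r1;	/* stand-in for the actual ELSP write */
}

/* Normal unqueue: submit pending work, one expected irq per submission. */
static void unqueue_model(struct req_model *queue)
{
	struct req_model *r0 = queue;
	struct req_model *r1 = r0 ? r0->next : NULL;

	if (r0)
		r0->elsp_submitted++;
	if (r1)
		r1->elsp_submitted++;
	elsp_write(r0, r1);
}

/* TDR unqueue: replay only what was in flight, touch no counters. */
static void tdr_unqueue_model(struct req_model *queue)
{
	struct req_model *r0 = queue;
	struct req_model *r1 = NULL;

	if (r0 && r0->next && r0->next->elsp_submitted > 0)
		r1 = r0->next;		/* case 2: both were in flight */

	if (r0 && r0->elsp_submitted > 0)
		elsp_write(r0, r1);	/* case 1: hung context only */
}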

Does that make any sense?

Thanks,
Tomas


   Lots of tests for concurrent engine utilisation, multiple contexts,
   etc and ensuring that a hang in one does not affect independent work
   (engines, batches, contexts).

3. Watchdog

   A new fast hangcheck source. So concurrency, promotion (relative
   scoring between watchdog / hangcheck) and issue with not programming
   the ring correctly (beware the interrupt after performing a dispatch
   and programming the tail, needs either reserved space, inlining into
   the dispatch etc).

   The only thing of remark here is the uapi. It is a server feature
   (interactive clients are less likely to tolerate data being thrown
   away). Do we want an execbuf bit or context param? An execbuf bit
   allows switching between two watchdog timers (or on/off), but we
   have many more context params available than bits. Do we want to
   expose context params to set the fast/slow timeouts?

   We haven't found any GL spec that describes controllable watchdogs, so
   the ultimate uapi requirements are unknown. My feeling is that we want
   to do the initial uapi through context params, and start with a single
   fast watchdog timeout value. This is sufficient for testing and
   probably for most use cases. This can easily be extended by adding an
   execbuf bit to switch between two values and a context param to set
   the second value. (If the second value isn't set, the execbuf bit
   dosn't do anything the watchdog is always programmed to the user
   value. If the second value is set (maybe infinite), then the execbuf
   bit is used to select which timeout to use for that batch.) But given
   that this is a server-esque feature, it is likely to be a setting the
   userspace driver imposes upon its clients, and there is unlikely to be
   the need to switch timeouts within any one client.


This may or may not be true. We need to thrash out these details. As you
said, the requirements are quite fuzzy at this point. In the end a good
approach might be to get something in there in cooperation with ONE open
source user, let all the other users scream once it has landed, and then
extend the interface (without breaking ABI) to accommodate them. I've
tried to drum up enthusiasm for this new feature, but so far it's not
been overwhelming, so it's difficult to solve this chicken-and-egg
problem without proper input from userland users.

If someone writes something in stone and tells me to implement exactly
that, then I'll do it, but so far there has been no convincing argument
pointing to any particular design aside from the choice of default
timeout, which was actually decided in collaboration with various
userland groups and was not something we just made up by ourselves.
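
For concreteness, the context-param route could look roughly like the
sketch below from userspace. This is purely illustrative:
I915_CONTEXT_PARAM_WATCHDOG is a made-up name and the value encoding
(timeout in microseconds, 0 meaning disabled) is an assumption, not a
proposed ABI:

#include <stdint.h>
#include <xf86drm.h>
#include <i915_drm.h>

#ifndef I915_CONTEXT_PARAM_WATCHDOG
#define I915_CONTEXT_PARAM_WATCHDOG 0x42	/* hypothetical value */
#endif

static int set_watchdog_us(int fd, uint32_t ctx_id, uint64_t timeout_us)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_WATCHDOG,
		.value = timeout_us,	/* assumed: 0 == watchdog disabled */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}

An execbuf bit to switch between two such values could then be layered
on top later without breaking this interface.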


   The tests here would focus on the uapi and ensuring that if the client
   asks for a 60ms hang detection, then all work packets take at most
   60ms to complete. Hmm, add wait_ioctl to the list of uapi that should
   report -EIO.

  drivers/gpu/drm/i915/i915_debugfs.c     |  146 +++++-
  drivers/gpu/drm/i915/i915_drv.c         |  201 ++++++++
  drivers/gpu/drm/i915/intel_lrc.c        |  858 ++++++++++++++++++++++++++++++-

The balance here feels wrong ;-)

Once we get gen7 support in there, after this RFC, I can assure you that
it will even out with regard to intel_ringbuffer.c and other files. The
gen-agnostic TDR framework does focus a lot on i915_drv.c and
i915_irq.c; intel_lrc.c is heavy because our principal implementation
targets gen8 in execlist mode, which is localized in intel_lrc.c.

But, yeah, I get what you're saying ;). Just stating for the record.

Thanks,
Tomas

-Chris


_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx




