On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> This patch series introduces the following features:
>
> * Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> * Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
> * Feature 3. Context Submission Status Consistency checking

The high level design is reasonable and conceptually extends the current
system in fairly obvious ways.

In terms of discussing the implementation, I think this can be phased as:

0. Move to a per-engine hangcheck

1. Add fake-interrupt recovery for CSSC

I think this can be done without changing the hangcheck heuristics at all -
we just need to think carefully about recovery (which is a nice precursor
to per-engine reset). I may be wrong, and so would like to be told so early
on! If the fake-interrupt recovery is done as part of the reset handler, we
should have (one) fewer concurrency issues to worry about.

(In a similar vein, I think we should move the missed-interrupt handler for
legacy out of hangcheck, partly to simplify some very confusing code and
partly so that we have fewer deviations between the legacy/execlists paths.)

It also gets us thinking about the consistency detection and when it is
viable to do a fake interrupt and when we must do a full reset (for example,
we only want to fake an interrupt if the consistency check says the GPU is
idle, and we definitely want to reset everything if the GPU is executing an
alien context).

A test here would be to suspend the execlists irq and wait for the recovery.
Cheekily we could punch the irq eir by root mmio and check that hangcheck
automagically recovers.

Whilst it would be nice to add the watchdog next, since it is conceptually
quite simple and basically just a new hangcheck source with fancy setup, a
fast hangcheck without soft reset makes for an easy DoS.

2. TDR

Given that we have a consistency check and have begun to extend the reset
path, we can implement a soft reset that only skips the hung request. (The
devil is in the details; whilst the design here looked solid, I think the
LRC recovery code could be simplified - I didn't feel another requeue path
was required given that we need only pretend a request completed (fixing up
the context image as required) and then use the normal unqueue.) There is
also quite a bit of LRC cleanup on the lists which would be useful here.

Lots of tests for concurrent engine utilisation, multiple contexts, etc.,
ensuring that a hang in one does not affect independent work (engines,
batches, contexts).

3. Watchdog

A new fast hangcheck source. So concurrency, promotion (relative scoring
between watchdog / hangcheck) and issues with not programming the ring
correctly (beware the interrupt arriving after performing a dispatch and
programming the tail; this needs either reserved space, inlining into the
dispatch, etc.).

The only thing of remark here is the uapi. It is a server feature
(interactive clients are less likely to tolerate data being thrown away).
Do we want an execbuf bit or a context param? An execbuf bit allows
switching between two watchdog timers (or on/off), but we have many more
context params available than bits. Do we want to expose context params to
set the fast/slow timeouts?

We haven't found any GL spec that describes controllable watchdogs, so the
ultimate uapi requirements are unknown. My feeling is that we want to do
the initial uapi through context params, and start with a single fast
watchdog timeout value. This is sufficient for testing and probably for
most use cases.
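On the client side that would amount to a single setparam call, roughly
along these lines (untested sketch only; I915_CONTEXT_PARAM_WATCHDOG is a
name and value I am making up for illustration and is not in i915_drm.h,
only drm_i915_gem_context_param and the setparam ioctl exist today, and the
timeout units are still to be decided):

  #include <stdint.h>
  #include <string.h>
  #include <xf86drm.h>
  #include <i915_drm.h>

  /* Hypothetical param id - not in i915_drm.h, value chosen arbitrarily. */
  #define I915_CONTEXT_PARAM_WATCHDOG 0x7

  /* Request a per-context watchdog of timeout_us microseconds; batches
   * submitted with ctx_id that run longer than this would be cancelled,
   * with the client seeing -EIO on its waits.
   */
  static int set_watchdog(int fd, uint32_t ctx_id, uint64_t timeout_us)
  {
          struct drm_i915_gem_context_param p;

          memset(&p, 0, sizeof(p));
          p.ctx_id = ctx_id;
          p.param = I915_CONTEXT_PARAM_WATCHDOG;
          p.value = timeout_us;

          return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
  }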
This can easily be extended by adding an execbuf bit to switch between two
values and a context param to set the second value. (If the second value
isn't set, the execbuf bit doesn't do anything; the watchdog is always
programmed to the user value. If the second value is set (maybe infinite),
then the execbuf bit is used to select which timeout to use for that batch.)
But given that this is a server-esque feature, it is likely to be a setting
the userspace driver imposes upon its clients, and there is unlikely to be
the need to switch timeouts within any one client.

The tests here would focus on the uapi and on ensuring that if the client
asks for 60ms hang detection, then all work packets take at most 60ms to
complete. Hmm, add wait_ioctl to the list of uapi that should report -EIO.

>  drivers/gpu/drm/i915/i915_debugfs.c |  146 +++++-
>  drivers/gpu/drm/i915/i915_drv.c     |  201 ++++++++
>  drivers/gpu/drm/i915/intel_lrc.c    |  858 ++++++++++++++++++++++++++++++-

The balance here feels wrong ;-)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre