On Wed, Apr 19, 2017 at 10:11:37AM -0700, Michel Thierry wrote:
>
>
> On 19/04/17 03:20, Chris Wilson wrote:
> >On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
> >>*** General ***
> >>
> >>Watchdog timeout (or "media engine reset") is a feature that allows
> >>userland applications to enable hang detection on individual batch buffers.
> >>The detection mechanism itself is mostly bound to the hardware and the only
> >>thing that the driver needs to do to support this form of hang detection
> >>is to implement the interrupt handling support as well as watchdog command
> >>emission before and after the emitted batch buffer start instruction in the
> >>ring buffer.
> >>
> >>The principle of the hang detection mechanism is as follows:
> >>
> >>1. Once the decision has been made to enable watchdog timeout for a
> >>particular batch buffer and the driver is in the process of emitting the
> >>batch buffer start instruction into the ring buffer it also emits a
> >>watchdog timer start instruction before and a watchdog timer cancellation
> >>instruction after the batch buffer start instruction in the ring buffer.
> >>
> >>2. Once the GPU execution reaches the watchdog timer start instruction
> >>the hardware watchdog counter is started by the hardware. The counter
> >>keeps counting until either reaching a previously configured threshold
> >>value or the timer cancellation instruction is executed.
> >>
> >>2a. If the counter reaches the threshold value the hardware fires a
> >>watchdog interrupt that is picked up by the watchdog interrupt handler.
> >>This means that a hang has been detected and the driver needs to deal with
> >>it the same way it would deal with an engine hang detected by the periodic
> >>hang checker. The only difference between the two is that we already blamed
> >>the active request (to ensure an engine reset).
> >>
> >>2b. If the batch buffer completes and the execution reaches the watchdog
> >>cancellation instruction before the watchdog counter reaches its
> >>threshold value the watchdog is cancelled and nothing more comes of it.
> >>No hang is detected.
> >>
> >>Note about future interaction with preemption: Preemption could happen
> >>in a command sequence prior to watchdog counter getting disabled,
> >>resulting in watchdog being triggered following preemption. The driver will
> >>need to explicitly disable the watchdog counter as part of the
> >>preemption sequence.
> >
> >Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
> >preemption on for the user buffers as it just causes hassle if we allow
> >preemption in our preamble + breadcrumb. (And there's little point in
> >preempting in the flushes.)
> >
>
> Mid-batch?
> The watchdog counter is not aware of MI_ARB_ON_OFF (or any other
> cmd) and would keep running / expire. We could call
> emit_stop_watchdog unconditionally to prevent this.

No, I was thinking of the opposite case, where we had preemption after
the batch. I completely missed the point that a watchdog enabled for
the low priority batch would then be inherited by the high priority
batch - and, vice versa, that the watchdog counter would not be
restored on the context switch back.

That does suggest that the watchdog should really be part of the
context image...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
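
For reference, steps 1-2b and the preemption problem discussed above can be
modelled with a toy simulation. This is only a sketch under loud assumptions:
the command names (WATCHDOG_START, BATCH_START, WATCHDOG_CANCEL), the
emit_request() and run() helpers, and the "ticks" time unit are all
hypothetical and do not correspond to real i915 functions or hardware
opcodes. The model also assumes the counter simply keeps its value across a
preemption, which is the inheritance behaviour described in the thread.

```python
from enum import Enum, auto


class Cmd(Enum):
    # Hypothetical command names, not real hardware opcodes.
    WATCHDOG_START = auto()
    BATCH_START = auto()
    WATCHDOG_CANCEL = auto()


def emit_request(ring, batch_ticks, use_watchdog):
    """Step 1: bracket the batch start with watchdog start/cancel commands."""
    if use_watchdog:
        ring.append((Cmd.WATCHDOG_START, 0))
    ring.append((Cmd.BATCH_START, batch_ticks))
    if use_watchdog:
        ring.append((Cmd.WATCHDOG_CANCEL, 0))


def run(ring, threshold):
    """Execute the ring; return True if the watchdog expires (case 2a),
    False if every armed batch finishes first (case 2b)."""
    counting = False
    counter = 0
    for cmd, ticks in ring:
        if cmd is Cmd.WATCHDOG_START:
            counting, counter = True, 0      # step 2: counter starts
        elif cmd is Cmd.WATCHDOG_CANCEL:
            counting = False                 # case 2b: cancelled in time
        elif cmd is Cmd.BATCH_START and counting:
            counter += ticks                 # batch time accrues to the counter
            if counter >= threshold:
                return True                  # case 2a: watchdog interrupt
    return False


# A well-behaved batch and a hung batch, both with the watchdog enabled.
ok_ring, hung_ring = [], []
emit_request(ok_ring, batch_ticks=5, use_watchdog=True)
emit_request(hung_ring, batch_ticks=100, use_watchdog=True)

# Preemption: the low priority batch ran for 3 ticks with the watchdog armed,
# then a high priority batch (no watchdog requested) took over the engine.
inherited = [(Cmd.WATCHDOG_START, 0), (Cmd.BATCH_START, 3),
             (Cmd.BATCH_START, 10)]
# Same sequence with an explicit cancel emitted as part of preemption.
stopped = [(Cmd.WATCHDOG_START, 0), (Cmd.BATCH_START, 3),
           (Cmd.WATCHDOG_CANCEL, 0), (Cmd.BATCH_START, 10)]
```

With a threshold of 8 ticks, run(inherited, 8) reports an expiry even though
the high priority batch never asked for a watchdog, while run(stopped, 8)
does not. The model does not cover the other half of the problem, that the
counter is not restored when the low priority context is switched back in.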