On Tue, 2019-03-19 at 12:46 +0000, Tvrtko Ursulin wrote: > On 19/03/2019 12:39, Tvrtko Ursulin wrote: > > > > On 18/03/2019 00:15, Carlos Santa wrote: > > > On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote: > > > > On 08/03/2019 03:16, Carlos Santa wrote: > > > > > On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote: > > > > > > > > > > > > > > > > > > > Quoting Carlos Santa (2019-02-21 02:58:16) > > > > > > > +#define GEN8_WATCHDOG_1000US(dev_priv) > > > > > > > watchdog_to_clock_counts(dev_priv, 1000) > > > > > > > +static void gen8_watchdog_irq_handler(unsigned long > > > > > > > data) > > > > > > > +{ > > > > > > > + struct intel_engine_cs *engine = (struct > > > > > > > intel_engine_cs > > > > > > > *)data; > > > > > > > + struct drm_i915_private *dev_priv = engine->i915; > > > > > > > + unsigned int hung = 0; > > > > > > > + u32 current_seqno=0; > > > > > > > + char msg[80]; > > > > > > > + unsigned int tmp; > > > > > > > + int len; > > > > > > > + > > > > > > > + /* Stop the counter to prevent further timeout > > > > > > > interrupts > > > > > > > */ > > > > > > > + I915_WRITE_FW(RING_CNTR(engine->mmio_base), > > > > > > > get_watchdog_disable(engine)); > > > > > > > + > > > > > > > + /* Read the heartbeat seqno once again to check > > > > > > > if we > > > > > > > are > > > > > > > stuck? */ > > > > > > > + current_seqno = > > > > > > > intel_engine_get_hangcheck_seqno(engine); > > > > > > > > > > > > I have said this before, but this doesn't exist either, > > > > > > it's just > > > > > > a > > > > > > temporary glitch in the matrix. > > > > > > > > > > > > > > > > Chris, Tvrtko, I need some guidance on how to find the quilty > > > > > seqno > > > > > during a hang, can you please advice here what to do? > > > > > > > > When an interrupt fires you need to ascertain whether the same > > > > request > > > > which enabled the watchdog is running, correct? > > > > > > > > So I think you would need this, with a disclaimer that I > > > > haven't > > > > thought > > > > about the details really: > > > > > > > > 1. Take a reference to timeline hwsp when setting up the > > > > watchdog for > > > > a > > > > request. > > > > > > > > 2. Store the initial seqno associated with this request. > > > > > > > > 3. Force enable user interrupts. > > > > > > > > 4. When timeout fires, inspect the HWSP seqno to see if the > > > > request > > > > completed or not. > > > > > > > > 5. Reset the engine if not completed. > > > > > > > > 6. Put the timeline/hwsp reference. > > > > > > > > > static int gen8_emit_bb_start(struct i915_request *rq, > > > u64 offset, u32 > > > len, > > > const unsigned > > > int flags) > > > { > > > struct i915_timeline *tl; > > > u32 seqno; > > > > > > if (enable_watchdog) { > > > /* Start watchdog timer */ > > > cs = gen8_emit_start_watchdog(rq, cs); > > > tl = ce->ring->timeline; > > > i915_timeline_get_seqno(tl, rq, &seqno); > > > /*Store initial hwsp seqno associated with this request > > > engine->watchdog_hwsp_seqno = tl->hwsp_seqno; > > > > You should not need to allocate a new seqno and also having > > something > > stored per engine does not make clear how will you solve out of > > order. Understood, I missed that there's a convenience pointer available to us per request (i.e., *hwsp_seqno). On step #1 above you have said to take a reference to the timeline so I was trying to make a link between the timeline and the seqno but if the request comes already with a convenience pointer then we may not need the timeline after all... However, on v4 of the series I was using intel_engine_get_hangcheck_seqno(engine) for this purpose, and even though Chris was against it, I saw that it landed recently on the tree... > > > > Maybe you just set up the timer, then lets see below.. I think you're suggesting simply not to bother checking for the guilty seqno in the tasklet and simply reset... > > > > Also, are you not trying to do the software implementation to start > > with? Trying to keep it simple with just the h/w timers for now... adding front/back end to accomodate the s/w timers will just muddy the waters? Will get to it once we agree on what to do here... > > > > > } > > > > > > } > > > > > > static void gen8_watchdog_tasklet(unsigned long data) > > > { > > > struct i915_request *rq; > > > > > > rq = intel_engine_find_active_request(engine); > > > > > > /* Inspect the watchdog seqno once again for > > > completion? */ > > > if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq- > > > > fence.seqno)) { > > > > > > //Reset Engine > > > } > > > } > > > > What happens if you simply reset without checking anything? You > > know hw > > timer wouldn't have fired if the context wasn't running, correct? Need to verify this by running some tests then... > > > > (Ignoring the race condition between interrupt raised -> hw > > interrupt > > delivered -> serviced -> tasklet scheduled -> tasklet running. > > Which may > > mean request has completed in the meantime and you reset the engine > > for > > nothing. But this is probably not 100% solvable.) > > Good idea would be to write some tests to exercise some normal and > more > edge case scenarios like coalesced requests, preemption etc. > Checking > which request got reset etc. Ok, need to try some test cases then. Regards, Carlos > > Regards, > > Tvrtko > > > Regards, > > > > Tvrtko > > > > > Tvrtko, is the above acceptable to inspect whether the seqno has > > > completed? > > > > > > I noticed there's a helper function i915_request_completed(struct > > > i915_request *rq) but it will require me to modify it in order to > > > pass > > > 2 different seqnos. > > > > > > Regards, > > > Carlos > > > > > > > > > > > If the user interrupt fires with the request completed cancel > > > > the > > > > above > > > > operations. > > > > > > > > There could be an inherent race between inspecting the seqno > > > > and > > > > deciding to reset. Not sure at the moment what to do. Maybe > > > > just call > > > > it > > > > bad luck? > > > > > > > > I also think for the software implementation you need to force > > > > no > > > > request coalescing for contexts with timeout set. Because you > > > > want > > > > to > > > > have 100% defined borders for request in and out - since the > > > > timeout > > > > is > > > > defined per request. > > > > > > > > In this case you don't need the user interrupt for the trailing > > > > edge > > > > signal but can use context complete. Maybe putting hooks into > > > > context_in/out in intel_lrc.c would work under these > > > > circumstances. > > > > > > > > Also if preempted you need to cancel the timer setup and store > > > > elapsed > > > > execution time. > > > > > > > > Or it may make sense to just disable preemption for these > > > > contexts. > > > > Otherwise there is no point in trying to mandate the timeout? > > > > > > > > But it is also kind of bad since non-privileged contexts can > > > > make > > > > themselves non-preemptable by setting the watchdog timeout. > > > > > > > > Maybe as a compromise we need to automatically apply an > > > > elevated > > > > priority level, but not as high to be completely non- > > > > preemptable. > > > > Sounds > > > > like a hard question. > > > > > > > > Regards, > > > > > > > > Tvrtko > > > > > > _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx