Quoting Tvrtko Ursulin (2018-03-27 09:51:20) > > On 26/03/2018 12:50, Chris Wilson wrote: > > +static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer) > > +{ > > + struct intel_engine_execlists *execlists = > > + container_of(hrtimer, typeof(*execlists), preempt_timer); > > + > > + GEM_TRACE("%s\n", > > + container_of(execlists, > > + struct intel_engine_cs, > > + execlists)->name); > > + > > + queue_work(system_highpri_wq, &execlists->preempt_reset); > > I suppose indirection from hrtimer to worker is for better than jiffie > timeout granularity? But then queue_work might introduce some delay to > defeat that. Yes. It's betting on highpri_wq being just that. We can do better with our own RT kthread and a wakeup from the hrtimer if required. Hmm, the original plan (watchdog TDR) was to avoid using the global reset entirely. The per-engine reset (although needs serialising with itself) doesn't need struct_mutex, so we should be able to do that from the timer directly, and just kick off the global reset on failure. That sounds a whole lot better, let's dust off that code and see what breaks. > I am wondering if simply schedule_delayed_work directly wouldn't be good > enough. I suppose it is a question for the product group. But it is also > implementation detail. > > > + return HRTIMER_NORESTART; > > +} > > + > > +static void preempt_reset(struct work_struct *work) > > +{ > > + struct intel_engine_cs *engine = > > + container_of(work, typeof(*engine), execlists.preempt_reset); > > + > > + GEM_TRACE("%s\n", engine->name); > > + > > + tasklet_disable(&engine->execlists.tasklet); > > + > > + engine->execlists.tasklet.func(engine->execlists.tasklet.data); > > Comment on why calling the tasklet directly. /* ksoftirqd hates me; I hate ksoftirqd. */ > > + > > + if (execlists_is_active(&engine->execlists, EXECLISTS_ACTIVE_PREEMPT)) > > + i915_handle_error(engine->i915, BIT(engine->id), 0, > > + "preemption timed out on %s", engine->name); > > Can this race with the normal reset and we end up wit i915_handle_error > twice simultaneously? i915_handle_error() serialises itself against multiple calls to the same set of engines and against a global reset. There's a window for us to try and reset the same request twice (in quick succession) though. That's not good. Also I think we need some more hand holding here to provoke the right reset. Hmm. We may need to break i915_handle_error() down quite a bit. (Ok decided to try and not use i915_handle_error() here, at least not on the immediate path.) Hmm, going back to that discussion, we will have to dig through the details on when exactly hrtimer comes from softirq, or else we may deadlock with waiting on the tasklet. Hmm indeed. > > + tasklet_enable(&engine->execlists.tasklet); > > } > > > > static void complete_preempt_context(struct intel_engine_execlists *execlists) > > @@ -542,6 +583,10 @@ static void complete_preempt_context(struct intel_engine_execlists *execlists) > > execlists_cancel_port_requests(execlists); > > execlists_unwind_incomplete_requests(execlists); > > > > + /* If the timer already fired, complete the reset */ > > + if (hrtimer_try_to_cancel(&execlists->preempt_timer) < 0) > > + return; > > What about timer which had already fired and queued the worker? > hrtimer_try_to_cancel will return zero for that case I think. That's -1. 0 is the timer already fired and executed. 1 is pending. If the timer fired, executed its callback and then we execute, we clear the pending flag and proceed as normal (also when we get called from inside the worker after the hrtimer). The worker sees the cleared flag in that case and skips the reset. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx