Quoting Chris Wilson (2018-03-27 10:23:26) > Quoting Tvrtko Ursulin (2018-03-27 09:51:20) > > > > On 26/03/2018 12:50, Chris Wilson wrote: > > > +static enum hrtimer_restart preempt_timeout(struct hrtimer *hrtimer) > > > +{ > > > + struct intel_engine_execlists *execlists = > > > + container_of(hrtimer, typeof(*execlists), preempt_timer); > > > + > > > + GEM_TRACE("%s\n", > > > + container_of(execlists, > > > + struct intel_engine_cs, > > > + execlists)->name); > > > + > > > + queue_work(system_highpri_wq, &execlists->preempt_reset); > > > > I suppose indirection from hrtimer to worker is for better than jiffie > > timeout granularity? But then queue_work might introduce some delay to > > defeat that. > > Yes. It's betting on highpri_wq being just that. We can do better with > our own RT kthread and a wakeup from the hrtimer if required. > > Hmm, the original plan (watchdog TDR) was to avoid using the global > reset entirely. The per-engine reset (although needs serialising with > itself) doesn't need struct_mutex, so we should be able to do that from > the timer directly, and just kick off the global reset on failure. > > That sounds a whole lot better, let's dust off that code and see what > breaks. So I think something along the lines of +static int try_execlists_reset(struct intel_engine_execlists *execlists) +{ + struct intel_engine_cs *engine = + container_of(execlists, typeof(*engine), execlists); + int err = -EBUSY; + + if (!test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted) && + tasklet_trylock(&execlists->tasklet)) { + unsigned long *lock = &engine->i915->gpu_error.flags; + unsigned int bit = I915_RESET_ENGINE + engine->id; + + if (!test_and_set_bit(bit, lock)) { + tasklet_disable(&engine->execlists.tasklet); + err = i915_reset_engine(engine, + "preemption timeout"); + tasklet_enable(&engine->execlists.tasklet); + + clear_bit(bit, lock); + wake_up_bit(lock, bit); + } + + tasklet_unlock(&execlists->tasklet); + } + + return err; +} should do the trick, with a fallback to worker+i915_handle_error for the cases where we can't claim ensure softirq safety. There's still the issue of two resets in quick succession being treated as a failure. That's also an issue for the current per-engine failover to whole-device reset. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx