Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> writes:

> After staring hard at sequences like
>
> [ 28.199013] systemd-1 2..s. 26062228us : execlists_submission_tasklet: rcs0 cs-irq head=0 [0?], tail=1 [1?]
> [ 28.199095] systemd-1 2..s. 26062229us : execlists_submission_tasklet: rcs0 csb[1]: status=0x00000018:0x00000000, active=0x1
> [ 28.199177] systemd-1 2..s. 26062230us : execlists_submission_tasklet: rcs0 out[0]: ctx=0.1, seqno=3, prio=-1024
> [ 28.199258] systemd-1 2..s. 26062231us : execlists_submission_tasklet: rcs0 completed ctx=0
> [ 28.199340] gem_eio-829 1..s1 26066853us : execlists_submission_tasklet: rcs0 in[0]: ctx=1.1, seqno=1, prio=0
> [ 28.199421] <idle>-0 2..s. 26066863us : execlists_submission_tasklet: rcs0 cs-irq head=1 [1?], tail=2 [2?]
> [ 28.199503] <idle>-0 2..s. 26066865us : execlists_submission_tasklet: rcs0 csb[2]: status=0x00000001:0x00000000, active=0x1
> [ 28.199585] gem_eio-829 1..s1 26067077us : execlists_submission_tasklet: rcs0 in[1]: ctx=3.1, seqno=2, prio=0
> [ 28.199667] gem_eio-829 1..s1 26067078us : execlists_submission_tasklet: rcs0 in[0]: ctx=1.2, seqno=1, prio=0
> [ 28.199749] <idle>-0 2..s. 26067084us : execlists_submission_tasklet: rcs0 cs-irq head=2 [2?], tail=3 [3?]
> [ 28.199830] <idle>-0 2..s. 26067085us : execlists_submission_tasklet: rcs0 csb[3]: status=0x00008002:0x00000001, active=0x1
> [ 28.199912] <idle>-0 2..s. 26067086us : execlists_submission_tasklet: rcs0 out[0]: ctx=1.2, seqno=1, prio=0
> [ 28.199994] gem_eio-829 2..s. 28246084us : execlists_submission_tasklet: rcs0 cs-irq head=3 [3?], tail=4 [4?]
> [ 28.200096] gem_eio-829 2..s. 28246088us : execlists_submission_tasklet: rcs0 csb[4]: status=0x00000014:0x00000001, active=0x5
> [ 28.200178] gem_eio-829 2..s. 28246089us : execlists_submission_tasklet: rcs0 out[0]: ctx=0.0, seqno=0, prio=0
> [ 28.200260] gem_eio-829 2..s. 28246127us : execlists_submission_tasklet: execlists_submission_tasklet:886 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
>
> the conclusion is that the only place where the ports are reset to zero
> is from engine->cancel_requests, called during i915_gem_set_wedged().
>
> The race is horrible as it results from calling set-wedged on active HW
> (the GPU reset failed) and as such we need to be careful as the HW state
> changes beneath us. Fortunately, it's the same scary conditions as
> affect normal reset, so we can reuse the same machinery to disable state
> tracking as we clobber it.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104945
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>
> Cc: Michel Thierry <michel.thierry@xxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/i915_gem.c  | 6 +++++-
>  drivers/gpu/drm/i915/intel_lrc.c | 5 +++++
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index c29b1a1cbe96..dcdcc09240b9 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3212,8 +3212,10 @@ void i915_gem_set_wedged(struct drm_i915_private *i915)
>  	 * rolling the global seqno forward (since this would complete requests
>  	 * for which we haven't set the fence error to EIO yet).
>  	 */
> -	for_each_engine(engine, i915, id)
> +	for_each_engine(engine, i915, id) {
> +		i915_gem_reset_prepare_engine(engine);
>  		engine->submit_request = nop_submit_request;
> +	}
>
>  	/*
>  	 * Make sure no one is running the old callback before we proceed with
> @@ -3255,6 +3257,8 @@ void i915_gem_set_wedged(struct drm_i915_private *i915)
>  		intel_engine_init_global_seqno(engine,
>  					       intel_engine_last_submit(engine));
>  		spin_unlock_irqrestore(&engine->timeline->lock, flags);
> +
> +		i915_gem_reset_finish_engine(engine);
>  	}
>
>  	wake_up_all(&i915->gpu_error.reset_queue);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 14288743909f..c1a3636e94fc 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -687,6 +687,8 @@ static void execlists_cancel_requests(struct intel_engine_cs *engine)
>  	struct rb_node *rb;
>  	unsigned long flags;
>
> +	GEM_TRACE("%s\n", engine->name);
> +
>  	spin_lock_irqsave(&engine->timeline->lock, flags);
>
>  	/* Cancel the requests on the HW and clear the ELSP tracker. */
> @@ -733,6 +735,9 @@ static void execlists_cancel_requests(struct intel_engine_cs *engine)
>  	 */
>  	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>
> +	/* Mark all CS interrupts as complete */
> +	execlists->active = 0;

With the follow-up patch to handle the other irq state manipulation
inside the timeline lock, although it feels a little like borrowing a
lock, I am content.

Reviewed-by: Mika Kuoppala <mika.kuoppala@xxxxxxxxxxxxxxx>

> +
>  	spin_unlock_irqrestore(&engine->timeline->lock, flags);
>  }
>
> --
> 2.16.2
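
For anyone decoding the trace above: each "csb[n]" line is one
(status, context-id) event pair that the submission tasklet pops from
the context-status buffer, and the GEM_BUG_ON that fires is a
cross-check between the context id the HW reports and the id the
driver recorded for the ELSP port. A simplified sketch of the shape
of that check follows; the constants, the wrap arithmetic and
process_csb_sketch() itself are illustrative stand-ins, not the exact
tasklet code:

/* Illustrative constants; the real definitions live in the driver */
#define CSB_ENTRIES		6	/* per-engine CSB ring size */
#define CTX_STATUS_COMPLETED	(1 << 4)

/*
 * Pop the (status, context-id) pairs the HW wrote into the CSB ring
 * and cross-check completions against the driver's port bookkeeping.
 * If set-wedged zeroes the ports while events are still arriving,
 * the two ids disagree and the GEM_BUG_ON above trips.
 */
static void process_csb_sketch(const u32 *buf, unsigned int head,
			       unsigned int tail,
			       const struct execlist_port *port)
{
	while (head != tail) {
		u32 status;

		head = (head + 1) % CSB_ENTRIES;
		status = buf[2 * head];		/* even slot: status */

		if (!(status & CTX_STATUS_COMPLETED))
			continue;

		/* odd slot: context id of the completed submission */
		GEM_BUG_ON(buf[2 * head + 1] != port->context_id);
	}
}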
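
And the shape of the fix itself, for readers without the tree handy:
the wedging path now reuses the per-engine reset bracketing so that
the tasklet cannot parse CSB events while the bookkeeping underneath
it is being clobbered. A condensed sketch using the names from the
patch (set_wedged_sketch() is a stand-in; error handling and the
unrelated wedging steps are elided):

/*
 * Condensed view of i915_gem_set_wedged() after this patch.  The
 * prepare/finish pair is the same machinery a normal GPU reset uses
 * to quiesce the execlists tasklet and interrupt tracking.
 */
static void set_wedged_sketch(struct drm_i915_private *i915)
{
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	for_each_engine(engine, i915, id) {
		/* Disable state tracking, exactly as for a real reset */
		i915_gem_reset_prepare_engine(engine);
		engine->submit_request = nop_submit_request;
	}

	/* ... wait until no one still runs the old callback ... */

	for_each_engine(engine, i915, id) {
		/* Now safe to clear the ELSP/port tracking */
		engine->cancel_requests(engine);

		/* Re-enable the tasklet once the state is consistent */
		i915_gem_reset_finish_engine(engine);
	}
}

The appealing property, and presumably why borrowing the reset
machinery is acceptable here, is that set-wedged on a still-active GPU
faces exactly the hazards a failed reset does, so no new
synchronization scheme needs inventing.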