Re: [PATCH] drm/i915: Disable per-engine reset for Broxton

Michel Thierry <michel.thierry@xxxxxxxxx> · Thu, 06 Jul 2017 16:55:13 -0007

On Thu, Jul 6, 2017 at 12:11 AM, Chris Wilson 
<chris@xxxxxxxxxxxxxxxxxx> wrote:
Quoting Michel Thierry (2017-07-06 02:24:26)
 On 04/07/17 09:09, Chris Wilson wrote:
 > Triggering a GPU reset for one engine affects another, notably
 > corrupting the context status buffer (CSB) effectively losing 
track of
 > inflight requests.
 >
 > Adding a few printks:
 > diff --git a/drivers/gpu/drm/i915/i915_drv.c 
b/drivers/gpu/drm/i915/i915_drv.c
 > index ad41836fa5e5..a969456bc0fa 100644
 > --- a/drivers/gpu/drm/i915/i915_drv.c
 > +++ b/drivers/gpu/drm/i915/i915_drv.c
 > @@ -1953,6 +1953,7 @@ int i915_reset_engine(struct 
intel_engine_cs *engine)
 >                 goto out;
 >         }
 >
 > +       pr_err("Resetting %s\n", engine->name);
 >         ret = intel_gpu_reset(engine->i915, 
intel_engine_flag(engine));
 >         if (ret) {
 >                 /* If we fail here, we expect to fallback to a 
global reset */
 > diff --git a/drivers/gpu/drm/i915/intel_lrc.c 
b/drivers/gpu/drm/i915/intel_lrc.c
 > index 716e5c9ea222..a72bc35d0870 100644
 > --- a/drivers/gpu/drm/i915/intel_lrc.c
 > +++ b/drivers/gpu/drm/i915/intel_lrc.c
 > @@ -355,6 +355,7 @@ static void execlists_submit_ports(struct 
intel_engine_cs *engine)
 >                                 
execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
 >                         port_set(&port[n], port_pack(rq, count));
 >                         desc = execlists_update_context(rq);
 > +                       pr_err("%s: in (rq=%x) ctx=%d\n", 
engine->name, rq->global_seqno, upper_32_bits(desc));
 >                         GEM_DEBUG_EXEC(port[n].context_id = 
upper_32_bits(desc));
 >                 } else {
 >                         GEM_BUG_ON(!n);
 > @@ -594,9 +595,23 @@ static void intel_lrc_irq_handler(unsigned 
long data)
 >                         if (!(status & 
GEN8_CTX_STATUS_COMPLETED_MASK))
 >                                 continue;
 >
 > +                       pr_err("%s: out CSB (%x head=%d, 
tail=%d), ctx=%d, rq=%d\n",
 > +                                       engine->name,
 > +                                       readl(csb_mmio),
 > +                                       head, tail,
 > +                                       readl(buf+2*head+1),
 > +                                       port->context_id);
 > +
 >                         /* Check the context/desc id for this 
event matches */
 > -                       GEM_DEBUG_BUG_ON(readl(buf + 2 * head + 
1) !=
 > -                                        port->context_id);
 > +                       if (readl(buf + 2 * head + 1) != 
port->context_id) {
 > +                               pr_err("%s: BUG CSB (%x head=%d, 
tail=%d), ctx=%d, rq=%d\n",
 > +                                               engine->name,
 > +                                               readl(csb_mmio),
 > +                                               head, tail,
 > +                                               
readl(buf+2*head+1),
 > +                                               port->context_id);
 > +                               BUG();
 > +                       }
 >
 >                         rq = port_unpack(port, &count);
 >                         GEM_BUG_ON(count == 0);
 >
 > Results in:
 >
 > [ 6423.006602] Resetting rcs0
 > [ 6423.009080] rcs0: in (rq=fffffe70) ctx=1
 > [ 6423.009216] rcs0: in (rq=fffffe6f) ctx=3
 > [ 6423.009542] rcs0: out CSB (2 head=1, tail=2), ctx=3, rq=3
 > [ 6423.009619] Resetting bcs0
 > [ 6423.009980] rcs0: BUG CSB (0 head=1, tail=2), ctx=0, rq=3
 >

 It took me a while, but  I was able to replicate this (with your 
 igt_reset_active_engines) a couple of times. I also captured the 
value 
 of the CSB events at that point and it looked like this.

Ah, that's a separate issue that definitely isn't limited to bxt. In 
the
bug on my machine the CSB is distinctly zeroed.

Oh, if you saw them being zeroed, then there's not much I can argue (do 
you remember if the CSB write pointer was reset to '7' too?).

Hopefully is only limited to bxt, so sadly 

Reviewed-by: Michel Thierry <michel.thierry@xxxxxxxxx>

 [   55.134393] Resetting rcs0
 [   55.134747] bcs0: BUG CSB (3 head=1, tail=2), ctx=10, rq=4
 [   55.134755]  bcs0: HWSP[16-17] Execlist CSB[0]:   0x00000018 _ 
0x0000000a
 [   55.134759]  bcs0: HWSP[18-19] Execlist CSB[1]:   0x00000012 _ 
0x0000000a
 [   55.134762]  bcs0: HWSP[20-21] Execlist CSB[2]:   0x00008002 _ 
0x00000004
 [   55.134765]  bcs0: HWSP[22-23] Execlist CSB[3]:   0x00000014 _ 
0x00000004
 [   55.134767]  bcs0: HWSP[24-25] Execlist CSB[4]:   0x00000018 _ 
0x0000000a
 [   55.134770]  bcs0: HWSP[26-27] Execlist CSB[5]: 0x00000001 _ 
0x00000000

 The problem is ctx 10 finished in CSB[0] (ctx_complete & 
 active_to_idle), but then somehow CSB[1] has the same ctx 10 with 
 'preempted' & ctx_complete.

 To make things worse, in CSB[2], the hw claims to have 
lite-restored ctx 4.

 So it seems, we could ignore events like CSB[1], i.e. preempted && 
 ctx_complete && !lite_restore.

I think this bug is fixed in the series I sent last week, this is just
the race in reset vs the tasklet.

I'll try with those patches.

Thanks,

-Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx