Re: [Intel-gfx] [PATCH] drm/i915/ringbuffer: Delay after EMIT_INVALIDATE for gen4/gen5

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Wed, 07 Nov 2018 15:12:18 +0000

Quoting Ville Syrjälä (2018-11-07 15:04:24)
> On Mon, Nov 05, 2018 at 09:43:05AM +0000, Chris Wilson wrote:
> > Exercising the gpu reloc path strenuously revealed an issue where the
> > updated relocations (from MI_STORE_DWORD_IMM) were not being observed
> > upon execution. After some experiments with adding pipecontrols (a lot
> > of pipecontrols (32) as gen4/5 do not have a bit to wait on earlier pipe
> > controls or even the current on), it was discovered that we merely
> > needed to delay the EMIT_INVALIDATE by several flushes. It is important
> > to note that it is the EMIT_INVALIDATE as opposed to the EMIT_FLUSH that
> > needs the delay as opposed to what one might first expect -- that the
> > delay is required for the TLB invalidation to take effect (one presumes
> > to purge any CS buffers) as opposed to a delay after flushing to ensure
> > the writes have landed before triggering invalidation.
> > 
> > Testcase: igt/gem_tiled_fence_blits
> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx
> > ---
> >  drivers/gpu/drm/i915/intel_ringbuffer.c | 38 +++++++++++++++++++++++--
> >  1 file changed, 36 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index b8a7a014d46d..87eebc13c0d8 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -91,6 +91,7 @@ static int
> >  gen4_render_ring_flush(struct i915_request *rq, u32 mode)
> >  {
> >       u32 cmd, *cs;
> > +     int i;
> >  
> >       /*
> >        * read/write caches:
> > @@ -127,12 +128,45 @@ gen4_render_ring_flush(struct i915_request *rq, u32 mode)
> >                       cmd |= MI_INVALIDATE_ISP;
> >       }
> >  
> > -     cs = intel_ring_begin(rq, 2);
> > +     i = 2;
> > +     if (mode & EMIT_INVALIDATE)
> > +             i += 20;
> > +
> > +     cs = intel_ring_begin(rq, i);
> >       if (IS_ERR(cs))
> >               return PTR_ERR(cs);
> >  
> >       *cs++ = cmd;
> > -     *cs++ = MI_NOOP;
> > +
> > +     /*
> > +      * A random delay to let the CS invalidate take effect? Without this
> > +      * delay, the GPU relocation path fails as the CS does not see
> > +      * the updated contents. Just as important, if we apply the flushes
> > +      * to the EMIT_FLUSH branch (i.e. immediately after the relocation
> > +      * write and before the invalidate on the next batch), the relocations
> > +      * still fail. This implies that is a delay following invalidation
> > +      * that is required to reset the caches as opposed to a delay to
> > +      * ensure the memory is written.
> > +      */
> > +     if (mode & EMIT_INVALIDATE) {
> > +             *cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
> > +             *cs++ = i915_ggtt_offset(rq->engine->scratch) |
> > +                     PIPE_CONTROL_GLOBAL_GTT;
> > +             *cs++ = 0;
> > +             *cs++ = 0;
> > +
> > +             for (i = 0; i < 12; i++)
> > +                     *cs++ = MI_FLUSH;
> > +
> > +             *cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
> > +             *cs++ = i915_ggtt_offset(rq->engine->scratch) |
> > +                     PIPE_CONTROL_GLOBAL_GTT;
> > +             *cs++ = 0;
> > +             *cs++ = 0;
> > +     }
> 
> This smells a lot like the snb a/b w/a, except there the spec says to
> use 8 STORE_DWORDS.

Yeah, the similarity wasn't lost, except that w/a is to cover the
coherency aspect of the writes not being flushed. This feels a bit
fishier in that the experiments indicate it's an issue on the
invalidation path as opposed to flushing the writes.

And the other w/a to use umpteen pipecontrols to get around the lack of
PIPE_CONTROL_FLUSH.

> I suppose the choice of a specific command isn't
> critical, and it's just a matter of stuffing the pipeline with something
> that's takes long enough to let the TLB invalidate finish?

Except the MI_FLUSH are more effective in fewer number than
PIPE_CONTROLs. Probably because each one translates to a few pipe
controls or something, quite heavy.

> Anyways, patch itself seems as reasonable as one might expect for an
> issue like this.

As nasty as one would expect.

For the record, we are not entirely out of danger. gem_exec_whisper
continues to indicate a problem, but one step at a time. (I haven't yet
found quite what's upsetting it yet, except if we do each batch
synchronously and verify each one, it's happy.)
-Chris