Quoting Ville Syrjälä (2018-11-07 15:04:24) > On Mon, Nov 05, 2018 at 09:43:05AM +0000, Chris Wilson wrote: > > Exercising the gpu reloc path strenuously revealed an issue where the > > updated relocations (from MI_STORE_DWORD_IMM) were not being observed > > upon execution. After some experiments with adding pipecontrols (a lot > > of pipecontrols (32) as gen4/5 do not have a bit to wait on earlier pipe > > controls or even the current on), it was discovered that we merely > > needed to delay the EMIT_INVALIDATE by several flushes. It is important > > to note that it is the EMIT_INVALIDATE as opposed to the EMIT_FLUSH that > > needs the delay as opposed to what one might first expect -- that the > > delay is required for the TLB invalidation to take effect (one presumes > > to purge any CS buffers) as opposed to a delay after flushing to ensure > > the writes have landed before triggering invalidation. > > > > Testcase: igt/gem_tiled_fence_blits > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Cc: stable@xxxxxxxxxxxxxxx > > --- > > drivers/gpu/drm/i915/intel_ringbuffer.c | 38 +++++++++++++++++++++++-- > > 1 file changed, 36 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c > > index b8a7a014d46d..87eebc13c0d8 100644 > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c > > @@ -91,6 +91,7 @@ static int > > gen4_render_ring_flush(struct i915_request *rq, u32 mode) > > { > > u32 cmd, *cs; > > + int i; > > > > /* > > * read/write caches: > > @@ -127,12 +128,45 @@ gen4_render_ring_flush(struct i915_request *rq, u32 mode) > > cmd |= MI_INVALIDATE_ISP; > > } > > > > - cs = intel_ring_begin(rq, 2); > > + i = 2; > > + if (mode & EMIT_INVALIDATE) > > + i += 20; > > + > > + cs = intel_ring_begin(rq, i); > > if (IS_ERR(cs)) > > return PTR_ERR(cs); > > > > *cs++ = cmd; > > - *cs++ = MI_NOOP; > > + > > + /* > > + * A random delay to let the CS invalidate take effect? Without this > > + * delay, the GPU relocation path fails as the CS does not see > > + * the updated contents. Just as important, if we apply the flushes > > + * to the EMIT_FLUSH branch (i.e. immediately after the relocation > > + * write and before the invalidate on the next batch), the relocations > > + * still fail. This implies that is a delay following invalidation > > + * that is required to reset the caches as opposed to a delay to > > + * ensure the memory is written. > > + */ > > + if (mode & EMIT_INVALIDATE) { > > + *cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE; > > + *cs++ = i915_ggtt_offset(rq->engine->scratch) | > > + PIPE_CONTROL_GLOBAL_GTT; > > + *cs++ = 0; > > + *cs++ = 0; > > + > > + for (i = 0; i < 12; i++) > > + *cs++ = MI_FLUSH; > > + > > + *cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE; > > + *cs++ = i915_ggtt_offset(rq->engine->scratch) | > > + PIPE_CONTROL_GLOBAL_GTT; > > + *cs++ = 0; > > + *cs++ = 0; > > + } > > This smells a lot like the snb a/b w/a, except there the spec says to > use 8 STORE_DWORDS. Yeah, the similarity wasn't lost, except that w/a is to cover the coherency aspect of the writes not being flushed. This feels a bit fishier in that the experiments indicate it's an issue on the invalidation path as opposed to flushing the writes. And the other w/a to use umpteen pipecontrols to get around the lack of PIPE_CONTROL_FLUSH. > I suppose the choice of a specific command isn't > critical, and it's just a matter of stuffing the pipeline with something > that's takes long enough to let the TLB invalidate finish? Except the MI_FLUSH are more effective in fewer number than PIPE_CONTROLs. Probably because each one translates to a few pipe controls or something, quite heavy. > Anyways, patch itself seems as reasonable as one might expect for an > issue like this. As nasty as one would expect. For the record, we are not entirely out of danger. gem_exec_whisper continues to indicate a problem, but one step at a time. (I haven't yet found quite what's upsetting it yet, except if we do each batch synchronously and verify each one, it's happy.) -Chris