On Wed, Jan 20, 2016 at 02:50:47PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
>
> In GuC mode LRC pinning lifetime depends exclusively on the
> request lifetime. Since that is terminated by the seqno update,
> that opens up a race condition between the GPU finishing writing
> out the context image and the driver unpinning the LRC.
>
> To extend the LRC lifetime we will employ a similar approach
> to what legacy ringbuffer submission does.
>
> We will start tracking the last submitted context per engine
> and keep it pinned until it is replaced by another one.
>
> Note that the driver unload path is a bit fragile and could
> benefit greatly from efforts to unify the legacy and exec
> list submission code paths.
>
> At the moment i915_gem_context_fini has special casing for the
> two which is potentially not needed, and also depends on
> i915_gem_cleanup_ringbuffer running before itself.
>
> v2:
>  * Move pinning into engine->emit_request and actually fix
>    the reference/unreference logic. (Chris Wilson)
>
>  * ring->dev can be NULL on driver unload so use a different
>    route towards it.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Issue: VIZ-4277
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Nick Hoath <nicholas.hoath@xxxxxxxxx>
> ---
> I cannot test this with GuC, but it passes BAT with execlists
> and some real-world smoke tests.
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 4 +++-
>  drivers/gpu/drm/i915/intel_lrc.c        | 9 ++++++++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index c25083c78ba7..0b419e165836 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -438,7 +438,9 @@ void i915_gem_context_fini(struct drm_device *dev)
>  	for (i = 0; i < I915_NUM_RINGS; i++) {
>  		struct intel_engine_cs *ring = &dev_priv->ring[i];
>

This is the nasty part where the GPU still has access to the backing pages
as we release them. A hard-to-hit module-unload vs active-GPU race for sure,
but it is something that we can prevent.

The context-fini vs engine-fini ordering is also apparently tied to the use
of intel_gpu_reset() here (as it clobbers the GPU without dropping the GEM
state, causing a foul-up when tearing down the engine). If we had actually
called i915_reset() (and so i915_gem_reset()) instead, we should expect
last_context to be NULL... Oops, Tvrtko, you need to inspect
i915_gem_context_reset() as well for resetting last_context back to NULL
after the GPU reset.

As for here, I would just pencil in a plan to replace this chunk entirely
with a call to i915_reset(), or a slightly trimmed-down version, and tidy
up the active GEM cleanup in the process.
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/intel-gfx