Quoting Tvrtko Ursulin (2020-02-26 16:41:03) > > On 25/02/2020 08:22, Chris Wilson wrote: > > Use the same engine_idle_release() routine for cleaning all old > > ctx->engine[] state, closing any potential races with concurrent execbuf > > submission. > > > > Closes: https://gitlab.freedesktop.org/drm/intel/issues/1241 > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > --- > > Reorder set-closed/engine_idle_release to avoid premature killing > > Take a reference to prevent racing context free with engine cleanup > > --- > > drivers/gpu/drm/i915/gem/i915_gem_context.c | 199 +++++++++++--------- > > drivers/gpu/drm/i915/gem/i915_gem_context.h | 1 - > > 2 files changed, 108 insertions(+), 92 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c > > index adcebf22a3d3..0862a77d81ed 100644 > > --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c > > @@ -243,7 +243,6 @@ static void __free_engines(struct i915_gem_engines *e, unsigned int count) > > if (!e->engines[count]) > > continue; > > > > - RCU_INIT_POINTER(e->engines[count]->gem_context, NULL); > > intel_context_put(e->engines[count]); > > } > > kfree(e); > > @@ -256,7 +255,11 @@ static void free_engines(struct i915_gem_engines *e) > > > > static void free_engines_rcu(struct rcu_head *rcu) > > { > > - free_engines(container_of(rcu, struct i915_gem_engines, rcu)); > > + struct i915_gem_engines *engines = > > + container_of(rcu, struct i915_gem_engines, rcu); > > + > > + i915_sw_fence_fini(&engines->fence); > > This was missing so far? Yes. Completely missed it until throwing it in a loop long enough for kmalloc recycling to catch up. And having ODEBUG enabled helps! > > +static int engines_notify(struct i915_sw_fence *fence, > > + enum i915_sw_fence_notify state) > > +{ > > + struct i915_gem_engines *engines = > > + container_of(fence, typeof(*engines), fence); > > + > > + switch (state) { > > + case FENCE_COMPLETE: > > + if (!list_empty(&engines->link)) { > > + struct i915_gem_context *ctx = engines->ctx; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&ctx->stale.lock, flags); > > + list_del(&engines->link); > > + spin_unlock_irqrestore(&ctx->stale.lock, flags); > > + } > > + break; > > + > > + case FENCE_FREE: > > + i915_gem_context_put(engines->ctx); > > This put can go under FENCE_COMPLETE? Yes. Either works, I thought it was more of a release operation. But if you would rather FENCE_FREE == just call_rcu(free_engines_rcu), I can see the elegance in that. > > + init_rcu_head(&engines->rcu); > > + call_rcu(&engines->rcu, free_engines_rcu); > > + break; > > + } > > + > > + return NOTIFY_DONE; > > +} > > + > > +static void engines_idle_release(struct i915_gem_context *ctx, > > + struct i915_gem_engines *engines) > > +{ > > + struct i915_gem_engines_iter it; > > + struct intel_context *ce; > > + > > + i915_sw_fence_init(&engines->fence, engines_notify); > > + INIT_LIST_HEAD(&engines->link); > > + > > + engines->ctx = i915_gem_context_get(ctx); > > + > > + for_each_gem_engine(ce, engines, it) { > > + int err = 0; > > + > > + RCU_INIT_POINTER(ce->gem_context, NULL); > > + > > + if (!ce->timeline) { /* XXX serialisation with execbuf? */ > > + intel_context_set_banned(ce); > > What is banned for? Banned is how we prevent further execution. The problem here is making sure we catch concurrent execbuf allocating/pinning the context. This does not and leaves a window in which between the !ce->timline and set_banned the other thread could see in with the hanging batch :| On the other hand, we don't want to mark the context as banned too early. So we unfortunately can't mark it unconditionally. > > + continue; > > + } > > + > > + mutex_lock(&ce->timeline->mutex); > > + if (!list_empty(&ce->timeline->requests)) { > > + struct i915_request *rq; > > + > > + rq = list_last_entry(&ce->timeline->requests, > > + typeof(*rq), > > + link); > > Why no more i915_active_fence_get? I was looking for something concrete with which we can serialise with execbuf, the timeline mutex is one and we can check for a late ban inside execbuf. But there's still the tiny window above. Hmm. Actually the ce->pin_mutex might work^Whelp for execbuf serialisation. Not by itself it won't though. But it should be able to close the !ce->timeline hole... -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx