Quoting Tvrtko Ursulin (2020-11-18 11:38:43)
>
> On 18/11/2020 11:24, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-18 11:05:24)
> >>
> >> On 17/11/2020 11:30, Chris Wilson wrote:
> >>> Since preempt-to-busy, we may unsubmit a request while it is still on
> >>> the HW and completes asynchronously. That means it may be retired and in
> >>> the process destroy the virtual engine (as the user has closed their
> >>> context), but that engine may still be holding onto the unsubmitted
> >>> completed request. Therefore we need to potentially clean up the old
> >>> request on destroying the virtual engine. We also have to keep the
> >>> virtual_engine alive until after the siblings' execlists_dequeue() have
> >>> finished peeking into the virtual engines, for which we serialise with
> >>> RCU.
> >>>
> >>> v2: Be paranoid and flush the tasklet as well.
> >>> v3: And flush the tasklet before the engines, as the tasklet may
> >>> re-attach an rb_node after our removal from the siblings.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/i915/gt/intel_lrc.c | 61 +++++++++++++++++++++++++----
> >>>  1 file changed, 54 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> index 17cb7060eb29..c11433884cf6 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> @@ -182,6 +182,7 @@
> >>>  struct virtual_engine {
> >>>  	struct intel_engine_cs base;
> >>>  	struct intel_context context;
> >>> +	struct rcu_work rcu;
> >>>  
> >>>  	/*
> >>>  	 * We allow only a single request through the virtual engine at a time
> >>> @@ -5470,44 +5471,90 @@ static struct list_head *virtual_queue(struct virtual_engine *ve)
> >>>  	return &ve->base.execlists.default_priolist.requests[0];
> >>>  }
> >>>  
> >>> -static void virtual_context_destroy(struct kref *kref)
> >>> +static void rcu_virtual_context_destroy(struct work_struct *wrk)
> >>>  {
> >>>  	struct virtual_engine *ve =
> >>> -		container_of(kref, typeof(*ve), context.ref);
> >>> +		container_of(wrk, typeof(*ve), rcu.work);
> >>>  	unsigned int n;
> >>>  
> >>> -	GEM_BUG_ON(!list_empty(virtual_queue(ve)));
> >>> -	GEM_BUG_ON(ve->request);
> >>>  	GEM_BUG_ON(ve->context.inflight);
> >>>  
> >>> +	/* Preempt-to-busy may leave a stale request behind. */
> >>> +	if (unlikely(ve->request)) {
> >>> +		struct i915_request *old;
> >>> +
> >>> +		spin_lock_irq(&ve->base.active.lock);
> >>> +
> >>> +		old = fetch_and_zero(&ve->request);
> >>> +		if (old) {
> >>> +			GEM_BUG_ON(!i915_request_completed(old));
> >>> +			__i915_request_submit(old);
> >>> +			i915_request_put(old);
> >>> +		}
> >>> +
> >>> +		spin_unlock_irq(&ve->base.active.lock);
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Flush the tasklet in case it is still running on another core.
> >>> +	 *
> >>> +	 * This needs to be done before we remove ourselves from the siblings'
> >>> +	 * rbtrees as in the case it is running in parallel, it may reinsert
> >>> +	 * the rb_node into a sibling.
> >>> +	 */
> >>> +	tasklet_kill(&ve->base.execlists.tasklet);
> >>
> >> Can it still be running after an RCU period?
> >
> > I think there is a window between checking to see if the request is
> > completed and kicking the tasklet that is not under the rcu lock,
> > giving an opportunity for the request to be retired and the barrier
> > flushed to drop the context references.
>
> From where would that check come?
The window of opportunity extends all the way from the
i915_request_completed() check during unsubmit until the virtual engine
tasklet is executed. We do not hold a reference to the virtual engine
for the tasklet, so in that window the request may be retired in the
background, and the virtual engine destroyed along with it.
-Chris
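
To make the lifetime rules concrete, here is a minimal sketch of the
rcu_work deferral pattern the patch builds on. The names
(example_engine, example_engine_release, example_engine_destroy) are
illustrative placeholders, not the i915 ones, and the rb-tree and
tasklet details are elided:

/*
 * Sketch of deferred destruction via rcu_work: the kref release does
 * not free the object directly, but queues the real teardown to run
 * in process context after an RCU grace period.
 */
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct example_engine {
	struct kref ref;
	struct rcu_work rcu;
	/* rb_nodes linking us into the siblings' trees, tasklet, etc. */
};

static void example_engine_destroy(struct work_struct *wrk)
{
	struct example_engine *ve =
		container_of(to_rcu_work(wrk), typeof(*ve), rcu);

	/*
	 * A full RCU grace period has elapsed since queue_rcu_work(),
	 * so every reader that found us under rcu_read_lock() (the
	 * siblings' execlists_dequeue() peeking) has finished. Our own
	 * tasklet is not an RCU reader, though, so it would still need
	 * an explicit tasklet_kill() before unlinking from the
	 * siblings and freeing.
	 */
	kfree(ve);
}

static void example_engine_release(struct kref *ref)
{
	struct example_engine *ve = container_of(ref, typeof(*ve), ref);

	/* Defer the real teardown past an RCU grace period. */
	INIT_RCU_WORK(&ve->rcu, example_engine_destroy);
	queue_rcu_work(system_wq, &ve->rcu);
}

The point of the split is that the kref release callback may fire in
atomic context and must not sleep, whereas tasklet_kill() can sleep;
deferring through queue_rcu_work() both waits out the RCU readers and
moves the heavyweight teardown into process context, which appears to
be exactly the role rcu_virtual_context_destroy() plays in the patch.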