Quoting Tvrtko Ursulin (2019-11-22 13:43:17)
>
> On 22/11/2019 13:09, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-11-22 13:01:56)
> >>
> >> On 22/11/2019 11:21, Chris Wilson wrote:
> >>> Before checking the current i915_active state for the asynchronous work
> >>> we submitted, flush any ongoing callback. This ensures that our sampling
> >>> is robust and does not sporadically fail due to bad timing as the work
> >>> is running on another cpu.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------
> >>>  1 file changed, 13 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>> index 3586af636304..939798338242 100644
> >>> --- a/drivers/gpu/drm/i915/gt/selftest_context.c
> >>> +++ b/drivers/gpu/drm/i915/gt/selftest_context.c
> >>> @@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce)
> >>>
> >>>  	mutex_lock(&tl->mutex);
> >>>  	do {
> >>> -		struct dma_fence *fence;
> >>> +		struct i915_request *rq;
> >>>  		long timeout;
> >>>
> >>> -		fence = i915_active_fence_get(&tl->last_request);
> >>> -		if (!fence)
> >>> +		if (list_empty(&tl->requests))
> >>>  			break;
> >>>
> >>> -		timeout = dma_fence_wait_timeout(fence, false, HZ / 10);
> >>> +		rq = list_last_entry(&tl->requests, typeof(*rq), link);
> >>> +		i915_request_get(rq);
> >>> +
> >>> +		timeout = i915_request_wait(rq, 0, HZ / 10);
> >>>  		if (timeout < 0)
> >>>  			err = timeout;
> >>>  		else
> >>> -			i915_request_retire_upto(to_request(fence));
> >>> +			i915_request_retire_upto(rq);
> >>>
> >>> -		dma_fence_put(fence);
> >>> +		spin_lock_irq(&rq->lock);
> >>> +		spin_unlock_irq(&rq->lock);
> >>
> >> Ugh.. this at least needs a comment.
> >>
> >> What is the actual race?
> >
> > We complete the sync before the interrupt handler's irq work has
> > finished executing the callback to mark the barrier as completed.
> > So when we look at whether the engine is parked or not, we see the state
> > before the request has finished processing and find it still active.
>
> You mean i915_active_fence_get does not really return the last request?
> How can that be? Or is the key just intel_engine_pm_flush?

The active ingredient in this patch should be using tl->requests instead
of tl->last_request, and then using intel_engine_pm_flush().

Looking at what was there before:

> >>> -		fence = i915_active_fence_get(&tl->last_request);

is definitely where I went "yikes, if we are skipping because
last_request is NULL, we had better make sure that the barrier callbacks
on the rq->fence.cb_list were executed", which means waiting until the
interrupt handler dropped rq->lock.

Looking at the code that is there now, retiring the known list of
requests (and then looping over the engine_park addition) should be
accurate.
-Chris
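
[Editor's sketch: the bare lock/unlock pair in the patch is a flush, not
data protection. Once the selftest can take rq->lock, any signaling
callback that was running under it has finished. Below is a minimal
userspace illustration of that idiom, using pthreads in place of the
kernel's spinlock and irq context; all names here (cb_lock, signaled,
engine_parked, irq_callback) are hypothetical stand-ins, not the i915
code itself.]

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;
	static atomic_bool signaled;   /* "request completed" flag */
	static bool engine_parked;     /* side effect performed by the callback */

	/* Stand-in for a barrier callback run under rq->lock by the irq handler. */
	static void *irq_callback(void *arg)
	{
		(void)arg;
		pthread_mutex_lock(&cb_lock);
		atomic_store(&signaled, true); /* observers may see "done" now... */
		usleep(1000);                  /* ...while we are still running */
		engine_parked = true;
		pthread_mutex_unlock(&cb_lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t cb;

		pthread_create(&cb, NULL, irq_callback, NULL);

		while (!atomic_load(&signaled))
			; /* wait for the "request" to be signaled */

		/*
		 * Flush the concurrent callback: once we can take the lock,
		 * any callback that was executing under it has finished, so
		 * sampling engine_parked below no longer races with it.
		 */
		pthread_mutex_lock(&cb_lock);
		pthread_mutex_unlock(&cb_lock);

		printf("engine parked: %s\n", engine_parked ? "yes" : "no");

		pthread_join(cb, NULL);
		return 0;
	}

The pthread_mutex_lock()/pthread_mutex_unlock() pair plays the role of
spin_lock_irq(&rq->lock); spin_unlock_irq(&rq->lock) in the patch: it
touches no shared data of its own, it only waits out a callback still
executing under the lock before the state it updates is sampled.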