Quoting Tvrtko Ursulin (2019-11-22 14:17:42) > > On 22/11/2019 13:55, Chris Wilson wrote: > > Quoting Tvrtko Ursulin (2019-11-22 13:43:17) > >> > >> On 22/11/2019 13:09, Chris Wilson wrote: > >>> Quoting Tvrtko Ursulin (2019-11-22 13:01:56) > >>>> > >>>> On 22/11/2019 11:21, Chris Wilson wrote: > >>>>> Before checking the current i915_active state for the asynchronous work > >>>>> we submitted, flush any ongoing callback. This ensures that our sampling > >>>>> is robust and does not sporadically fail due to bad timing as the work > >>>>> is running on another cpu. > >>>>> > >>>>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > >>>>> --- > >>>>> drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------ > >>>>> 1 file changed, 13 insertions(+), 6 deletions(-) > >>>>> > >>>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c > >>>>> index 3586af636304..939798338242 100644 > >>>>> --- a/drivers/gpu/drm/i915/gt/selftest_context.c > >>>>> +++ b/drivers/gpu/drm/i915/gt/selftest_context.c > >>>>> @@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce) > >>>>> > >>>>> mutex_lock(&tl->mutex); > >>>>> do { > >>>>> - struct dma_fence *fence; > >>>>> + struct i915_request *rq; > >>>>> long timeout; > >>>>> > >>>>> - fence = i915_active_fence_get(&tl->last_request); > >>>>> - if (!fence) > >>>>> + if (list_empty(&tl->requests)) > >>>>> break; > >>>>> > >>>>> - timeout = dma_fence_wait_timeout(fence, false, HZ / 10); > >>>>> + rq = list_last_entry(&tl->requests, typeof(*rq), link); > >>>>> + i915_request_get(rq); > >>>>> + > >>>>> + timeout = i915_request_wait(rq, 0, HZ / 10); > >>>>> if (timeout < 0) > >>>>> err = timeout; > >>>>> else > >>>>> - i915_request_retire_upto(to_request(fence)); > >>>>> + i915_request_retire_upto(rq); > >>>>> > >>>>> - dma_fence_put(fence); > >>>>> + spin_lock_irq(&rq->lock); > >>>>> + spin_unlock_irq(&rq->lock); > >>>> > >>>> Ugh.. this at least needs a comment. > >>>> > >>>> What is the actual race? > >>> > >>> We complete the sync before the interrupt handler's irq work has > >>> finished executing the callback to mark the barrier as completed. > >>> So when we look at whether the engine is parked or not, we see the state > >>> before the request has finished processing and find it still active. > >> > >> You mean i915_active_fence_get returns not really last request? How can > >> that be? Or the key is just intel_engine_pm_flush? > > > > The active ingredient in this patch should be using the tl->requests > > instead of tl->last_request and then using pm_flush() > > > > Looking at what was there before: > >>>>> - fence = i915_active_fence_get(&tl->last_request); > > > > is definitely where I went "yikes, if we are skipping because > > last_request is NULL, we had better make sure that the barrier callbacks > > on the rq->fence.cb_list were executed" which means waiting until the > > interrupt handler dropped rq->lock. > > > > Looking at the code that is there now, retiring the known list of > > requests (and then looping the engine_park addition) should be accurate. > > Hm I hate this very intimate knowledge of operation. And I am still not > following. :( Is some path bypassing adding to tl->last_request? No, the i915_request_retire_upto() here may add to tl->requests (and setting tl->last_request). But tl->last_request is unset by the interrupt callback (if there is one, and there may be on a whim). So if it is unset, we forget to loop to wait on the engine parking and so our check that the engine is parked is too early. > I see that the selftest does two things, first has i915_active_is_idle > which needs the context to be retired. Right. And context retirement needs an idle barrier, hence waiting for the kernel_context to provide the barrier on parking. This could be a wait for intel_engine_pm_is_idle instead, I choose to try and keep the wait explicit to the idle barrier, as the underlying goal here is to prove control over the idle barrier for the ordinary struct intel_context. > Second bit is intel_engine_pm_is_awake which alone would be handled by > intel_engine_gt_pm_flush. Right, so long as we sure the engine is in the process of parking (which it should be having sent the idle barrier observed above). > So the first i915_active_is_idle check is the problematic one I think. It's definitely the first one to trip over ;) -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx