Re: [PATCH 2/5] drm/i915/selftests: Flush the active callbacks

On 22/11/2019 13:55, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-11-22 13:43:17)

On 22/11/2019 13:09, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2019-11-22 13:01:56)

On 22/11/2019 11:21, Chris Wilson wrote:
Before checking the current i915_active state for the asynchronous work
we submitted, flush any ongoing callback. This ensures that our sampling
is robust and does not sporadically fail due to bad timing as the work
is running on another cpu.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
    drivers/gpu/drm/i915/gt/selftest_context.c | 19 +++++++++++++------
    1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c
index 3586af636304..939798338242 100644
--- a/drivers/gpu/drm/i915/gt/selftest_context.c
+++ b/drivers/gpu/drm/i915/gt/selftest_context.c
@@ -48,20 +48,25 @@ static int context_sync(struct intel_context *ce)
 	mutex_lock(&tl->mutex);
 	do {
-		struct dma_fence *fence;
+		struct i915_request *rq;
 		long timeout;
 
-		fence = i915_active_fence_get(&tl->last_request);
-		if (!fence)
+		if (list_empty(&tl->requests))
 			break;
 
-		timeout = dma_fence_wait_timeout(fence, false, HZ / 10);
+		rq = list_last_entry(&tl->requests, typeof(*rq), link);
+		i915_request_get(rq);
+
+		timeout = i915_request_wait(rq, 0, HZ / 10);
 		if (timeout < 0)
 			err = timeout;
 		else
-			i915_request_retire_upto(to_request(fence));
+			i915_request_retire_upto(rq);
 
-		dma_fence_put(fence);
+		spin_lock_irq(&rq->lock);
+		spin_unlock_irq(&rq->lock);

Ugh.. this at least needs a comment.

What is the actual race?

We complete the sync before the interrupt handler's irq work has
finished executing the callback to mark the barrier as completed.
So when we look at whether the engine is parked or not, we see the state
before the request has finished processing and find it still active.

You mean i915_active_fence_get does not really return the last request? How
can that be? Or is the key just intel_engine_pm_flush?

The active ingredient in this patch should be using tl->requests
instead of tl->last_request, and then using pm_flush().

Looking at what was there before:
-             fence = i915_active_fence_get(&tl->last_request);

is definitely where I went "yikes, if we are skipping because
last_request is NULL, we had better make sure that the barrier callbacks
on the rq->fence.cb_list have been executed", which means waiting until the
interrupt handler has dropped rq->lock.

Looking at the code that is there now, retiring the known list of
requests (and then looping the engine_park addition) should be accurate.

Hm, I hate this very intimate knowledge of the operation. And I am still not following. :( Is some path bypassing adding to tl->last_request?

I see that the selftest does two things: first it checks i915_active_is_idle, which needs the context to be retired.

The second bit is intel_engine_pm_is_awake, which alone would be handled by intel_engine_gt_pm_flush.

So the first check, i915_active_is_idle, is the problematic one, I think.

Regards,

Tvrtko






_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
