Infinite waits for completion of GPU activity have been observed in CI, mostly inside __i915_active_wait(), triggered by igt@gem_barrier_race or igt@perf@stress-open-close. Root cause analysis, based of ftrace dumps generated with a lot of extra trace_printk() calls added to the code, revealed loops of request dependencies being accidentally built, preventing the reqests from being processed, each waiting for completion of another one activity. When we substitute a new request for a last active one tracked on a timeline, we set up a dependency of our new request to wait on completion of current activity of that previous one. However, not all processing paths take care of keeping the old request still in memory until we use its attributes for setting up that await dependency. As a result, we can happen to use attributes of a new request that already reuses the memory previously allocated to the old one, already released, then, set up an await dependency on wrong request. Combined with perf specific kernel context remote requests added to user context timelines, unresolvable loops of dependencies can be built, leading do infinite waits. We obtain a pointer to the previous request to wait on while we substitute it with a pointer to our new request in an active tracker. In some processing paths we protect that old request with RCU from being freed before we use it, but in others, e.g. __i915_request_commit() -> __i915_request_add_to_timeline() -> __i915_request_ensure_ordering(), we don't. Moreover, memory of released i915 requests is mostly not freed but reused, and our RCU protection doesn't prevent that memory from being reused. Protect memory of released i915 requests from being reused before RCU grace period expires. Moreover, always protect previous active i915 requests from being released while still accessing their memory. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8211 Fixes: b1e3177bd1d8 ("drm/i915: Coordinate i915_active with its own mutex") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@xxxxxxxxxxxxxxx> Cc: <stable@xxxxxxxxxxxxxxx> # v5.5+ --- drivers/gpu/drm/i915/i915_request.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index 894068bb37b6f..7f14b6309db82 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -110,13 +110,11 @@ struct kmem_cache *i915_request_slab_cache(void) return slab_requests; } -static void i915_fence_release(struct dma_fence *fence) +static void i915_fence_release_rcu_cb(struct rcu_head *rcu) { + struct dma_fence *fence = container_of(rcu, typeof(*fence), rcu); struct i915_request *rq = to_request(fence); - GEM_BUG_ON(rq->guc_prio != GUC_PRIO_INIT && - rq->guc_prio != GUC_PRIO_FINI); - i915_request_free_capture_list(fetch_and_zero(&rq->capture_list)); if (rq->batch_res) { i915_vma_resource_put(rq->batch_res); @@ -174,6 +172,16 @@ static void i915_fence_release(struct dma_fence *fence) kmem_cache_free(slab_requests, rq); } +static void i915_fence_release(struct dma_fence *fence) +{ + struct i915_request *rq = to_request(fence); + + GEM_BUG_ON(rq->guc_prio != GUC_PRIO_INIT && + rq->guc_prio != GUC_PRIO_FINI); + + call_rcu(&fence->rcu, i915_fence_release_rcu_cb); +} + const struct dma_fence_ops i915_fence_ops = { .get_driver_name = i915_fence_get_driver_name, .get_timeline_name = i915_fence_get_timeline_name, @@ -1673,6 +1681,7 @@ __i915_request_ensure_ordering(struct i915_request *rq, GEM_BUG_ON(is_parallel_rq(rq)); + rcu_read_lock(); prev = to_request(__i915_active_fence_set(&timeline->last_request, &rq->fence)); @@ -1706,6 +1715,7 @@ __i915_request_ensure_ordering(struct i915_request *rq, &rq->dep, 0); } + rcu_read_unlock(); return prev; } -- 2.41.0