Hi Andy, Thanks for review. On Thursday, 2 March 2023 01:42:05 CET Andi Shyti wrote: > Hi Janusz, > > On Sat, Feb 25, 2023 at 11:12:18PM +0100, Janusz Krzysztofik wrote: > > Users reported oopses on list corruptions when using i915 perf with a > > number of concurrently running graphics applications. Root cause analysis > > pointed at an issue in barrier processing code -- a race among perf open / > > close replacing active barriers with perf requests on kernel context and > > concurrent barrier preallocate / acquire operations performed during user > > context first pin / last unpin. > > > > When adding a request to a composite tracker, we try to reuse an existing > > fence tracker, already allocated and registered with that composite. The > > tracker we obtain may already track another fence, may be an idle barrier, > > or an active barrier. > > > > If the tracker we get occurs a non-idle barrier then we try to delete that > > barrier from a list of barrier tasks it belongs to. However, while doing > > that we don't respect return value from a function that performs the > > barrier deletion. Should the deletion ever failed, we would end up > > reusing the tracker still registered as a barrier task. Since the same > > structure field is reused with both fence callback lists and barrier > > tasks list, list corruptions would likely occur. > > > > Barriers are now deleted from a barrier tasks list by temporarily removing > > the list content, traversing that content with skip over the node to be > > deleted, then populating the list back with the modified content. Should > > that intentionally racy concurrent deletion attempts be not serialized, > > one or more of those may fail because of the list being temporary empty. > > > > Related code that ignores the results of barrier deletion was initially > > introduced in v5.4 by commit d8af05ff38ae ("drm/i915: Allow sharing the > > idle-barrier from other kernel requests"). However, all users of the > > barrier deletion routine were apparently serialized at that time, then the > > issue didn't exhibit itself. Results of git bisect with help of a newly > > developed igt@gem_barrier_race@remote-request IGT test indicate that list > > corruptions might start to appear after commit 311770173fac ("drm/i915/gt: > > Schedule request retirement when timeline idles"), introduced in v5.5. > > > > Respect results of barrier deletion attempts -- mark the barrier as idle > > only if successfully deleted from the list. Then, before proceeding with > > setting our fence as the one currently tracked, make sure that the tracker > > we've got is not a non-idle barrier. If that check fails then don't use > > that tracker but go back and try to acquire a new, usable one. > > > > v2: no code changes, > > - blame commit 311770173fac ("drm/i915/gt: Schedule request retirement > > when timeline idles"), v5.5, not commit d8af05ff38ae ("drm/i915: Allow > > sharing the idle-barrier from other kernel requests"), v5.4, > > - reword commit description. > > That's a very good explanation and very much needed for such a > catch. Thanks! > > > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6333 > > Fixes: 311770173fac ("drm/i915/gt: Schedule request retirement when timeline idles") > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > Cc: stable@xxxxxxxxxxxxxxx # v5.5 > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@xxxxxxxxxxxxxxx> > > --- > > drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++----------- > > 1 file changed, 14 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/ i915_active.c > > index 7412abf166a8c..f9282b8c87c1c 100644 > > --- a/drivers/gpu/drm/i915/i915_active.c > > +++ b/drivers/gpu/drm/i915/i915_active.c > > @@ -422,12 +422,12 @@ replace_barrier(struct i915_active *ref, struct i915_active_fence *active) > > * we can use it to substitute for the pending idle-barrer > > * request that we want to emit on the kernel_context. > > */ > > - __active_del_barrier(ref, node_from_active(active)); > > - return true; > > + return __active_del_barrier(ref, node_from_active(active)); > > In general, I support the idea of always checking the return > value, even if we expect a certain outcome. In these cases, the > likely/unlikely macros can be helpful. OK. > Given this change, I > believe the patch deserves an ack. > > That being said, I was curious whether using an explicit lock > and a normal list of active barriers, rather than a lockless > list, could have solved the problem. It seems like using a > lockless list and iterating over it could be overkill, unless > there are specific scenarios where the lockless properties are > necessary. > > Of course, this may be something to consider in a future cleanup, > as it may be outside the scope of this particular patch. Yes, I think so. > > > } > > > > int i915_active_add_request(struct i915_active *ref, struct i915_request *rq) > > { > > + u64 idx = i915_request_timeline(rq)->fence_context; > > struct dma_fence *fence = &rq->fence; > > struct i915_active_fence *active; > > int err; > > @@ -437,16 +437,19 @@ int i915_active_add_request(struct i915_active *ref, struct i915_request *rq) > > if (err) > > return err; > > > > - active = active_instance(ref, i915_request_timeline(rq)- >fence_context); > > - if (!active) { > > - err = -ENOMEM; > > - goto out; > > - } > > + do { > > + active = active_instance(ref, idx); > > + if (!active) { > > + err = -ENOMEM; > > + goto out; > > + } > > + > > + if (replace_barrier(ref, active)) { > > + RCU_INIT_POINTER(active->fence, NULL); > > + atomic_dec(&ref->count); > > + } > > + } while (is_barrier(active)); > > unlikely()? OK, please expect v3 with this added. Thanks, Janusz > > It's worth noting that for each iteration, we are rebuilding the > list of barriers. Therefore, I believe it may be necessary to > perform a cleanup within the replace_barrier() function. > > Thanks, > Andi > > > > > - if (replace_barrier(ref, active)) { > > - RCU_INIT_POINTER(active->fence, NULL); > > - atomic_dec(&ref->count); > > - } > > if (!__i915_active_fence_set(active, fence)) > > __i915_active_acquire(ref); > > >