Quoting Chris Wilson (2020-07-14 19:30:39) > Quoting Bas Nieuwenhuizen (2020-07-14 19:17:21) > > On Tue, Jul 14, 2020 at 6:26 PM Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote: > > > > > > Quoting Bas Nieuwenhuizen (2020-07-14 16:41:02) > > > > Calltree: > > > > timeline_fence_release > > > > drm_sched_entity_wakeup > > > > dma_fence_signal_locked > > > > sync_timeline_signal > > > > sw_sync_ioctl > > > > > > > > Releasing the reference to the fence in the fence signal callback > > > > seems reasonable to me, so this patch avoids the locking issue in > > > > sw_sync. > > > > > > > > d3862e44daa7 ("dma-buf/sw-sync: Fix locking around sync_timeline lists") > > > > fixed the recursive locking issue but caused an use-after-free. Later > > > > d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free") > > > > fixed the use-after-free but reintroduced the recursive locking issue. > > > > > > > > In this attempt we avoid the use-after-free still because the release > > > > function still always locks, and outside of the locking region in the > > > > signal function we have properly refcounted references. > > > > > > > > We furthermore also avoid the recurive lock by making sure that either: > > > > > > > > 1) We have a properly refcounted reference, preventing the signal from > > > > triggering the release function inside the locked region. > > > > 2) The refcount was already zero, and hence nobody will be able to trigger > > > > the release function from the signal function. > > > > > > > > Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free") > > > > Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx> > > > > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > > > > Cc: Gustavo Padovan <gustavo@xxxxxxxxxxx> > > > > Cc: Christian König <christian.koenig@xxxxxxx> > > > > Cc: <stable@xxxxxxxxxxxxxxx> > > > > Signed-off-by: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx> > > > > --- > > > > drivers/dma-buf/sw_sync.c | 28 ++++++++++++++++++++-------- > > > > 1 file changed, 20 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c > > > > index 348b3a9170fa..30a482f75d56 100644 > > > > --- a/drivers/dma-buf/sw_sync.c > > > > +++ b/drivers/dma-buf/sw_sync.c > > > > @@ -192,9 +192,12 @@ static const struct dma_fence_ops timeline_fence_ops = { > > > > static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc) > > > > { > > > > struct sync_pt *pt, *next; > > > > + struct list_head ref_list; > > > > > > > > trace_sync_timeline(obj); > > > > > > > > + INIT_LIST_HEAD(&ref_list); > > > > + > > > > spin_lock_irq(&obj->lock); > > > > > > > > obj->value += inc; > > > > @@ -206,18 +209,27 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc) > > > > list_del_init(&pt->link); > > > > rb_erase(&pt->node, &obj->pt_tree); > > > > > > > > - /* > > > > - * A signal callback may release the last reference to this > > > > - * fence, causing it to be freed. That operation has to be > > > > - * last to avoid a use after free inside this loop, and must > > > > - * be after we remove the fence from the timeline in order to > > > > - * prevent deadlocking on timeline->lock inside > > > > - * timeline_fence_release(). > > > > - */ > > > > + /* We need to take a reference to avoid a release during > > > > + * signalling (which can cause a recursive lock of obj->lock). > > > > + * If refcount was already zero, another thread is already taking > > > > + * care of destructing the fence, so the signal cannot release > > > > + * it again and we hence will not have the recursive lock. */ > > > > > > /* > > > * Block commentary style: > > > * https://www.kernel.org/doc/html/latest/process/coding-style.html#commenting > > > */ > > > > > > > + if (dma_fence_get_rcu(&pt->base)) > > > > + list_add_tail(&pt->link, &ref_list); > > > > > > Ok. > > > > > > > + > > > > dma_fence_signal_locked(&pt->base); > > > > } > > > > > > > > spin_unlock_irq(&obj->lock); > > > > + > > > > + list_for_each_entry_safe(pt, next, &ref_list, link) { > > > > + /* This needs to be cleared before release, otherwise the > > > > + * timeline_fence_release function gets confused about also > > > > + * removing the fence from the pt_tree. */ > > > > + list_del_init(&pt->link); > > > > + > > > > + dma_fence_put(&pt->base); > > > > + } > > > > > > How serious is the problem of one fence callback freeing another pt? > > > > > > Following the pattern here > > > > > > spin_lock(&obj->lock); > > > list_for_each_entry_safe(pt, next, &obj->pt_list, link) { > > > if (!timeline_fence_signaled(&pt->base)) > > > break; > > > > > > if (!dma_fence_get_rcu(&pt->base)) > > > continue; /* too late! */ > > > > > > rb_erase(&pt->node, &obj->pt_tree); > > > list_move_tail(&pt->link, &ref_list); > > > } > > > spin_unlock(&obj->lock); > > > > > > list_for_each_entry_safe(pt, next, &ref_list, link) { > > > list_del_init(&pt->link); > > > dma_fence_signal(&pt->base); > > > > Question is what the scope should be. Using this method we only take a > > reference and avoid releasing the fences we're going to signal. > > However the lock is on the timeline and dma_fence_signal already takes > > the lock, so if the signal ends up releasing any of the other fences > > in the timeline (e.g. the fences that are later than the new value), > > then we are still going to have the recursive locking issue. > > > > One solution to that would be splitting the lock into 2 locks: 1 > > protecting pt_list + pt_tree and 1 protecting value & serving as the > > lock for the fences. Then we can only take the list lock in the > > release function and dma_fence_signal takes the value lock. However, > > then you still have issues like if a callback for fence A (when called > > it holds the lock of A's timeline) adds a callback to fence B of the > > same timeline (requires the fence lock) we still have a recursion > > locking issue. > > > > I'm not sure it is reasonable for the last use case to work, because I > > only see it working with per-fence locks, but per-fence locks are > > deadlock prone due to the order of taking links not really being > > fixed. > > With the explicit reference you introduce for the signaler, is it not > then safe to revert d3c6dd1fb30d i.e. conditionally take the spinlock > inside release? We just have to remember to acquire the ref before > removing the pt from the list. No, still only thinking about recursing down the same set of syncpt. -Chris _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel