I think the kernel has problems with Android fences which were slowly broken as they were de-staged:
1. They allowed for fence/timeline value/seqno to overflow/rollover and that would break only if delta between timeline and earlier unsignaled fence exceeded 2^31 or so. Android relies on that behavior.
It was also possible to clean-up timeline by incrementing it for instance by calling timeline_inc(0x7FFFFFFF) twice. This doesn't work anymore because if no one is actively waiting for fence, active_list is empty and timeline_inc does not signal fences anymore; just increments timeline value and allows it to rollover without any effect.
2. They did guarantee that when timeline was destroyed all fences on timeline were signaled (or error state). Thus, if timeline was destroyed, or if process that owned the timeline died, processes that depended on it would not hang. Android also relies on that behavior for cleanup when a process crashes for example.
3. It seems from some stack traces that I have seen that timeline_inc signals fence from inside spinlock, which can cause fence_array to call fence_array_release, which calls flush_work()
e.g.
<4>[ 140.762030] [<ffffffff858939fb>] dump_stack+0x4d/0x63
<4>[ 140.762035] [<ffffffff8568b593>] ___might_sleep+0x149/0x14e
<4>[ 140.762039] [<ffffffff8568b637>] __might_sleep+0x9f/0xa6
<4>[ 140.762045] [<ffffffff8567ef2a>] flush_work+0x39/0x19a
<4>[ 140.762049] [<ffffffff8568eb5c>] ? try_to_wake_up+0x20b/0x21b
<4>[ 140.762055] [<ffffffff85a54cea>] fence_array_release+0x2e/0x63
<4>[ 140.762058] [<ffffffff85a53a65>] fence_release+0x82/0x8e
<4>[ 140.762061] [<ffffffff85a54cba>] fence_put+0x15/0x17
<4>[ 140.762065] [<ffffffff85a54e08>] fence_array_cb_func+0x1f/0x39
<4>[ 140.762068] [<ffffffff85a53881>] fence_signal_locked+0x8e/0xa3
<4>[ 140.762072] [<ffffffff85a55cda>] sync_timeline_signal+0xcd/0x10a
<4>[ 140.762075] [<ffffffff85a5613b>] sw_sync_ioctl+0x159/0x17f
On Wed, Jun 28, 2017 at 2:00 PM, Sean Paul <seanpaul@xxxxxxxxxxxx> wrote:
Understood, thank you for clarifying. This still doesn't solve the issue of userspaceOn Wed, Jun 28, 2017 at 08:45:55PM +0100, Chris Wilson wrote:
> Quoting Sean Paul (2017-06-28 17:47:24)
> > On Wed, Jun 28, 2017 at 05:00:20PM +0100, Chris Wilson wrote:
> > > Quoting Sean Paul (2017-06-28 16:51:11)
> > > > Protect against long-running processes from overflowing the timeline
> > > > and creating fences that go back in time. While we're at it, avoid
> > > > overflowing while we're incrementing the timeline.
> > > >
> > > > Signed-off-by: Sean Paul <seanpaul@xxxxxxxxxxxx>
> > > > ---
> > > > drivers/dma-buf/sw_sync.c | 7 ++++++-
> > > > 1 file changed, 6 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
> > > > index 69c5ff36e2f9..40934619ed88 100644
> > > > --- a/drivers/dma-buf/sw_sync.c
> > > > +++ b/drivers/dma-buf/sw_sync.c
> > > > @@ -142,7 +142,7 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
> > > >
> > > > spin_lock_irqsave(&obj->child_list_lock, flags);
> > > >
> > > > - obj->value += inc;
> > > > + obj->value += min(inc, ~0x0U - obj->value);
> > >
> > > The timeline uses u32 seqno, so just obj->value += min(inc, INT_MAX);
> > >
> > Hi Chris,
> > Thanks for the review.
> >
> > I don't think that solves the same problem I was trying to solve. The issue is
> > that android userspace increments value by 0x7fffffff twice in order to ensure
> > all fences have signaled. This is causing value to overflow and is_signaled will
> > never be true. With your snippet, the possibility of overflow still exists.
> >
> > > Better of course would be to report the error,
> >
> > AFAIK, it's not an error to jump the timeline, perhaps just bad taste. Capping
> > value at UINT_MAX will ensure all fences are signaled, and the check below ensures
> > that fences can't be created beyond that (returning an error at that point in
> > time).
>
> UINT_MAX doesn't imply all fences will be signaled either, the timeline
> is supposed to wrap.
>
> The issue is timeline_fence_signaled() is using the wrong test, it
> should be return (int)(fence->seqno - parent->value) <= 0; If it helps
> extract a little helper from dma_fence_is_later().
jumping the timeline by INT_MAX multiple times. In that case, value will rollover and
even the new signaled() will fail to report.
Sean
> -Chris
--
Sean Paul, Software Engineer, Google / Chromium OS
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel