On 01/08/2019 12:22, Chris Wilson wrote:
Quoting Lionel Landwerlin (2019-08-01 10:01:44)
On 01/08/2019 11:08, Chris Wilson wrote:
Quoting Lionel Landwerlin (2019-08-01 08:43:24)
On 31/07/2019 23:03, Chris Wilson wrote:
Quoting Lionel Landwerlin (2019-07-31 15:07:33)
I think I have convinced myself that with the split between wait before,
signal after combined with the rule that seqno point along the syncobj
are monotonic, you should not be able to generate an AB-BA deadlock
between concurrent clients.
Except that we need to fix up drm_syncobj_add_point() to actually apply
that rule. Throwing an error and leaving the client stuck is less than
ideal; we can't even recover with a GPU reset, and as these are native
fences we don't employ a failsafe timer.
Unfortunately we're not the only user of add_point() and amdgpu doesn't
want it to fail.
It's dangerously exposing the deadlock from incorrect seqno ordering to
userspace. We should be able to hit that DRM_ERROR() in testing, and be
able to detect the out-of-sequence timeline.
The best I can come up with is to take the syncobj lock when signaling in
get_timeline_fence_array(), do the check there, and unlock in
__free_fence_array().
I think the intermediate step is a "safe" version of add_point that
doesn't leave the timeline broken. That still leaves a glitch visible to
userspace, but it should not deadlock.
Right, sounds doable. What happens in execbuf after add_point() fails?
We've already handed the request to the scheduler and potentially it
might be running already.
Right, at present we've already submitted the request, so the batch will
be run and the fence will be signaled. The failure to add it to the
timeline means that someone else already submitted a later fence, and
some third party is using that syncpt instead of ours for their
dependency. So that third party will be delayed, but so long as they kept
their dependencies in order it should just be a delay.
But should we return an error to userspace?
-Lionel
The problem comes into play if we go ahead and insert our fence into the
dma_fence_chain out of order, breaking all the monotonic guarantees and
flipping the order of the fences. (The equivalent to taking mutexes out
of order.)
The way I was looking at it was to reserve a placeholder syncpt before
building the request and allow replacing the placeholder afterwards.
That sounds fairly tricky to get right :(
Invasive, yeah, turns out one needs to start walking the fence chain
more carefully. The actual placeholder fence is pretty run of the mill
as far as dma-fence*.c goes, which is to say...
The fence stuff is already full of magic.
It's full of magic.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx