Quoting Lionel Landwerlin (2019-08-01 10:37:23) > On 01/08/2019 12:22, Chris Wilson wrote: > > Quoting Lionel Landwerlin (2019-08-01 10:01:44) > >> On 01/08/2019 11:08, Chris Wilson wrote: > >>> Quoting Lionel Landwerlin (2019-08-01 08:43:24) > >>>> On 31/07/2019 23:03, Chris Wilson wrote: > >>>>> Quoting Lionel Landwerlin (2019-07-31 15:07:33) > >>>>> I think I have convinced myself that with the split between wait before, > >>>>> signal after combined with the rule that seqno point along the syncobj > >>>>> are monotonic, you should not be able to generate an AB-BA deadlock > >>>>> between concurrent clients. > >>>>> > >>>>> Except that we need to fixup drm_syncobj_add_point() to actually apply > >>>>> that rule. Throwing an error and leaving the client stuck is less than > >>>>> ideal, we can't even recover with a GPU reset, and as they are native > >>>>> fences we don't employ a failsafe timer. > >>>> Unfortunately we're not the only user of add_point() and amdgpu doesn't > >>>> want it to fail. > >>> It's dangerously exposing the deadlock from incorrect seqno ordering to > >>> userspace. We should be able to hit that DRM_ERROR() in testing, and be > >>> able to detect the out-of-sequence timeline. > >>> > >>>> Best I can come up with is take the syncobj lock when signaling in > >>>> get_timeline_fence_array() do the check there, unlock in __free_fence_array. > >>> I think the intermediate step is a "safe" version of add_point that > >>> doesn't leave the timeline broken. That still leaves a glitch visible to > >>> userspace, but it should not deadlock. > >> > >> Right, sounds doable. What happens in execbuf after add_point() fails? > >> > >> We've already handed the request to the scheduler and potentially it > >> might be running already. > > Right, at present we've already submitted the request, so the batch will > > be run and fence will be signaled. The failure to add to it the > > timeline means that someone else already submitted a later fence and > > some third party is using that syncpt instead of ours for their > > dependency. So that third party will be delayed, but so long as they kept > > their dependencies in order it should just be a delay. > > > But should we return an error to userspace? No, the submission itself hasn't failed and the result of the batch and its fence will be true. It's a case where userspace lost a race with itself -- if there was a non-error warning flag, maybe. If you keep the language loose that a timeline semaphore wait will be after its syncpt has been signaled, but not necessarily before the next syncpt is signaled, unless explicitly ordered by the client. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx