On Thu, Aug 13, 2020 at 07:49:24AM +0100, Chris Wilson wrote:
> Quoting Jordan Crouse (2020-08-13 00:55:44)
> > This is an RFC because I'm still trying to grok the correct behavior.
> >
> > Consider a dma_fence_array created with two fences and signal_on_any is true.
> > A reference to dma_fence_array is taken for each waiting fence.
> >
> > When the client calls dma_fence_wait() only one of the fences is signaled.
> > The client returns successfully from the wait and puts its reference to
> > the array fence but the array fence still remains because of the remaining
> > un-signaled fence.
> >
> > Now consider that the unsignaled fence is signaled while the timeline is being
> > destroyed much later. The timeline destroy calls dma_fence_signal_locked(). The
> > following sequence occurs:
> >
> > 1) dma_fence_array_cb_func is called
> >
> > 2) array->num_pending is 0 (because it was set to 1 due to signal_on_any) so the
> > callback function calls dma_fence_put() instead of triggering the irq work
> >
> > 3) The array fence is released which in turn puts the lingering fence which is
> > then released
> >
> > 4) deadlock with the timeline
>
> It's the same recursive lock as we previously resolved in sw_sync.c by
> removing the locking from timeline_fence_release().

Ah, yep. I'm working on a not-quite-ready-for-primetime version of a Vulkan
timeline implementation for drm/msm and I was doing something similar to how
sw_sync used to work in the release function. Getting rid of the recursive lock
in the timeline seems a better solution than this.

Thanks for taking the time to respond.

Jordan

> -Chris

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
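
For anyone following along, here is a rough user-space sketch of the recursive lock described in steps 1-4 above. It is only a toy model under stated assumptions: every toy_* name is hypothetical, none of this is kernel API, the lock is a plain flag that reports a recursive acquire instead of spinning, and the array callback is collapsed into a single put of the last reference. The release_takes_lock switch loosely models the difference between taking the timeline lock in the fence release function and not taking it there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy, single-threaded stand-in for a non-recursive spinlock. */
struct toy_lock {
	bool held;
};

/* Returns false in the case where a real spinlock would spin forever. */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = false;
}

struct toy_timeline {
	struct toy_lock lock;
	int active_fences;        /* fences still linked on the timeline */
	bool release_takes_lock;  /* true: release re-takes the timeline lock */
	bool deadlocked;          /* set when a recursive acquire is detected */
};

struct toy_fence {
	struct toy_timeline *tl;
	int refcount;
	bool signaled;
};

/* Release path: runs when the last reference is dropped. */
static void toy_fence_release(struct toy_fence *f)
{
	struct toy_timeline *tl = f->tl;

	if (tl->release_takes_lock) {
		/* The signaling path already holds tl->lock: recursion. */
		if (!toy_lock_acquire(&tl->lock)) {
			tl->deadlocked = true;
			return;
		}
		tl->active_fences--;
		toy_lock_release(&tl->lock);
	}
	/* Otherwise the signaling path unlinked the fence under the lock. */
}

static void toy_fence_put(struct toy_fence *f)
{
	if (--f->refcount == 0)
		toy_fence_release(f);
}

/* Timeline teardown: signals the lingering fence with the lock held. */
static void toy_timeline_destroy(struct toy_timeline *tl, struct toy_fence *f)
{
	toy_lock_acquire(&tl->lock);
	f->signaled = true;
	if (!tl->release_takes_lock)
		tl->active_fences--;  /* unlink here, while the lock is held */
	toy_fence_put(f);             /* models the callback dropping the last ref */
	toy_lock_release(&tl->lock);
}

/* Runs one teardown and reports whether a recursive acquire was hit. */
bool toy_run(bool release_takes_lock)
{
	struct toy_timeline tl = {
		.active_fences = 1,
		.release_takes_lock = release_takes_lock,
	};
	struct toy_fence f = { .tl = &tl, .refcount = 1 };

	toy_timeline_destroy(&tl, &f);
	return tl.deadlocked;
}
```

In this model toy_run(true) reports the recursive acquire, while toy_run(false) completes cleanly because the release path never touches the timeline lock. Whether unlinking in the signaling path is the right shape for a real driver is a separate question that this sketch does not try to answer.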