On Fri, 12 Mar 2021 19:25:13 +0100
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> > So where does this leave us? Well, it depends on your submit model
> > and exactly how you handle pipeline barriers that sync between
> > engines. If you're taking option 3 above and doing two command
> > buffers for each VkCommandBuffer, then you probably want two
> > serialized timelines, one for each engine, and some mechanism to tell
> > the kernel driver "these two command buffers have to run in parallel"
> > so that your ping-pong works. If you're doing 1 or 2 above, I think
> > you probably still want two simple syncobjs, one for each engine. You
> > don't really have any need to go all that far back in history. All
> > you really need to describe is "command buffer X depends on previous
> > compute work" or "command buffer X depends on previous binning work".
>
> Okay, so this will effectively force in-order execution. Let's take your
> previous example and add 2 more jobs at the end that have no deps on
> previous commands:
>
>    vkBeginRenderPass() /* Writes to ImageA */
>    vkCmdDraw()
>    vkCmdDraw()
>    ...
>    vkEndRenderPass()
>    vkPipelineBarrier(imageA /* fragment -> compute */)
>    vkCmdDispatch() /* reads imageA, writes BufferB */
>    vkBeginRenderPass() /* Writes to ImageC */
>    vkCmdBindVertexBuffers(bufferB)
>    vkCmdDraw()
>    ...
>    vkEndRenderPass()
>    vkBeginRenderPass() /* Writes to ImageD */
>    vkCmdDraw()
>    ...
>    vkEndRenderPass()
>
> A: Vertex for the first draw on the compute engine
> B: Vertex for the second draw on the compute engine
> C: Fragment for the first draw on the binning engine; depends on A
> D: Fragment for the second draw on the binning engine; depends on B
> E: Compute on the compute engine; depends on C and D
> F: Vertex for the third draw on the compute engine; depends on E
> G: Fragment for the third draw on the binning engine; depends on F
> H: Vertex for the fourth draw on the compute engine
> I: Fragment for the fourth draw on the binning engine
>
> When we reach E, we might be waiting for D to finish before scheduling
> the job, and, because of the implicit serialization we have on the
> compute queue (F implicitly depends on E, and H on F), we can't schedule
> H either, which could, in theory, be started. I guess that's where the
> term "submission order" is a bit unclear to me. The action of starting a
> job sounds like execution order to me (the order you start jobs
> determines the execution order, since we only have one HW queue per job
> type). All implicit deps have been calculated when we queued the job to
> the SW queue, and I thought that would be enough to meet the submission
> order requirements, but I might be wrong.
>
> The PoC I have was trying to get rid of this explicit serialization on
> the compute and fragment queues by having one syncobj timeline
> (queue(<syncpoint>)) and synchronization points (Sx).
> S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps> #waitSemaphore sync point
> A: in-fences=<explicit_deps>, out-fences=<queue(1)>
> B: in-fences=<explicit_deps>, out-fences=<queue(2)>
> C: in-fences=<explicit_deps>, out-fences=<queue(3)> #implicit dep on A through the tiler context
> D: in-fences=<explicit_deps>, out-fences=<queue(4)> #implicit dep on B through the tiler context
> E: in-fences=<explicit_deps>, out-fences=<queue(5)> #implicit dep on D through imageA
> F: in-fences=<explicit_deps>, out-fences=<queue(6)> #implicit dep on E through bufferB
> G: in-fences=<explicit_deps>, out-fences=<queue(7)> #implicit dep on F through the tiler context
> H: in-fences=<explicit_deps>, out-fences=<queue(8)>
> I: in-fences=<explicit_deps>, out-fences=<queue(9)> #implicit dep on H through the tiler context
> S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence> #signalSemaphore/fence sync point
> # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
>
> With this solution H can be started before E if the compute slot
> is empty and E's implicit deps are not done. It's probably overkill,
> but I thought maximizing GPU utilization was important.

Nevermind, I forgot the drm scheduler was dequeuing jobs in order, so
2 syncobjs (one per queue type) is indeed the right approach.

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
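[Editor's note: the stall Boris describes on the A..I graph can be reproduced with a small simulation. This is a toy Python sketch, not the drm scheduler or the Panfrost driver; engine names, the one-tick job durations, and the artificially slow job D are all assumptions made purely to illustrate the two dequeue policies being compared.]

```python
from collections import deque

# Job graph from the email above: job -> (engine, dependencies).
JOBS = {
    "A": ("compute", set()),
    "B": ("compute", set()),
    "C": ("binning", {"A"}),
    "D": ("binning", {"B"}),
    "E": ("compute", {"C", "D"}),
    "F": ("compute", {"E"}),
    "G": ("binning", {"F"}),
    "H": ("compute", set()),
    "I": ("binning", {"H"}),
}

DURATION = {job: 1 for job in JOBS}
DURATION["D"] = 3  # assumption: pretend the second fragment job is slow

def run(in_order):
    """Tick-based simulation; returns {job: start_tick}.
    in_order=True: only the head of each engine's FIFO may start
    (in-order dequeue, as with one scheduler entity per engine).
    in_order=False: any queued job whose deps are done may start."""
    queues = {"compute": deque(), "binning": deque()}
    for job, (engine, _) in JOBS.items():
        queues[engine].append(job)
    running = {"compute": None, "binning": None}  # (job, finish_tick)
    done, start = set(), {}
    for tick in range(100):  # bounded; the graph is tiny
        for engine, r in running.items():
            if r and r[1] <= tick:        # retire finished job
                done.add(r[0])
                running[engine] = None
        for engine, q in queues.items():
            if running[engine] or not q:
                continue
            candidates = [q[0]] if in_order else list(q)
            for job in candidates:
                if JOBS[job][1] <= done:  # all deps retired?
                    q.remove(job)
                    running[engine] = (job, tick + DURATION[job])
                    start[job] = tick
                    break
        if not any(queues.values()) and not any(running.values()):
            break
    return start

strict = run(in_order=True)
relaxed = run(in_order=False)
print("in-order: H starts at tick", strict["H"], ", E at", strict["E"])
print("relaxed:  H starts at tick", relaxed["H"], ", E at", relaxed["E"])
```

With in-order dequeue, H sits behind E and F on the compute FIFO and only starts after E, even though H has no dependencies; with the relaxed policy, H starts while E is still blocked on D, which is exactly the utilization argument made for the timeline-based PoC.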
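[Editor's note: the queue(<syncpoint>) scheme above relies on timeline-style wait semantics: a wait on point N completes once the timeline has advanced to N or beyond, so vkQueueWaitIdle reduces to a wait on the last submitted point. A toy model of that semantic, in plain Python threading rather than the real drm_syncobj ioctls:]

```python
import threading

class Timeline:
    """Toy monotonic timeline: wait(p) returns True once signal(q) has
    been called for some q >= p (timeline-syncobj-style wait semantic)."""
    def __init__(self):
        self.value = 0
        self.cond = threading.Condition()

    def signal(self, point):
        with self.cond:
            self.value = max(self.value, point)  # timeline only advances
            self.cond.notify_all()

    def wait(self, point, timeout=None):
        with self.cond:
            return self.cond.wait_for(lambda: self.value >= point, timeout)

queue = Timeline()

# Jobs A..I signal points 1..9 as in the table above; a worker thread
# stands in for the GPU retiring the jobs.
def gpu():
    for point in range(1, 10):
        queue.signal(point)

t = threading.Thread(target=gpu)
t.start()
# vkQueueWaitIdle: wait on the last submitted point, here queue(9).
assert queue.wait(9, timeout=5)
t.join()
```

Note that nothing in this model orders the signal() calls themselves; the reason H could run before E in the PoC is that each job's prerequisites were carried as explicit in-fences, while the single timeline only served waiters like QueueWaitIdle and the S1 sync point.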