On 26.05.21 at 18:52, Daniel Vetter wrote:
[SNIP]
I can make the relatively simple use cases work, but it really feels
like in practice we'll end up with massive oversync in some fairly
complex use cases, and we'll regret not having had it from the start.
Plus, people will just rely on implicit sync for longer because it has
better (more parallel) semantics in some use cases.
Things fall apart in implicit sync if you have more than one logical
writer into the same buffer. A trivial example is two images in one
buffer, but you could also do funky stuff like interleaved/tiled
rendering with _independent_ consumers. If the consumers are not
independent, then you can again just stuff the two writer fences into
the exclusive slot with the new ioctl (they'll get merged without
additional overhead into one fence array fence).
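Roughly, the userspace side of that fan-in could look like the
following. This is only a minimal sketch: it assumes the sync_file
import ioctl from that proposal (DMA_BUF_IOCTL_IMPORT_SYNC_FILE /
struct dma_buf_import_sync_file) and skips all error handling.

#include <sys/ioctl.h>
#include <linux/sync_file.h>
#include <linux/dma-buf.h>

/* Merge two per-writer sync_file fds into a single fence fd; the
 * kernel backs the result with one fence array fence. */
static int merge_writer_fences(int fence_a, int fence_b)
{
	struct sync_merge_data merge = {
		.name = "writers",
		.fd2 = fence_b,
	};

	if (ioctl(fence_a, SYNC_IOC_MERGE, &merge))
		return -1;
	return merge.fence;
}

/* Import the merged fence as the dma-buf's exclusive (write) fence. */
static int set_exclusive_fence(int dmabuf_fd, int fence_fd)
{
	struct dma_buf_import_sync_file import = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = fence_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &import);
}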
And the fundamental thing is: This is just not possible with implicit
sync. There's only one fence slot (even if that resolves to an array
of fences for all the producers), so anytime you do multiple
independent things in the same buffer you either:
- must split the buffers so there's again a clear & unique handoff at
each stage of the pipeline
- or use explicit sync
Well, exactly that is the problem we had with amdgpu, and why we came
up with the special handling there.
And you don't even need two images in one buffer; special hardware
which handles multiple writers gracefully is sufficient. The simplest
example is a depth buffer, but we also have things like ordered append
for ring buffers.
So in your example, the options are:
- per-client buffers, which you then blend into a composite buffer;
that collapses the N implicit fences from the N buffers into a single
implicit fence for the libva conversion. This single buffer then also
allows you to again fan out to M libva encoders, or whatever it is
that you fancy
- explicit fencing: clients render into a single buffer with no
copying, and libva encodes from that single buffer (but again this
needs explicit fences or it all comes crashing down; see the sketch
after this list)
There's really no option C where you somehow do multiple implicitly
fenced things into a single buffer and expect it to work out in
parallel.
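For the explicit-fencing option the consumer-side handoff can be as
simple as waiting on a per-submission fence fd before touching the
shared buffer. A minimal sketch; it assumes the clients can already
export such a render-done fence (e.g. as a sync_file out-fence of
their submission) and leaves the libva side out entirely.

#include <errno.h>
#include <poll.h>

/* Block until the explicit fence behind fence_fd has signaled; a
 * sync_file fd polls readable once its fence is done. */
static int wait_render_done(int fence_fd)
{
	struct pollfd pfd = {
		.fd = fence_fd,
		.events = POLLIN,
	};
	int ret;

	do {
		ret = poll(&pfd, 1, -1);
	} while (ret < 0 && errno == EINTR);

	return ret > 0 ? 0 : -1;
}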
You could also fall back to a dummy submission, e.g. compose the image
with multiple engines in parallel and then make a single dummy
submission to collect all the shared fences into the single exclusive fence.
But this needs an extra IOCTL and unfortunately the stack above also
needs to know when to make that dummy submission.
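What that dummy submission (or a dedicated IOCTL) boils down to on the
kernel side is roughly the following. A sketch only, not the actual
amdgpu handling: helper names follow the dma_resv/dma_fence_array API
of that time, and locking and error paths are simplified.

#include <linux/dma-resv.h>
#include <linux/dma-fence-array.h>

/* Gather the shared (writer) fences of a reservation object and
 * republish them as one exclusive fence. */
static int collapse_shared_to_excl(struct dma_resv *resv)
{
	struct dma_fence **fences;
	struct dma_fence_array *array;
	unsigned int count;
	int r;

	r = dma_resv_get_fences_rcu(resv, NULL, &count, &fences);
	if (r)
		return r;
	if (!count)
		return 0;

	/* Takes ownership of the fence references and the array. */
	array = dma_fence_array_create(count, fences,
				       dma_fence_context_alloc(1), 1,
				       false);
	if (!array)
		return -ENOMEM;

	dma_resv_lock(resv, NULL);
	dma_resv_add_excl_fence(resv, &array->base);
	dma_resv_unlock(resv);

	dma_fence_put(&array->base);
	return 0;
}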
Christian.
-Daniel