On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@xxxxxxxxx> wrote:
> >
> > > Yes, we can't break anything because we don't want to complicate things
> > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > knowledge and then make better decisions.
> > >
> > > The idea we are considering is that we'll expose memory-based sync objects
> > > to userspace for read only, and the kernel or hw will strictly control the
> > > memory writes to those sync objects. The hole in that idea is that
> > > userspace can decide not to signal a job, so even if userspace can't
> > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > not to signal them, and then a future fence is born.
> > >
> >
> > This would actually be treated as a GPU hang caused by that context, so it
> > should be fine.
> This is practically what I proposed already, except you're not doing it
> with dma_fence. And on the memory fence side this also doesn't actually
> give you what you want for that compute model.
>
> This seems like a bit of a worst-of-both-worlds approach to me? Tons of
> work in the kernel to hide these not-dma_fence-but-almost objects, and
> still a pain to actually drive the hardware like it should be for compute
> or direct display.
>
> Also maybe I've missed it, but I didn't see any replies to my suggestion
> for how to fake the entire dma_fence stuff on top of new hw. Would be
> interesting to know what doesn't work there, instead of amd folks going
> off into internal again and then coming back with another rfc out of
> nowhere :-)

The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses glFinish to sync with GL.
The gfx use case needs new hardware logic to support implicit and explicit sync. Whenever we propose a solution, we usually tear it apart ourselves the next day.
Since we are talking about the next hw generation or the one after that, preemption should be better.
user queue = user-mapped ring buffer
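Roughly, a user queue could look like this; this is just an illustration, the struct and field names are made up and not a real interface:

#include <stdint.h>

/* A user queue: a ring buffer mapped into the process's address space,
 * written by userspace directly, with no submit ioctl in the fast path. */
struct user_queue {
	uint32_t *ring;			/* mapped command buffer */
	uint32_t  ring_size;		/* size in dwords */
	uint64_t  wptr;			/* write pointer advanced by userspace */
	volatile uint64_t *doorbell;	/* mapped doorbell; writing the new wptr
					 * here tells the hw to fetch commands */
};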
For implicit sync, we will only let userspace lock access to a buffer via a user queue: a wait command stalls the queue until the per-buffer sequence counter in memory is >= the number assigned by the kernel, and a later unlock command increments that counter in memory with an atomic_inc, regardless of the number assigned by the kernel.

The kernel counter and the counter in memory can get out of sync, and here is why that's OK. If a process increments the kernel counter but not the memory counter, that's its own problem, and it's treated the same as a GPU hang caused by that process. If a process increments the memory counter but not the kernel counter, the ">=" condition together with the atomic_inc guarantees that signaling n also satisfies every wait for values up to n, so the queue can never deadlock; extra increments only make waits pass early, which effectively disables synchronization. Disabling synchronization this way is similar to the process corrupting the buffer, which should be fine. Can you find any flaw in it? I can't find any.
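In pseudo-C, the lock/unlock pair would behave roughly like this; the names are hypothetical and the real commands would be packets executed by the hw, not CPU code:

#include <stdatomic.h>
#include <stdint.h>

struct buffer_sync {
	_Atomic uint64_t mem_seq;	/* per-buffer sequence counter in memory */
};

/* Lock: placed in the user queue before the buffer is accessed.
 * The queue stalls until the memory counter reaches the value
 * the kernel assigned for this submission. */
static void wait_cmd(struct buffer_sync *s, uint64_t kernel_assigned)
{
	while (atomic_load(&s->mem_seq) < kernel_assigned)
		; /* the hw would stall the queue instead of spinning */
}

/* Unlock: placed after the access. The counter is bumped
 * unconditionally, independent of the kernel-assigned number.
 * Since waits use ">=", a counter that runs ahead only makes
 * waits pass earlier; it can never cause a deadlock. */
static void signal_cmd(struct buffer_sync *s)
{
	atomic_fetch_add(&s->mem_seq, 1);
}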
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and the addresses of the wait commands in the user queue. It will assign a new sequence number to each buffer, write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
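In the same pseudo-C, the kernel-side implicit-sync submit would be something like this (again, invented names, just to show the flow):

#include <stdint.h>

struct implicit_sync_submit {
	uint64_t  *kernel_seq;		/* kernel-side counter of each listed buffer */
	uint64_t **wait_cmd_slot;	/* threshold slot of each wait command
					 * inside the mapped user queue */
	unsigned   count;
};

static void kernel_submit(struct implicit_sync_submit *s,
			  volatile uint64_t *doorbell, uint64_t new_wptr)
{
	unsigned i;

	for (i = 0; i < s->count; i++) {
		/* Assign a new sequence number to each buffer... */
		uint64_t seq = ++s->kernel_seq[i];

		/* ...and patch it into the wait command in the user queue. */
		*s->wait_cmd_slot[i] = seq;
	}

	/* Ring the hw doorbell to start execution of the queue. */
	*doorbell = new_wptr;
}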
Marek