Re: Fence, timeline and android sync points

But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function (qxl).
Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??). And nouveau
doesn't use wait queues at all, but rolls its own.
But when all those drivers need a special wait function, how can you still justify the common callback that runs when a fence is signaled?

If I understood it right, the use case for this was waiting for any fence out of a list of fences from multiple drivers, but if each driver needs special handling for its wait, how can that work reliably?

Christian.

On 14.08.2014 at 11:15, Maarten Lankhorst wrote:
On 13-08-14 at 19:07, Jerome Glisse wrote:
On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
Hi,

So I went over the whole fence and sync point stuff, as it's becoming a pressing
issue. I think we first need to agree on what problem we want to solve
and what the requirements to solve it would be.

Problem :
   Explicit synchronization between different hardware blocks over a buffer object.

Requirements :
   Share common infrastructure.
   Allow optimal hardware command stream scheduling across hardware blocks.
   Allow android sync point to be implemented on top of it.
   Handle/acknowledge exceptions (like the good old gpu lockup).
   Minimize driver changes.

Glossary :
   hardware timeline: timeline bound to a specific hardware block.
   pipeline timeline: timeline bound to a userspace rendering pipeline, each
                      point on that timeline can be a composite of several
                      different hardware timeline points.
   pipeline: abstract object representing a userspace application's graphics
             pipeline, i.e. the sequence of the application's graphics operations.
   fence: specific point in a timeline where synchronization needs to happen.


So now, the current include/linux/fence.h implementation is, I believe, missing the
objective by confusing hardware and pipeline timelines and by bolting fences to
buffer objects, while what is really needed is a true and proper timeline for both
hardware and pipeline. But before going further down that road let me look at
things and explain how I see them.
fences can be used free-standing and no one forces you to integrate them
with buffers. We actually plan to go this way with the intel svm stuff.
Ofc for dma-buf the plan is to synchronize using such fences, but that's
somewhat orthogonal I think. At least you only talk about fences and
timeline and not dma-buf here.
Current ttm fences have one sole purpose: allow synchronization for buffer
object moves, even though some drivers like radeon slightly abuse them and use them
for things like lockup detection.

The new fence wants to expose an api that would allow some implementation of a
timeline. For that it introduces callbacks and some hard requirements on what the
driver has to expose (sketched below):
   enable_signaling
   [signaled]
   wait
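For reference, the relevant part of the ops table in include/linux/fence.h looks
roughly like this, trimmed to the callbacks listed above (the comments are my
summary):

struct fence_ops {
        /* called when the first waiter/callback shows up; must arrange for
         * the fence to actually get signaled (enable the irq, ...) */
        bool (*enable_signaling)(struct fence *fence);
        /* optional fast check whether the fence is already signaled */
        bool (*signaled)(struct fence *fence);
        /* block until signaled or timeout; drivers can use the default
         * implementation or roll their own */
        signed long (*wait)(struct fence *fence, bool intr,
                            signed long timeout);
        /* driver name, timeline name and release callbacks omitted */
};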

Each of those has to do work inside the driver to which the fence belongs, and
each of those can be called from more or less unexpected contexts (with restrictions
like outside irq). So we end up with things like:

  Process 1              Process 2                   Process 3
  I_A_schedule(fence0)
                         CI_A_F_B_signaled(fence0)
                                                     I_A_signal(fence0)
                                                     CI_B_F_A_callback(fence0)
                         CI_A_F_B_wait(fence0)
Legend:
I_x  in driver x (I_A == in driver A)
CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)

So this is a happy mess where everyone calls everyone, and it is bound to get messy.
Yes, I know there are all kinds of requirements on what happens once a fence is
signaled. But those requirements only look like they are trying to atone for any
mess that can happen from the whole callback dance.

While I too was seduced by the whole callback idea a long time ago, I think it is
a highly dangerous path to take, where the combinatorics of what could happen
are bound to explode with the increase in the number of players.


So now back to how to solve the problem we are trying to address. First I want
to make an observation: almost all GPUs that exist today have a command ring
onto which userspace command buffers are executed, and inside the command ring
you can do something like:

   if (condition) execute_command_buffer else skip_command_buffer

where condition is a simple expression (memory_address cop value), with cop one
of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
that any gpu that slightly matters can do that. Those which can not should fix
their command ring processor.
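Purely as an illustration of the kind of packet this means (OP_COND_EXEC, COND_GE,
ring_emit*() and struct ring below are made-up names, not any real driver's format):

static void emit_cond_exec(struct ring *ring,
                           uint64_t fence_addr, /* memory_address the hw block writes */
                           uint64_t value,      /* seq_num we depend on */
                           uint64_t cb_addr,    /* command buffer to execute or skip */
                           uint32_t cb_dwords)
{
        ring_emit(ring, OP_COND_EXEC);
        ring_emit64(ring, fence_addr);
        ring_emit64(ring, value);
        ring_emit(ring, COND_GE);     /* cop: execute iff *fence_addr >= value */
        ring_emit64(ring, cb_addr);
        ring_emit(ring, cb_dwords);
}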


With that in mind, I think the proper solution is implementing timelines and having
fences be timeline objects with a way simpler api. For each hardware timeline the
driver provides a system memory address at which the latest signaled fence
sequence number can be read. Each fence object is uniquely associated with
both a hardware and a pipeline timeline. Each pipeline timeline has a wait
queue.
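A minimal sketch of what such a hardware timeline could look like (illustrative
code, not something that exists today):

struct hw_timeline {
        const char       *name;
        uint64_t         *signaled_seq;  /* system memory, written by the hw block */
        uint64_t         last_emitted;   /* last sequence number handed out */
};

/* a fence with sequence number seq is signaled once the hw block has passed
 * it (sequence number wrap handling omitted for brevity) */
static bool hw_timeline_signaled(struct hw_timeline *htl, uint64_t seq)
{
        return ACCESS_ONCE(*htl->signaled_seq) >= seq;
}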

When scheduling something that requires synchronization on a hardware timeline,
a fence is created and associated with the pipeline timeline and hardware
timeline. Other hardware blocks that need to wait on a fence can use their
command ring conditional execution to directly check the fence sequence number from
the other hw block, so you get optimistic scheduling. If optimistic scheduling
fails (which would be reported by a hw-block-specific mechanism and hidden), then
things can fall back to a software cpu wait inside what could be considered the
kernel thread of the pipeline timeline.
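A hedged sketch of that cpu fallback, reusing hw_timeline_signaled() from the
sketch above and giving the pipeline timeline a plain wait queue (again,
illustrative names only):

struct pipe_timeline {
        wait_queue_head_t wait_queue;
        struct list_head  fences;
};

/* software fallback: sleep on the pipeline timeline until the hardware
 * timeline has passed the fence's sequence number */
static int pipe_timeline_wait_fence(struct pipe_timeline *ptl, struct fence *f)
{
        return wait_event_interruptible(ptl->wait_queue,
                        hw_timeline_signaled(f->hw_timeline, f->seq_num));
}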


 From an api point of view there is no inter-driver call. All a driver needs to
do is wake up the pipeline timeline wait_queue when things are signaled or
when things go sideways (gpu lockup).
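In other words, the signaling path each driver has to provide shrinks to
something like this (illustrative, pipe_timeline_signal does not exist):

/* called from the driver's fence interrupt handler or its lockup handler,
 * after the signaled sequence number in memory has been updated; nothing
 * here crosses a driver boundary */
static void pipe_timeline_signal(struct pipe_timeline *ptl)
{
        wake_up_all(&ptl->wait_queue);
}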


So how to implement that with current drivers? Well, easy. Currently we assume
implicit synchronization, so all we need is an implicit pipeline timeline per
userspace process (note this does not prevent inter-process synchronization).
Every time a command buffer is submitted it is added to the implicit timeline
with the simple fence object:

struct fence {
   struct list_head    list_hwtimeline;   /* link into the hardware timeline */
   struct list_head    list_pipetimeline; /* link into the pipeline timeline */
   struct hw_timeline  *hw_timeline;      /* hw timeline this fence signals on */
   uint64_t            seq_num;           /* sequence number on that hw timeline */
   struct work_struct  timedout_work;     /* scheduled on timeout (gpu lockup) */
   void                *csdata;           /* driver private command stream data */
};

So with a set of helper functions called by each driver's command execution
ioctl, you have an implicit timeline that is properly populated, and each
driver command execution gets its dependencies from the implicit timeline.
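A hypothetical sketch of that flow inside a driver's command execution ioctl
(pipe_timeline_last_fence(), pipe_timeline_add(), driver_schedule_cs() and
driver_emit_fence() are made-up names that only illustrate the proposal):

static int driver_execbuf_ioctl(struct drm_device *dev, void *data,
                                struct drm_file *file)
{
        struct pipe_timeline *ptl = file->driver_priv;  /* implicit, per process */
        struct fence *dep = pipe_timeline_last_fence(ptl);

        /* schedule the user command buffer behind a conditional wait on dep */
        driver_schedule_cs(dev, data, dep);

        /* publish a fence for this submission on both the hardware timeline
         * and the implicit pipeline timeline */
        pipe_timeline_add(ptl, driver_emit_fence(dev));
        return 0;
}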


Of course, to take full advantage of all the flexibility this could offer, we
would need to allow userspace to create pipeline timelines and to schedule
against the pipeline timeline of their choice. We could create a file for
each pipeline timeline and have file operations to wait on / query
progress.
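For instance (hypothetical handlers, just to illustrate the idea):

static const struct file_operations pipe_timeline_fops = {
        .owner   = THIS_MODULE,
        .poll    = pipe_timeline_poll,    /* wait for progress */
        .read    = pipe_timeline_read,    /* query the last signaled point */
        .release = pipe_timeline_release,
};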

Note that gpu lockups are considered exceptional events; the implicit
timeline will probably want to continue with other jobs on other hardware
blocks, but the explicit one will probably want to decide whether to continue,
abort, or retry without the faulty hw block.


I realize I am late to the party and that I should have taken a serious
look at all this a long time ago. I apologize for that, and if you consider
this too late then just ignore me, modulo the big warning about the craziness
that callbacks will introduce and how bad things are bound to happen. I am not
saying that bad things can not happen with what I propose, just that
because everything happens inside the process context that is the one
asking/requiring synchronization, there will be no inter-process kernel
callback (a callback that was registered by one process and that is called
inside another process's time slice because fence signaling is happening
inside that other process's time slice).
So I read through it all and presuming I understand it correctly your
proposal and what we currently have is about the same. The big difference
is that you make a timeline a first-class object and move the callback
queue from the fence to the timeline, which requires callers to check the
fence/seqno/whatever themselves instead of pushing that responsibility onto
the fence implementation.
No, the big difference is that there is no callback, thus when waiting for a
fence you are either inside the process context that needs to wait for it
or inside a kernel thread's process context. Which means in both cases
you can do whatever you want. What I hate about the fence code as it is,
is the callback stuff, because you never know in which context fences
are signaled, and then you never know in which context callbacks are executed.
Look at waitqueues a bit closer. They're implemented with callbacks ;-)
The only difference is that you're allowed to have spurious wakeups and
need to handle that somehow, so need a separate check function.
No, this is not how wait queues are implemented, ie wait queues do not call back a
random function from a random driver; they call back a limited set of functions
from the core linux kernel scheduler, so that the process thread that was waiting
and off the scheduler's run list is added back and marked as something that
should be scheduled. Unless this part of the kernel drastically changed for the
worse recently.

So this is fundamentally different: fences as they are now allow random driver
callbacks, and this is bound to get ugly; it is bound to lead to one driver
doing something that seems innocuous but turns out to wreak havoc when called
from some other driver's function.
No, really, look closer.

fence_default_wait adds a callback, fence_default_wait_cb, which wakes up the waiting thread if the fence gets signaled.
The callback calls wake_up_state, which calls try_to_wake_up.
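The whole callback is roughly this:

struct default_wait_cb {
        struct fence_cb base;
        struct task_struct *task;
};

static void fence_default_wait_cb(struct fence *fence, struct fence_cb *cb)
{
        struct default_wait_cb *wait =
                container_of(cb, struct default_wait_cb, base);

        wake_up_state(wait->task, TASK_NORMAL);
}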

default_wake_function, which is used by wait queues, does something similar: it calls try_to_wake_up.

Fence now has some additional checks, but originally it was implemented as a wait queue.

But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function (qxl).
Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??). And nouveau
doesn't use wait queues at all, but rolls its own.

Fences also don't imply implicit sync, you can use explicit sync if you want.

I posted a patch for this, but if you want to create an android userspace fence, call

struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)
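Rough usage sketch (error handling omitted; get_unused_fd() and
sync_fence_install() are the existing fd / android sync helpers):

int fd = get_unused_fd();
struct sync_fence *sf = sync_fence_create_dma("mydriver", pt); /* pt: your struct fence */

sync_fence_install(sf, fd);
/* hand fd to userspace, it now behaves like any other android sync point */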


I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.

~Maarten


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel



