Re: Fence, timeline and android sync points

On Wed, Aug 13, 2014 at 04:08:14PM +0200, Christian König wrote:
> >The whole issue is that today cs ioctl assume implied synchronization. So this
> >can not change, so for now anything that goes through cs ioctl would need to
> >use an implied timeline and have all ring that use common buffer synchronize
> >on it. As long as those ring use different buffer there is no need for sync.
> Exactly my thoughts.
> 
> >Buffer object are what links hw timeline.
> A couple of people at AMD have a problem with that and I'm currently working
> full time on a solution. But solving this and keeping 100% backward
> compatibility at the same time is not an easy task.

Let me rephrase: with the current cs ioctl, forcing synchronization for every
buffer that appears on different hw rings is mandatory, and there is no way to
fix that.
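
To make that rule concrete, here is a minimal sketch in plain C. It is not
actual radeon or ttm code: bo_state, implicit_sync_collect and
implicit_sync_commit are hypothetical names, and it assumes seqnos start at 1
so that 0 can mean "nothing to wait for". A cs only has to wait for the rings
that last touched one of its buffers, and only if that ring is not its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_RINGS = 4, NO_RING = -1 };

/* Hypothetical per-buffer bookkeeping: last ring that used the buffer and
 * the seqno of that use (seqnos are assumed to start at 1). */
struct bo_state {
    int      last_ring;
    uint64_t last_seqno;
};

/* Compute what a cs submitted on `ring` must wait for under implied sync.
 * wait_seqno[r] receives the highest seqno on ring r that must complete
 * first, or 0 if ring r needs no wait. Returns the number of rings to
 * wait on. */
size_t implicit_sync_collect(const struct bo_state *bos, size_t nbos,
                             int ring, uint64_t wait_seqno[NUM_RINGS])
{
    size_t nwaits = 0;

    assert(ring >= 0 && ring < NUM_RINGS);
    for (int r = 0; r < NUM_RINGS; r++)
        wait_seqno[r] = 0;

    for (size_t i = 0; i < nbos; i++) {
        const struct bo_state *bo = &bos[i];

        /* Never used, or last used on our own ring: the ring executes in
         * order, so no cross-ring sync is needed. */
        if (bo->last_ring == NO_RING || bo->last_ring == ring)
            continue;
        if (wait_seqno[bo->last_ring] == 0)
            nwaits++;
        if (bo->last_seqno > wait_seqno[bo->last_ring])
            wait_seqno[bo->last_ring] = bo->last_seqno;
    }
    return nwaits;
}

/* After submission, record that every buffer of the cs was last used on
 * `ring` at `seqno`. */
void implicit_sync_commit(struct bo_state *bos, size_t nbos,
                          int ring, uint64_t seqno)
{
    for (size_t i = 0; i < nbos; i++) {
        bos[i].last_ring = ring;
        bos[i].last_seqno = seqno;
    }
}
```

Two cs on different rings that share no buffer never produce a wait here,
which is the "different buffer, no sync" case from my previous mail.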

That being said, one can imagine a single buffer where one engine works on one
region of it and another hw block works on another, non-overlapping region, in
which case there is no need for synchronization between those hw blocks (like
multi-gpu with each gpu rendering one half of the screen). But to do such a
thing properly you need to expose a timeline, or something like it, to
userspace and have userspace emit syncs on this timeline. So something like :

cs_ioctl(timeline, cs) {return (seqno, hwtimeline_id);}
timeline_sync(nsync, seqno[], hwtimeline_id[])

When you schedule something using a new ioctl that just takes a timeline as an
extra parameter, you add no synchronization to the timeline; you assume that
userspace will call timeline_sync, which inserts a synchronization point on
the timeline. So you can schedule a bunch of cs on different hw blocks,
userspace keeps track of the last emitted cs seqno and its associated
hwtimeline, and when it wants to synchronize it calls timeline_sync, and any
new cs ioctl on that timeline will have to wait before being able to schedule.

So really I do not see how to fix that properly without a new cs ioctl that
just takes an extra timeline as a parameter (well, in the case of radeon we can
add a timeline chunk to the cs ioctl).
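
A rough userspace-level model of that explicit scheme, in plain C, could look
like the following. This is only a sketch under assumptions, not a proposed
kernel interface: struct device, struct timeline, struct cs, cs_submit and
cs_can_run are hypothetical, timeline_sync takes the timeline as an explicit
first argument for clarity, and seqnos are assumed to start at 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NUM_HW = 3 };   /* e.g. gfx, dma, vce */

/* Hypothetical device state: per hw timeline, last emitted and last
 * signaled seqno. */
struct device {
    uint64_t last_emitted[NUM_HW];
    uint64_t last_signaled[NUM_HW];
};

/* Explicit timeline: only remembers the sync points userspace inserted.
 * wait_seqno[hw] == 0 means no sync point against that hw timeline. */
struct timeline {
    uint64_t wait_seqno[NUM_HW];
};

/* What the cs ioctl hands back (seqno and hwtimeline_id), plus the sync
 * points that were in effect on the timeline when the cs was scheduled. */
struct cs {
    int      hwtimeline_id;
    uint64_t seqno;
    uint64_t wait_seqno[NUM_HW];
};

/* Scheduling adds no synchronization by itself; the cs only inherits the
 * sync points already present on the timeline. */
struct cs cs_submit(struct device *dev, const struct timeline *tl, int hw)
{
    struct cs cs;

    assert(hw >= 0 && hw < NUM_HW);
    cs.hwtimeline_id = hw;
    cs.seqno = ++dev->last_emitted[hw];
    memcpy(cs.wait_seqno, tl->wait_seqno, sizeof(cs.wait_seqno));
    return cs;
}

/* Insert a sync point: every cs scheduled on this timeline afterwards has
 * to wait for the listed (seqno, hwtimeline_id) pairs. */
void timeline_sync(struct timeline *tl, size_t nsync,
                   const uint64_t seqno[], const int hwtimeline_id[])
{
    for (size_t i = 0; i < nsync; i++) {
        int hw = hwtimeline_id[i];

        assert(hw >= 0 && hw < NUM_HW);
        if (seqno[i] > tl->wait_seqno[hw])
            tl->wait_seqno[hw] = seqno[i];
    }
}

/* A cs may run once every sync point it inherited has signaled. */
bool cs_can_run(const struct device *dev, const struct cs *cs)
{
    for (int hw = 0; hw < NUM_HW; hw++)
        if (dev->last_signaled[hw] < cs->wait_seqno[hw])
            return false;
    return true;
}
```

With this, two cs submitted on different hw blocks before any timeline_sync
call can run concurrently, and a cs submitted after
timeline_sync(tl, 2, seqno, hwtimeline_id) waits for both of them. A process
with two unrelated tasks would simply use two struct timeline instances.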

> 
> >Of course there might be way to be more flexible if timeline are expose to
> >userspace and userspace can create several of them for a single process.
> Concurrent execution is mostly used for temporary things, e.g. copying a
> result to a userspace buffer while VCE is decoding into the ring buffer at a
> different location. Creating an extra timeline just to tell the
> kernel that two commands are allowed to run in parallel sounds like too much
> overhead to me.

It was never my intention to create a different timeline for that. As said
above, when scheduling with an explicit timeline there is no synchronization
whatsoever each time you schedule. The only time an engine has to wait is when
userspace emits an explicit sync point.

Allowing multiple timelines per process is more for cases where a process is
working on two distinct problems with no interdependency. Hence one timeline
for each of those tasks, but inside each task you can schedule things
concurrently as said above.

> 
> Cheers,
> Christian.
> 
> On 13.08.2014 at 15:41, Jerome Glisse wrote:
> >On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
> >>Hi Jerome,
> >>
> >>first of all that finally sounds like somebody starts to draw the whole
> >>picture for me.
> >>
> >>So far all I have seen was a bunch of specialized requirements and some not
> >>so obvious design decisions based on those requirements.
> >>
> >>So thanks a lot for finally summarizing the requirements from a top-down
> >>view. I perfectly agree with your analysis of the current fence design
> >>and the downsides of that API.
> >>
> >>Apart from that I also have some comments / requirements that hopefully can
> >>be taken into account as well:
> >>
> >>>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >>>                      point on that timeline can be a composite of several
> >>>                      different hardware pipeline point.
> >>>   pipeline: abstract object representing userspace application graphic pipeline
> >>>             of each of the application graphic operations.
> >>In the long term a requirement for the driver for AMD GFX hardware is that
> >>instead of a fixed pipeline timeline we need a bit more flexible model where
> >>concurrent execution on different hardware engines is possible as well.
> >>
> >>So the requirement is that you can do things like submitting a 3D job A, a
> >>DMA job B, a VCE job C and another 3D job D that are executed like this:
> >>     A
> >>    /  \
> >>   B  C
> >>    \  /
> >>     D
> >>
> >>(Let's just hope that looks as good on your mail client as it looked for
> >>me).
> >My thinking on hw timelines is that a gpu like amd or nvidia would have several
> >different hw timelines. They are per block/engine, so one for the dma ring, one
> >for gfx, one for vce, ....
> >
> >>My current thinking is that we avoid having a pipeline object in the kernel
> >>and instead let userspace specify explicitly which fence we want to
> >>synchronize to, as long as everything stays within the same client. As soon as
> >>any buffer is shared between clients the kernel would need to fall back
> >>to implicit synchronization to allow backward compatibility with DRI2/3.
> >The whole issue is that today cs ioctl assume implied synchronization. So this
> >can not change, so for now anything that goes through cs ioctl would need to
> >use an implied timeline and have all ring that use common buffer synchronize
> >on it. As long as those ring use different buffer there is no need for sync.
> >
> >Buffer object are what links hw timeline.
> >
> >Of course there might be way to be more flexible if timeline are expose to
> >userspace and userspace can create several of them for a single process.
> >
> >>>   if (condition) execute_command_buffer else skip_command_buffer
> >>>
> >>>where condition is a simple expression (memory_address cop value)) with cop one
> >>>of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >>>that any gpu that slightly matter can do that. Those who can not should fix
> >>>there command ring processor.
> >>At least for some engines on AMD hardware that isn't possible (UVD, VCE and
> >>to some extent DMA as well), but I don't see any reason why we shouldn't be
> >>able to use software based scheduling on those engines by default. So this
> >>isn't really a problem, but just an additional comment to keep in mind.
> >Yes, not everything can do that, but as it's a simple memory access with a
> >simple comparison it's easy to do on the cpu for limited hardware. But this
> >really sounds like something so easy to add to hw ring execution that it is a
> >shame hw designers have not already added such a thing.
> >
> >>Regards,
> >>Christian.
> >>
> >>On 13.08.2014 at 00:13, Jerome Glisse wrote:
> >>>Hi,
> >>>
> >>>So I went over the whole fence and sync point stuff as it's becoming a pressing
> >>>issue. I think we first need to agree on what problem we want to solve
> >>>and what the requirements to solve it would be.
> >>>
> >>>Problem :
> >>>   Explicit synchronization between different hardware blocks over a buffer object.
> >>>
> >>>Requirements :
> >>>   Share common infrastructure.
> >>>   Allow optimal hardware command stream scheduling across hardware blocks.
> >>>   Allow android sync point to be implemented on top of it.
> >>>   Handle/acknowledge exception (like good old gpu lockup).
> >>>   Minimize driver changes.
> >>>
> >>>Glossary :
> >>>   hardware timeline: timeline bound to a specific hardware block.
> >>>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >>>                      point on that timeline can be a composite of several
> >>>                      different hardware pipeline point.
> >>>   pipeline: abstract object representing userspace application graphic pipeline
> >>>             of each of the application graphic operations.
> >>>   fence: specific point in a timeline where synchronization needs to happen.
> >>>
> >>>
> >>>So now, current include/linux/fence.h implementation is i believe missing the
> >>>objective by confusing hardware and pipeline timeline and by bolting fence to
> >>>buffer object while what is really needed is true and proper timeline for both
> >>>hardware and pipeline. But before going further down that road let me look at
> >>>things and explain how i see them.
> >>>
> >>>Current ttm fences have one sole purpose: allowing synchronization for buffer
> >>>object moves, even though some drivers like radeon slightly abuse them and use
> >>>them for things like lockup detection.
> >>>
> >>>The new fence wants to expose an api that would allow some implementation of a
> >>>timeline. For that it introduces callbacks and some hard requirements on what
> >>>the driver has to expose :
> >>>   enable_signaling
> >>>   [signaled]
> >>>   wait
> >>>
> >>>Each of those has to do work inside the driver to which the fence belongs, and
> >>>each of those can be called from more or less unexpected contexts (with
> >>>restrictions like outside irq). So we end up with things like :
> >>>
> >>>  Process 1              Process 2                   Process 3
> >>>  I_A_schedule(fence0)
> >>>                         CI_A_F_B_signaled(fence0)
> >>>                                                     I_A_signal(fence0)
> >>>                                                     CI_B_F_A_callback(fence0)
> >>>                         CI_A_F_B_wait(fence0)
> >>>Legend:
> >>>I_x  in driver x (I_A == in driver A)
> >>>CI_x_F_y call in driver x from driver y (CI_A_F_B call in driver A from driver B)
> >>>
> >>>So this is a happy mess where everyone calls everyone, and this is bound to get
> >>>messy. Yes, I know there are all kinds of requirements on what happens once a
> >>>fence is signaled. But those requirements only look like they are trying to
> >>>atone for any mess that can happen from the whole callback dance.
> >>>
> >>>While I too was seduced by the whole callback idea a long time ago, I think it
> >>>is a highly dangerous path to take, where the combinatorics of what could
> >>>happen are bound to explode as the number of players increases.
> >>>
> >>>
> >>>So now back to how to solve the problem we are trying to address. First I want
> >>>to make an observation: almost all GPUs that exist today have a command ring
> >>>on which userspace command buffers are executed, and inside the command ring
> >>>you can do something like :
> >>>
> >>>   if (condition) execute_command_buffer else skip_command_buffer
> >>>
> >>>where condition is a simple expression (memory_address cop value)) with cop one
> >>>of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >>>that any gpu that slightly matter can do that. Those who can not should fix
> >>>there command ring processor.
> >>>
> >>>
> >>>With that in mind, I think the proper solution is implementing timelines and
> >>>having the fence be a timeline object with a much simpler api. For each
> >>>hardware timeline the driver provides a system memory address at which the
> >>>latest signaled fence sequence number can be read. Each fence object is
> >>>uniquely associated with both a hardware and a pipeline timeline. Each
> >>>pipeline timeline has a wait queue.
> >>>
> >>>When scheduling something that requires synchronization on a hardware timeline,
> >>>a fence is created and associated with the pipeline timeline and the hardware
> >>>timeline. Other hardware blocks that need to wait on a fence can use their
> >>>command ring conditional execution to directly check the fence sequence from
> >>>the other hw block, so you do optimistic scheduling. If optimistic scheduling
> >>>fails (which would be reported by a hw block specific solution and hidden) then
> >>>things can fall back to a software cpu wait inside what could be considered the
> >>>kernel thread of the pipeline timeline.
> >>>
> >>>
> >>>From an api point of view there are no inter-driver calls. All the driver needs
> >>>to do is wake up the pipeline timeline wait_queue when things are signaled or
> >>>when things go sideways (gpu lockup).
> >>>
> >>>
> >>>So how to implement that with current drivers? Well, easy. Currently we assume
> >>>implicit synchronization, so all we need is an implicit pipeline timeline per
> >>>userspace process (note this does not prevent inter-process synchronization).
> >>>Every time a command buffer is submitted it is added to the implicit timeline
> >>>with this simple fence object :
> >>>
> >>>struct fence {
> >>>   struct list_head   list_hwtimeline;
> >>>   struct list_head   list_pipetimeline;
> >>>   struct hw_timeline *hw_timeline;
> >>>   uint64_t           seq_num;
> >>>   work_t             timedout_work;
> >>>   void               *csdata;
> >>>};
> >>>
> >>>So with a set of helper functions called by each driver's command execution
> >>>ioctl, the implicit timeline is properly populated and each driver's command
> >>>execution gets its dependencies from the implicit timeline.
> >>>
> >>>
> >>>Of course, to take full advantage of all the flexibility this could offer we
> >>>would need to allow userspace to create pipeline timelines and to schedule
> >>>against the pipeline timeline of their choice. We could create a file for
> >>>each pipeline timeline and have file operations to wait/query
> >>>progress.
> >>>
> >>>Note that gpu lockups are considered exceptional events. The implicit
> >>>timeline will probably want to continue with other jobs on other hardware
> >>>blocks, but the explicit one will probably want to decide whether to continue,
> >>>abort, or retry without the faulty hw block.
> >>>
> >>>
> >>>I realize I am late to the party and that I should have taken a serious
> >>>look at all this a long time ago. I apologize for that, and if you consider
> >>>this too late then just ignore me, modulo the big warning about the craziness
> >>>that callbacks will introduce and how bad things are bound to happen. I am not
> >>>saying that bad things can not happen with what I propose, just that
> >>>because everything happens inside the process context that is the one
> >>>asking/requiring synchronization, there will be no interprocess kernel
> >>>callback (a callback that was registered by one process and that is called
> >>>inside another process's time slice because fence signaling is happening
> >>>inside this other process's time slice).
> >>>
> >>>
> >>>Pseudo code for explicitness :
> >>>
> >>>int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> >>>{
> >>>    struct fence *dependency[16], *fence;
> >>>    int m;
> >>>
> >>>    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
> >>>                          dependency, 16, &fence);
> >>>    if (m < 0)
> >>>      return m;
> >>>    if (m >= 16) {
> >>>        // alloc m and recall;
> >>>    }
> >>>    return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline, dependency, fence);
> >>>}
> >>>
> >>>int timeline_schedule(ptimeline, hwtimeline, timeout,
> >>>                        dependency, mdep, **fence)
> >>>{
> >>>    // allocate fence set hw_timeline and init work
> >>>    // build up list of dependency by looking at list of pending fence in
> >>>    // timeline
> >>>}
> >>>
> >>>
> >>>
> >>>// If the device driver schedules a job hoping for all dependencies to be
> >>>// signaled, then it must also call this function with csdata being a copy of
> >>>// what needs to be executed once all dependencies are signaled.
> >>>void timeline_missed_schedule(timeline, fence, void *csdata)
> >>>{
> >>>    INITWORK(fence->timedout_work, timeline_missed_schedule_worker)
> >>>    fence->csdata = csdata;
> >>>    schedule_delayed_work(fence->timedout_work, default_timeout)
> >>>}
> >>>
> >>>void timeline_missed_schedule_worker(work)
> >>>{
> >>>    // Recover the fence that embeds this work item.
> >>>    fence = container_of(work, struct fence, timedout_work)
> >>>    driver = driver_from_fence_hwtimeline(fence)
> >>>
> >>>    // Make sure that each of the hwtimeline dependencies will fire an irq by
> >>>    // calling a driver function.
> >>>    timeline_wait_for_fence_dependency(fence);
> >>>    driver->execute_cs(driver, fence);
> >>>}
> >>>
> >>>// This function is called by driver code that signals a fence (it could be
> >>>// called from interrupt context). It is the responsibility of the device
> >>>// driver to call that function.
> >>>void timeline_signal(hwtimeline)
> >>>{
> >>>   for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
> >>>     wakeup(fence->pipetimeline->wait_queue);
> >>>   }
> >>>}
> >>>
> >>>
> >>>Cheers,
> >>>Jérôme
> >>_______________________________________________
> >>dri-devel mailing list
> >>dri-devel@xxxxxxxxxxxxxxxxxxxxx
> >>http://lists.freedesktop.org/mailman/listinfo/dri-devel
> 




