On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> Hi,
>
> So I went over the whole fence and sync point stuff, as it is becoming a pressing issue. I think we first need to agree on what problem we want to solve and what the requirements to solve it are.
>
> Problem:
> Explicit synchronization between different hardware blocks over a buffer object.
>
> Requirements:
> Share common infrastructure.
> Allow optimal hardware command stream scheduling across hardware blocks.
> Allow Android sync points to be implemented on top of it.
> Handle/acknowledge exceptions (like the good old GPU lockup).
> Minimize driver changes.
>
> Glossary:
> hardware timeline: timeline bound to a specific hardware block.
> pipeline timeline: timeline bound to a userspace rendering pipeline; each point on that timeline can be a composite of several different hardware timeline points.
> pipeline: abstract object representing a userspace application's graphics pipeline, i.e. each of the application's graphics operations.
> fence: specific point on a timeline where synchronization needs to happen.
>
>
> So now, the current include/linux/fence.h implementation is, I believe, missing the objective by confusing hardware and pipeline timelines and by bolting fences to buffer objects, while what is really needed is a true and proper timeline for both hardware and pipeline. But before going further down that road let me look at things and explain how I see them.
>
> The current TTM fence has one and a sole purpose: allow synchronization for buffer object moves, even though some drivers like radeon slightly abuse it and use it for things like lockup detection.
>
> The new fence wants to expose an API that would allow some implementation of a timeline. For that it introduces callbacks and some hard requirements on what the driver has to expose:
>    enable_signaling
>    [signaled]
>    wait
>
> Each of those has to do work inside the driver to which the fence belongs, and each of those can be called from more or less unexpected contexts (with restrictions, like outside IRQ). So we end up with things like:
>
>   Process 1                Process 2                   Process 3
>   I_A_schedule(fence0)
>                            CI_A_F_B_signaled(fence0)
>   I_A_signal(fence0)
>                                                        CI_B_F_A_callback(fence0)
>                            CI_A_F_B_wait(fence0)
>
> Legend:
> I_x: in driver x (I_A == in driver A)
> CI_x_F_y: call in driver x from driver y (CI_A_F_B == call in driver A from driver B)
>
> So this is a happy mess where everyone calls everyone, and it is bound to get messy. Yes, I know there are all kinds of requirements on what happens once a fence is signaled. But those requirements only look like they are trying to attenuate whatever mess can result from the whole callback dance.
>
> While I too was seduced by the whole callback idea a long time ago, I think it is a highly dangerous path to take, where the combinatorics of what could happen are bound to explode with the increase in the number of players.
>
>
> So now back to how to solve the problem we are trying to address. First I want to make an observation: almost all GPUs that exist today have a command ring on which userspace command buffers are executed, and inside the command ring you can do something like:
>
>    if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption that any GPU that remotely matters can do that. Those which cannot should fix their command ring processor.
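As an illustration of that conditional execution trick, a driver-side helper could look roughly like the sketch below. The packet layout and every name in it (hw_ring, ring_emit, PKT3_COND_EXEC, COND_GE, ...) are invented for the example, since every GPU encodes this differently:

/*
 * Illustration only: make a command buffer execute only once the producing
 * hw block has written a high enough sequence number, otherwise skip it.
 * Packet format and helpers are made up; each GPU has its own encoding of
 * "if (mem cop value) execute_command_buffer else skip_command_buffer".
 */
static void emit_conditional_execute(struct hw_ring *ring,
                                     uint64_t seq_gpu_addr, /* where hw writes its last signaled seq */
                                     uint64_t wait_seq,     /* fence sequence number we depend on */
                                     uint64_t cs_gpu_addr,  /* command buffer to run or skip */
                                     uint32_t cs_size)
{
        /* if (*seq_gpu_addr >= wait_seq) execute cs at cs_gpu_addr, else skip it */
        ring_emit(ring, PKT3_COND_EXEC | COND_GE);
        ring_emit(ring, lower_32_bits(seq_gpu_addr));
        ring_emit(ring, upper_32_bits(seq_gpu_addr));
        ring_emit(ring, lower_32_bits(wait_seq));
        ring_emit(ring, upper_32_bits(wait_seq));
        ring_emit(ring, lower_32_bits(cs_gpu_addr));
        ring_emit(ring, upper_32_bits(cs_gpu_addr));
        ring_emit(ring, cs_size);
}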
>
> With that in mind, I think the proper solution is implementing timelines and having a fence be a timeline object with a way simpler API. For each hardware timeline the driver provides a system memory address at which the latest signaled fence sequence number can be read. Each fence object is uniquely associated with both a hardware and a pipeline timeline. Each pipeline timeline has a wait queue.
>
> When scheduling something that requires synchronization on a hardware timeline, a fence is created and associated with the pipeline timeline and the hardware timeline. Other hardware blocks that need to wait on a fence can use their command ring's conditional execution to directly check the fence sequence from the other hw block, so you get optimistic scheduling. If optimistic scheduling fails (which would be reported by a hw-block-specific solution and hidden), then things can fall back to a software CPU wait inside what could be considered the kernel thread of the pipeline timeline.
>
>
> From an API point of view there is no inter-driver call. All the driver needs to do is wake up the pipeline timeline wait_queue when things are signaled or when things go sideways (GPU lockup).
>
>
> So how do we implement that with current drivers? Well, easy. Currently we assume implicit synchronization, so all we need is an implicit pipeline timeline per userspace process (note this does not prevent inter-process synchronization). Every time a command buffer is submitted it is added to the implicit timeline with the simple fence object:
>
> struct fence {
>     struct list_head list_hwtimeline;
>     struct list_head list_pipetimeline;
>     struct hw_timeline *hw_timeline;
>     uint64_t seq_num;
>     work_t timedout_work;
>     void *csdata;
> };
>
> So with a set of helper functions called by each driver's command execution ioctl, you have an implicit timeline that is properly populated, and each driver's command execution gets its dependencies from the implicit timeline.
>
>
> Of course, to take full advantage of all the flexibility this could offer, we would need to allow userspace to create pipeline timelines and to schedule against the pipeline timeline of its choice. We could create a file for each pipeline timeline and have file operations to wait on and query progress.
>
> Note that GPU lockups are considered exceptional events; the implicit timeline will probably want to continue with other jobs on other hardware blocks, but the explicit one will probably want to decide whether to continue, abort, or retry without the faulty hw block.
>
>
> I realize I am late to the party and that I should have taken a serious look at all this a long time ago. I apologize for that, and if you consider this is too late then just ignore me, modulo the big warning about the craziness that callbacks will introduce and how bad things are bound to happen. I am not saying that bad things cannot happen with what I propose, just that because everything happens inside the context of the process that is asking/requiring synchronization, there will be no inter-process kernel callbacks (a callback that was registered by one process and that is called inside another process' time slice because fence signaling happens inside this other process' time slice).
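Before the pseudo code, here is a rough sketch of what the software CPU-wait fallback could look like under this scheme. It assumes the hw timeline keeps a CPU mapping (cpu_seq_addr) of the memory word the hw block writes its latest signaled sequence number to, and that the fence carries a pointer to its pipeline timeline; both are assumptions for illustration and are not part of the struct above:

/*
 * Sketch only: CPU fallback wait for a fence when optimistic hw scheduling
 * is not possible. hw_timeline->cpu_seq_addr and fence->pipetimeline are
 * assumed fields, consistent with the timeline_signal() pseudo code below,
 * which wakes up the pipeline timeline wait_queue.
 */
static int fence_cpu_wait(struct fence *fence, unsigned long timeout)
{
        struct pipeline_timeline *pt = fence->pipetimeline;

        /* The fence is signaled once the hw wrote a sequence >= ours. */
        if (!wait_event_timeout(pt->wait_queue,
                                READ_ONCE(*fence->hw_timeline->cpu_seq_addr) >=
                                fence->seq_num,
                                timeout))
                return -ETIMEDOUT;
        return 0;
}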
>
> Pseudo code for explicitness:
>
> drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>     struct fence *dependency[16], *fence;
>     int m;
>
>     m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>                           dependency, 16, &fence);
>     if (m < 0)
>         return m;
>     if (m >= 16) {
>         // allocate room for m dependencies and call again
>     }
>     dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline, dependency, fence);
> }
>
> int timeline_schedule(ptimeline, hwtimeline, dependency, mdep, **fence)
> {
>     // allocate the fence, set hw_timeline and init the work item
>     // build up the list of dependencies by looking at the list of pending
>     // fences in the pipeline timeline
> }
>
>
> // If the device driver schedules a job hoping for all dependencies to be
> // signaled, then it must also call this function with csdata being a copy
> // of what needs to be executed once all dependencies are signaled.
> void timeline_missed_schedule(timeline, fence, void *csdata)
> {
>     INITWORK(fence->work, timeline_missed_schedule_worker)
>     fence->csdata = csdata;
>     schedule_delayed_work(fence->work, default_timeout)
> }
>
> void timeline_missed_schedule_worker(work)
> {
>     // fence = container_of(work, struct fence, timedout_work)
>     driver = driver_from_fence_hwtimeline(fence)
>
>     // Make sure that each of the hw timeline dependencies will fire an irq
>     // by calling a driver function.
>     timeline_wait_for_fence_dependency(fence);
>     driver->execute_cs(driver, fence);
> }
>
> // This function is called by driver code that signals fences (could be
> // called from interrupt context). It is the responsibility of the device
> // driver to call this function.
> void timeline_signal(hwtimeline)
> {
>     for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>         wakeup(fence->pipetimeline->wait_queue);
>     }
> }

Btw, as an extra note: because of the implicit timeline, any shared object scheduled on a hw timeline must add a fence to all the implicit timelines where this object exists. Also there is no need to have a fence pointer per object.

> Cheers,
> Jérôme
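To make that last note about shared objects a bit more concrete, attaching one fence to every implicit pipeline timeline that knows about the object could look roughly like the sketch below. Every structure and helper name in it (shared_object, object_timeline_link, timeline_add_fence, ...) is invented for the example and is not part of the pseudo code above:

/* Hypothetical: one link per (shared object, implicit pipeline timeline) pair. */
struct object_timeline_link {
        struct list_head obj_link;
        struct pipeline_timeline *pipetimeline;
};

/*
 * Sketch only: when a command submission uses a shared object, queue its
 * fence on the implicit pipeline timeline of every process that shares the
 * object, so later submissions in those processes pick up the dependency
 * without the object itself keeping a fence pointer.
 */
static void object_attach_fence(struct shared_object *obj, struct fence *fence)
{
        struct object_timeline_link *link;

        list_for_each_entry(link, &obj->timeline_links, obj_link)
                timeline_add_fence(link->pipetimeline, fence);
}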