On Sat, Jan 24, 2015 at 04:08:32PM +0000, Chris Wilson wrote:
> On Sat, Jan 24, 2015 at 10:41:46AM +0100, Daniel Vetter wrote:
> > On Fri, Jan 23, 2015 at 6:30 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> > > On Fri, Jan 23, 2015 at 04:53:48PM +0100, Daniel Vetter wrote:
> > >> Yeah, that's kind of the big behaviour difference (at least as I
> > >> see it) between explicit sync and implicit sync:
> > >>
> > >> - With implicit sync the kernel attaches sync points/requests to
> > >>   buffers and userspace just asks about the idleness/busyness of
> > >>   buffers. Synchronization between different users is all handled
> > >>   behind userspace's back in the kernel.
> > >>
> > >> - Explicit sync attaches sync points to individual bits of work and
> > >>   makes them explicit objects userspace can get at and pass around.
> > >>   Userspace uses these separate things to inquire about when
> > >>   something is done/idle/busy and has its own mapping between
> > >>   explicit sync objects and the different pieces of memory affected
> > >>   by each. Synchronization between different clients is handled
> > >>   explicitly by passing sync objects around each time some rendering
> > >>   is done.
> > >>
> > >> The bigger driver for explicit sync (besides "nvidia likes it sooooo
> > >> much that everyone uses it a lot") seems to be a) shitty gpu drivers
> > >> without proper bo managers (*cough*android*cough*) and b) svm, where
> > >> there are simply no buffer objects any more to attach sync
> > >> information to.
> > >
> > > Actually, mesa would really like much finer granularity than at batch
> > > boundaries. Having a sync object for a batch boundary itself is very
> > > meh and not a substantive improvement on what is possible today, but
> > > being able to convert the implicit sync into an explicit fence object
> > > is interesting and lends a layer of abstraction that could make it
> > > more versatile. Most importantly, it allows me to defer the overhead
> > > of fence creation until I actually want to sleep on a completion.
> > > Also, Jesse originally supported inserting fences inside a batch,
> > > which looked interesting if impractical.
> >
> > If we want to allow the kernel to stall on fences (e.g. in the
> > scheduler), then only the kernel should be allowed to create fences
> > imo. At least current fences assume that they _will_ signal
> > eventually, and for i915 fences we have the hangcheck to ensure this
> > is the case. In-batch fences and lazy fence creation (beyond just
> > delaying the fd allocation to avoid too many fds flying around) are
> > therefore a no-go.
>
> Lazy fence creation (i.e. attaching a fence to a buffer) just means
> creating the fd for an existing request (which is derived from the
> fence). Or, if the buffer is read- or write-idle, then you just create
> the fence as already signaled. And yes, it is just to avoid death by a
> thousand file descriptors, and especially creating one every batch.

I think the problem will be platforms that want full explicit fencing
(like android) but allow delayed creation of the fence fd from a gl sync
object (like the android egl extension allows). I'm not sure yet how best
to expose that, really, since just creating a fence from the implicit
request attached to the batch might upset the interface purists with its
mix of implicit and explicit fencing ;-) Hence I think for now we should
just do the eager fd creation at execbuf until ppl scream (well, maybe
not merge this patch until ppl scream ...).
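To make the "eager fd creation at execbuf" idea a bit more concrete, here
is a minimal userspace-side sketch. The EXEC_FENCE_OUT flag name, its
value, and the convention of returning the fd via rsvd2 are assumptions
made up for illustration, not the uapi under discussion; the point is
only that the fence fd exists from submission onwards and can then be
poll()ed or handed to another client.

/*
 * Hypothetical sketch only: EXEC_FENCE_OUT and the use of rsvd2 to
 * return the fd are illustrative assumptions, not existing i915 uapi.
 */
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <drm/i915_drm.h>

#define EXEC_FENCE_OUT (1 << 17) /* hypothetical "give me a fence fd" flag */

static int submit_with_fence(int drm_fd, struct drm_i915_gem_execbuffer2 *eb)
{
	eb->flags |= EXEC_FENCE_OUT;

	if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, eb))
		return -1;

	/* Assume the kernel hands the fence fd back in rsvd2. */
	return (int)eb->rsvd2;
}

static void wait_for_batch(int fence_fd)
{
	struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };

	/* Block until the batch's fence signals, then drop the fd. */
	poll(&pfd, 1, -1);
	close(fence_fd);
}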
> > For that kind of fine-grained sync between gpu and cpu workloads the
> > solution thus far (at least what I've seen) is just busy-looping.
> > Usually those workloads have a few orders of magnitude more sync
> > points than frames we tend to render, so blocking isn't terribly
> > efficient anyway.
>
> Heck, I think a full-fledged fence fd per batch is still more overhead
> than I want.

One idea that crossed my mind is to expose the 2nd interrupt source to
userspace somehow (we have pipe_control/mi_flush_dw and
mi_user_interrupt). Then we could use that, maybe with some wakeup
filtering, to allow userspace to block a bit more efficiently. But my gut
feeling still says that most likely a bit of busy-looping won't hurt in
such a case with very fine-grained synchronization. There should be a
full blocking kernel request nearby. And often the check is for "busy or
not" only anyway, and that can already be done with seqno writes from
batchbuffers to a per-ctx bo that userspace manages.

-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
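A minimal sketch of the "busy or not" check described above, i.e.
userspace polling a seqno that batchbuffers write into a per-ctx bo. The
seqno_page layout, the slot scheme and the yield-based loop are
illustrative assumptions; how the GPU actually writes the seqno (e.g. a
store-dword command emitted with the batch) is outside the sketch.

#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/* CPU mapping of the userspace-managed, per-context seqno bo. */
struct seqno_page {
	volatile uint32_t *map;	/* one dword per tracked sync point */
};

static bool sync_point_signaled(const struct seqno_page *page,
				uint32_t slot, uint32_t wanted)
{
	/*
	 * The batch writes 'wanted' into 'slot' when the work completes;
	 * compare with wraparound in mind.
	 */
	return (int32_t)(page->map[slot] - wanted) >= 0;
}

static void busy_wait(const struct seqno_page *page,
		      uint32_t slot, uint32_t wanted)
{
	/*
	 * Fine-grained sync points are expected to signal quickly, so a
	 * simple busy-loop (with a yield to be slightly kinder) is often
	 * good enough; a real user would eventually fall back to a
	 * blocking wait on the enclosing kernel request.
	 */
	while (!sync_point_signaled(page, slot, wanted))
		sched_yield();
}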