On Fri, Oct 7, 2011 at 9:38 AM, Jerome Glisse <j.glisse@xxxxxxxxx> wrote: > On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote: >> On 10/07/2011 12:42 AM, Marek Olšák wrote: >>> >>> On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom<thomas@xxxxxxxxxxxx> >>> wrote: >>> >>>> >>>> In any case, I'm not saying fences is the best way to flush but since the >>>> bo >>>> code assumes that signaling a sync object means "make the buffer contents >>>> available for CPU read / write", it's usually a good way to do it; >>>> there's >>>> even a sync_obj_flush() method that gets called when a potential flush >>>> needs >>>> to happen. >>>> >>> >>> I don't think we use it like that. To my knowledge, the purpose of the >>> sync obj (to Radeon Gallium drivers at least) is to be able to wait >>> for the last use of a buffer. Whether the contents can or cannot be >>> available to the CPU is totally irrelevant. >>> >>> Currently (and it's a very important performance optimization), >>> buffers stay mapped and available for CPU read/write during their >>> first map_buffer call. Unmap_buffer is a no-op. The unmapping happens >>> on buffer destruction. We only call bo_wait when we want to wait for >>> the GPU until it's done with the buffer (we don't always want that, >>> sometimes we want to use the unsychronized flag). Otherwise the >>> contents of buffers are available at *any time*. >>> >>> We could probably implement bo_wait privately in the kernel driver and >>> not use ttm_bo_wait. I preferred code sharing though. >>> >>> Textures (especially the tiled ones) are never mapped directly and a >>> temporary staging resource is used instead, so we don't actually >>> pollute address space that much. (in case you would have such a >>> remark) We will use staging resources for buffers too, but it's really >>> the last resort to avoid waiting when direct access can't be used. >>> >>> >>> >>>>>> >>>>>> 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the >>>>>> difference between those two? I think we should remove the >>>>>> write_sync_obj >>>>>> bo >>>>>> member. >>>>>> >>>>>> >>>>> >>>>> Okay, but I think we should remove sync_obj instead, and keep write >>>>> and read sync objs. In the case of READWRITE usage, read_sync_obj >>>>> would be equal to write_sync_obj. >>>>> >>>>> >>>>> >>>> >>>> Sure, I'm fine with that. >>>> >>>> One other thing, though, that makes me a little puzzled: >>>> >>>> Let's assume you don't allow readers and writers at the same time, which >>>> is >>>> my perception of how read- and write fences should work; you either have >>>> a >>>> list of read fences or a single write fence (in the same way a read-write >>>> lock works). >>>> >>>> Now, if you only allow a single read fence, like in this patch. That >>>> implies >>>> that you can only have either a single read fence or a single write fence >>>> at >>>> any one time. We'd only need a single fence pointer on the bo, and >>>> sync_obj_arg would tell us whether to signal the fence for read or for >>>> write >>>> (assuming that sync_obj_arg was set up to indicate read / write at >>>> validation time), then the patch really isn't necessary at all, as it >>>> only >>>> allows a single read fence? >>>> >>>> Or is it that you want to allow read- and write fences co-existing? In >>>> that >>>> case, what's the use case? >>>> >>> >>> There are lots of read-write use cases which don't need any barriers >>> or flushing. The obvious ones are color blending and depth-stencil >>> buffering. The OpenGL application is also allowed to use a subrange of >>> a buffer as a vertex buffer (read-only) and another disjoint subrange >>> of the same buffer for transform feedback (write-only), which kinda >>> makes me think about whether we should track subranges instead of >>> treating a whole buffer as "busy". It gets even more funky with >>> ARB_shader_image_load_store, which supports atomic read-modify-write >>> operations on textures, not to mention atomic memory operations in >>> compute shaders (wait, isn't that also exposed in GL as >>> GL_ARB_shader_atomic_counters?). >>> >>> I was thinking whether the two sync objs should be "read" and >>> "readwrite", or "read" and "write". I chose the latter, because it's >>> more fine-grained and we have to keep at least two of them around >>> anyway. >>> >>> So now that you know what we use sync objs for, what are your ideas on >>> re-implementing that patch in a way that is okay with you? Besides >>> removing the third sync objs of course. >>> >>> Marek >>> >> >> OK. First I think we need to make a distinction: bo sync objects vs driver >> fences. The bo sync obj api is there to strictly provide functionality that >> the ttm bo subsystem is using, and that follows a simple set of rules: >> >> 1) the bo subsystem does never assume sync objects are ordered. That means >> the bo subsystem needs to wait on a sync object before removing it from a >> buffer. Any other assumption is buggy and must be fixed. BUT, if that >> assumption takes place in the driver unknowingly from the ttm bo subsystem >> (which is usually the case), it's OK. >> >> 2) When the sync object(s) attached to the bo are signaled the ttm bo >> subsystem is free to copy the bo contents and to unbind the bo. >> >> 3) The ttm bo system allows sync objects to be signaled in different ways >> opaque to the subsystem using sync_obj_arg. The driver is responsible for >> setting up that argument. >> >> 4) Driver fences may be used for or expose other functionality or adaptions >> to APIs as long as the sync obj api exported to the bo sybsystem follows the >> above rules. >> >> This means the following w r t the patch. >> >> A) it violates 1). This is a bug that must be fixed. Assumptions that if one >> sync object is singnaled, another sync object is also signaled must be done >> in the driver and not in the bo subsystem. Hence we need to explicitly wait >> for a fence to remove it from the bo. >> >> B) the sync_obj_arg carries *per-sync-obj* information on how it should be >> signaled. If we need to attach multiple sync objects to a buffer object, we >> also need multiple sync_obj_args. This is a bug and needs to be fixed. >> >> C) There is really only one reason that the ttm bo subsystem should care >> about multiple sync objects, and that is because the driver can't order them >> efficiently. A such example would be hardware with multiple pipes reading >> simultaneously from the same texture buffer. Currently we don't support this >> so only the *last* sync object needs to be know by the bo subsystem. Keeping >> track of multiple fences generates a lot of completely unnecessary code in >> the ttm_bo_util file, the ttm_bo_vm file, and will be a nightmare if / when >> we truly support pipelined moves. >> >> As I understand it from your patches, you want to keep multiple fences >> around only to track rendering history. If we want to do that generically, i >> suggest doing it in the execbuf util code in the following way: >> >> struct ttm_eu_rendering_history { >> void *last_read_sync_obj; >> void *last_read_sync_obj_arg; >> void *last_write_sync_obj; >> void *last_write_sync_obj_arg; >> } >> >> Embed this structure in the radeon_bo, and build a small api around it, >> including *optionally* passing it to the existing execbuf utilities, and you >> should be done. The bo_util code and bo_vm code doesn't care about the >> rendering history. Only that the bo is completely idle. >> >> Note also that when an accelerated bo move is scheduled, the driver needs to >> update this struct >> >> /Thomas > > I should have look at the patch long ago ... anyway i think a better > approach would be to expose fence id as 64bits unsigned to each > userspace client. I was thinking of mapping a page readonly (same page > as the one gpu write back) at somespecific offset in drm file (bit > like sarea but readonly so no lock). Each time userspace submit a > command stream it would get the fence id associated with the command > stream. It would then be up to userspace to track btw read or write > and do appropriate things. The kernel code would be simple (biggest > issue is finding an offset we can use for that), it would be fast as > no round trip to kernel to know the last fence. > > Each fence seq would be valid only for a specific ring (only future > gpu impacted here, maybe cayman). > > So no change to ttm, just change to radeon to return fence seq and to > move to an unsigned 64. Issue would be when gpu write back is > disabled, then we would either need userspace to call somethings like > bo wait or to other ioctl to get the kernel to update the copy, copy > would be updated in the irq handler too so at least it get updated > anytime something enable irq. > > Cheers, > Jerome > Forget to mention that we would need a new wait_fence ioctl to wait for a specific fence seq on a specific ring Cheers, Jerome _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/dri-devel