On Fri, Oct 7, 2011 at 10:05 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote:
> On 10/07/2011 03:24 PM, Alex Deucher wrote:
>>
>> On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote:
>>>
>>> On 10/07/2011 12:42 AM, Marek Olšák wrote:
>>>>
>>>> On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> In any case, I'm not saying fences are the best way to flush, but
>>>>> since the bo code assumes that signaling a sync object means "make
>>>>> the buffer contents available for CPU read / write", it's usually a
>>>>> good way to do it; there's even a sync_obj_flush() method that gets
>>>>> called when a potential flush needs to happen.
>>>>
>>>> I don't think we use it like that. To my knowledge, the purpose of the
>>>> sync obj (to Radeon Gallium drivers at least) is to be able to wait
>>>> for the last use of a buffer. Whether the contents can or cannot be
>>>> made available to the CPU is totally irrelevant.
>>>>
>>>> Currently (and it's a very important performance optimization),
>>>> buffers stay mapped and available for CPU read/write from their first
>>>> map_buffer call onward. Unmap_buffer is a no-op. The unmapping happens
>>>> on buffer destruction. We only call bo_wait when we want to wait for
>>>> the GPU until it's done with the buffer (we don't always want that;
>>>> sometimes we want to use the unsynchronized flag). Otherwise the
>>>> contents of buffers are available at *any time*.
>>>>
>>>> We could probably implement bo_wait privately in the kernel driver and
>>>> not use ttm_bo_wait. I preferred code sharing though.
>>>>
>>>> Textures (especially the tiled ones) are never mapped directly and a
>>>> temporary staging resource is used instead, so we don't actually
>>>> pollute the address space that much (in case you were going to make
>>>> that remark). We will use staging resources for buffers too, but it's
>>>> really the last resort, to avoid waiting when direct access can't be
>>>> used.
>>>>
>>>>>>> 2) Can't we say that a write_sync_obj is simply a sync_obj? What's
>>>>>>> the difference between those two? I think we should remove the
>>>>>>> write_sync_obj bo member.
>>>>>>
>>>>>> Okay, but I think we should remove sync_obj instead, and keep write
>>>>>> and read sync objs. In the case of READWRITE usage, read_sync_obj
>>>>>> would be equal to write_sync_obj.
>>>>>
>>>>> Sure, I'm fine with that.
>>>>>
>>>>> One other thing, though, that makes me a little puzzled:
>>>>>
>>>>> Let's assume you don't allow readers and writers at the same time,
>>>>> which is my perception of how read and write fences should work; you
>>>>> either have a list of read fences or a single write fence (in the
>>>>> same way a read-write lock works).
>>>>>
>>>>> Now, if you only allow a single read fence, like in this patch, that
>>>>> implies that you can only have either a single read fence or a single
>>>>> write fence at any one time. We'd only need a single fence pointer on
>>>>> the bo, and sync_obj_arg would tell us whether to signal the fence
>>>>> for read or for write (assuming that sync_obj_arg was set up to
>>>>> indicate read / write at validation time); then the patch really
>>>>> isn't necessary at all, as it only allows a single read fence?
>>>>>
>>>>> Or is it that you want to allow read and write fences to co-exist?
>>>>> In that case, what's the use case?
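For illustration, the read/write sync obj scheme being discussed could look
roughly like the sketch below. All of the example_* names are invented for
this sketch and are not the actual TTM or radeon structures; in the READWRITE
case both slots simply end up referencing the same fence.

/* Hypothetical types and helpers, for illustration only. */
struct example_fence;
struct example_fence *example_fence_ref(struct example_fence *fence);
void example_fence_unref(struct example_fence **fence); /* tolerates NULL */

#define EXAMPLE_USAGE_READ  (1 << 0)
#define EXAMPLE_USAGE_WRITE (1 << 1)

struct example_bo_sync {
        struct example_fence *read_fence;  /* last read access, or NULL  */
        struct example_fence *write_fence; /* last write access, or NULL */
};

/*
 * Attach a fence at validation time.  For READWRITE usage the same
 * fence ends up in both slots (read_fence == write_fence), much like
 * a read-write lock where a writer also excludes readers.
 */
void example_bo_attach_fence(struct example_bo_sync *sync,
                             struct example_fence *fence,
                             unsigned int usage)
{
        if (usage & EXAMPLE_USAGE_READ) {
                example_fence_unref(&sync->read_fence);
                sync->read_fence = example_fence_ref(fence);
        }
        if (usage & EXAMPLE_USAGE_WRITE) {
                example_fence_unref(&sync->write_fence);
                sync->write_fence = example_fence_ref(fence);
        }
}

Note that overwriting the previous fence like this implicitly assumes the old
fence signals no later than the new one; if that ordering can't be assumed,
the old fence has to be waited on first, which is exactly the ordering
question discussed further down in the thread.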
>>>> There are lots of read-write use cases which don't need any barriers
>>>> or flushing. The obvious ones are color blending and depth-stencil
>>>> buffering. An OpenGL application is also allowed to use a subrange of
>>>> a buffer as a vertex buffer (read-only) and another, disjoint subrange
>>>> of the same buffer for transform feedback (write-only), which kind of
>>>> makes me wonder whether we should track subranges instead of treating
>>>> a whole buffer as "busy". It gets even more funky with
>>>> ARB_shader_image_load_store, which supports atomic read-modify-write
>>>> operations on textures, not to mention atomic memory operations in
>>>> compute shaders (wait, isn't that also exposed in GL as
>>>> GL_ARB_shader_atomic_counters?).
>>>>
>>>> I was thinking about whether the two sync objs should be "read" and
>>>> "readwrite", or "read" and "write". I chose the latter, because it's
>>>> more fine-grained and we have to keep at least two of them around
>>>> anyway.
>>>>
>>>> So now that you know what we use sync objs for, what are your ideas on
>>>> re-implementing that patch in a way that is okay with you? Besides
>>>> removing the third sync obj, of course.
>>>>
>>>> Marek
>>>
>>> OK. First I think we need to make a distinction: bo sync objects vs.
>>> driver fences. The bo sync obj API is there strictly to provide
>>> functionality that the ttm bo subsystem is using, and it follows a
>>> simple set of rules:
>>>
>>> 1) The bo subsystem never assumes sync objects are ordered. That means
>>> the bo subsystem needs to wait on a sync object before removing it from
>>> a buffer. Any other assumption is buggy and must be fixed. BUT, if that
>>> assumption takes place in the driver, unknown to the ttm bo subsystem
>>> (which is usually the case), it's OK.
>>>
>>> 2) When the sync object(s) attached to the bo are signaled, the ttm bo
>>> subsystem is free to copy the bo contents and to unbind the bo.
>>>
>>> 3) The ttm bo subsystem allows sync objects to be signaled in different
>>> ways, opaque to the subsystem, using sync_obj_arg. The driver is
>>> responsible for setting up that argument.
>>>
>>> 4) Driver fences may be used for, or expose, other functionality or
>>> adaptations to APIs as long as the sync obj API exported to the bo
>>> subsystem follows the above rules.
>>>
>>> This means the following w.r.t. the patch:
>>>
>>> A) It violates 1). This is a bug that must be fixed. Assumptions that if
>>> one sync object is signaled, another sync object is also signaled must
>>> be made in the driver and not in the bo subsystem. Hence we need to
>>> explicitly wait for a fence before removing it from the bo.
>>>
>>> B) The sync_obj_arg carries *per-sync-obj* information on how it should
>>> be signaled. If we need to attach multiple sync objects to a buffer
>>> object, we also need multiple sync_obj_args. This is a bug and needs to
>>> be fixed.
>>>
>>> C) There is really only one reason for the ttm bo subsystem to care
>>> about multiple sync objects, and that is because the driver can't order
>>> them efficiently. One such example would be hardware with multiple
>>> pipes reading simultaneously from the same texture buffer. Currently we
>>> don't support this, so only the *last* sync object needs to be known by
>>> the bo subsystem. Keeping track of multiple fences generates a lot of
>>> completely unnecessary code in the ttm_bo_util file and the ttm_bo_vm
>>> file, and will be a nightmare if / when we truly support pipelined
>>> moves.
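To make rule 1) and point A) concrete: since the bo subsystem may not assume
fences signal in order, a fence can only be dropped from a buffer after it
has been waited on. A rough sketch, using hypothetical example_* helpers
rather than the real ttm_bo_wait() path:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence helpers, for illustration only. */
struct example_fence;
int example_fence_wait(struct example_fence *fence, bool interruptible);
void example_fence_unref(struct example_fence **fence);

/*
 * Wait for the fence in 'slot' and only then detach it.  Any "newer
 * fence implies older fences are done" shortcut belongs in the driver,
 * where the ordering is actually known, not here.
 */
int example_bo_detach_fence(struct example_fence **slot, bool interruptible)
{
        int ret;

        if (*slot == NULL)
                return 0;

        ret = example_fence_wait(*slot, interruptible);
        if (ret != 0)
                return ret; /* e.g. interrupted; keep the fence attached */

        example_fence_unref(slot);
        return 0;
}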
>>
>> Just an FYI, cayman GPUs have multiple rings, and we will be working to
>> support them in the coming months.
>>
>> Alex
>
> Right. Nvidia has had it for a long time, but I think they use barriers
> ("Semaphores") to order sync objects when needed, so that only the last
> sync object is attached to the buffer.
>
> If you need to / want to support simultaneous ring access to a buffer
> without such barriers, we should change the single buffer object fence
> pointer to a list of pointers to fence objects.

We were planning to use semaphores as well.

Alex

> /Thomas
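In rough terms, the semaphore approach boils down to something like the
sketch below: the previous ring signals a GPU semaphore and the next ring
waits on it before touching the buffer, so only the last ring's fence has to
stay attached to the buffer object. The example_* names are made up for
illustration and are not the actual radeon interfaces.

/* Hypothetical ring and semaphore helpers, for illustration only. */
struct example_ring;
struct example_semaphore;
void example_ring_emit_semaphore_signal(struct example_ring *ring,
                                        struct example_semaphore *sem);
void example_ring_emit_semaphore_wait(struct example_ring *ring,
                                      struct example_semaphore *sem);

/*
 * Emit a GPU-side barrier ordering next_ring after prev_ring, so the
 * CPU never needs to track more than the most recent fence per buffer.
 */
void example_order_rings(struct example_ring *prev_ring,
                         struct example_ring *next_ring,
                         struct example_semaphore *sem)
{
        if (prev_ring == next_ring)
                return; /* same ring: commands are already ordered */

        example_ring_emit_semaphore_signal(prev_ring, sem);
        example_ring_emit_semaphore_wait(next_ring, sem);
}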