On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote:
> On 10/07/2011 12:42 AM, Marek Olšák wrote:
>>
>> On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote:
>>>
>>> In any case, I'm not saying fences are the best way to flush, but since
>>> the bo code assumes that signaling a sync object means "make the buffer
>>> contents available for CPU read / write", it's usually a good way to do
>>> it; there's even a sync_obj_flush() method that gets called when a
>>> potential flush needs to happen.
>>
>> I don't think we use it like that. To my knowledge, the purpose of the
>> sync obj (to Radeon Gallium drivers at least) is to be able to wait for
>> the last use of a buffer. Whether the contents can or cannot be
>> available to the CPU is totally irrelevant.
>>
>> Currently (and it's a very important performance optimization), buffers
>> stay mapped and available for CPU read/write after their first
>> map_buffer call. Unmap_buffer is a no-op, and the unmapping happens on
>> buffer destruction. We only call bo_wait when we want to wait for the
>> GPU until it's done with the buffer (we don't always want that;
>> sometimes we want to use the unsynchronized flag). Otherwise the
>> contents of buffers are available at *any time*.
>>
>> We could probably implement bo_wait privately in the kernel driver and
>> not use ttm_bo_wait. I preferred code sharing, though.
>>
>> Textures (especially the tiled ones) are never mapped directly and a
>> temporary staging resource is used instead, so we don't actually
>> pollute the address space that much (in case you would have such a
>> remark). We will use staging resources for buffers too, but it's really
>> the last resort, to avoid waiting when direct access can't be used.
>>
>>>>> 2) Can't we say that a write_sync_obj is simply a sync_obj? What's
>>>>> the difference between those two? I think we should remove the
>>>>> write_sync_obj bo member.
>>>>
>>>> Okay, but I think we should remove sync_obj instead, and keep write
>>>> and read sync objs. In the case of READWRITE usage, read_sync_obj
>>>> would be equal to write_sync_obj.
>>>
>>> Sure, I'm fine with that.
>>>
>>> One other thing, though, that makes me a little puzzled:
>>>
>>> Let's assume you don't allow readers and writers at the same time,
>>> which is my perception of how read and write fences should work; you
>>> either have a list of read fences or a single write fence (in the same
>>> way a read-write lock works).
>>>
>>> Now, if you only allow a single read fence, like in this patch, that
>>> implies that you can only have either a single read fence or a single
>>> write fence at any one time. We'd only need a single fence pointer on
>>> the bo, and sync_obj_arg would tell us whether to signal the fence for
>>> read or for write (assuming that sync_obj_arg was set up to indicate
>>> read / write at validation time). Then the patch really isn't
>>> necessary at all, as it only allows a single read fence?
>>>
>>> Or is it that you want to allow read and write fences to co-exist? In
>>> that case, what's the use case?
>>
>> There are lots of read-write use cases which don't need any barriers
>> or flushing. The obvious ones are color blending and depth-stencil
>> buffering.
>> The OpenGL application is also allowed to use a subrange of a buffer
>> as a vertex buffer (read-only) and another, disjoint subrange of the
>> same buffer for transform feedback (write-only), which kinda makes me
>> wonder whether we should track subranges instead of treating a whole
>> buffer as "busy". It gets even more funky with
>> ARB_shader_image_load_store, which supports atomic read-modify-write
>> operations on textures, not to mention atomic memory operations in
>> compute shaders (wait, isn't that also exposed in GL as
>> GL_ARB_shader_atomic_counters?).
>>
>> I was thinking about whether the two sync objs should be "read" and
>> "readwrite", or "read" and "write". I chose the latter because it's
>> more fine-grained and we have to keep at least two of them around
>> anyway.
>>
>> So now that you know what we use sync objs for, what are your ideas on
>> re-implementing that patch in a way that is okay with you? Besides
>> removing the third sync obj, of course.
>>
>> Marek
>
> OK. First I think we need to make a distinction: bo sync objects vs.
> driver fences. The bo sync obj api is there strictly to provide
> functionality that the ttm bo subsystem is using, and it follows a
> simple set of rules:
>
> 1) The bo subsystem never assumes sync objects are ordered. That means
> the bo subsystem needs to wait on a sync object before removing it from
> a buffer. Any other assumption is buggy and must be fixed. BUT, if that
> assumption takes place in the driver, unknowingly to the ttm bo
> subsystem (which is usually the case), it's OK.
>
> 2) When the sync object(s) attached to the bo are signaled, the ttm bo
> subsystem is free to copy the bo contents and to unbind the bo.
>
> 3) The ttm bo subsystem allows sync objects to be signaled in different
> ways, opaque to the subsystem, using sync_obj_arg. The driver is
> responsible for setting up that argument.
>
> 4) Driver fences may be used for, or expose, other functionality or
> adaptations to APIs, as long as the sync obj api exported to the bo
> subsystem follows the above rules.
>
> This means the following w.r.t. the patch:
>
> A) It violates 1). This is a bug that must be fixed. Assumptions that if
> one sync object is signaled, another sync object is also signaled must
> be made in the driver and not in the bo subsystem. Hence we need to
> explicitly wait for a fence before removing it from the bo.
>
> B) The sync_obj_arg carries *per-sync-obj* information on how it should
> be signaled. If we need to attach multiple sync objects to a buffer
> object, we also need multiple sync_obj_args. This is a bug and needs to
> be fixed.
>
> C) There is really only one reason for the ttm bo subsystem to care
> about multiple sync objects, and that is because the driver can't order
> them efficiently. One such example would be hardware with multiple
> pipes reading simultaneously from the same texture buffer. Currently we
> don't support this, so only the *last* sync object needs to be known by
> the bo subsystem. Keeping track of multiple fences generates a lot of
> completely unnecessary code in the ttm_bo_util and ttm_bo_vm files, and
> will be a nightmare if / when we truly support pipelined moves.

Just an FYI, cayman GPUs have multiple rings and we will be working to
support them in the coming months.

Alex

> As I understand it from your patches, you want to keep multiple fences
> around only to track rendering history.
> If we want to do that generically, I suggest doing it in the execbuf
> util code in the following way:
>
> struct ttm_eu_rendering_history {
>         void *last_read_sync_obj;
>         void *last_read_sync_obj_arg;
>         void *last_write_sync_obj;
>         void *last_write_sync_obj_arg;
> };
>
> Embed this structure in the radeon_bo and build a small api around it,
> including *optionally* passing it to the existing execbuf utilities, and
> you should be done. The bo_util code and bo_vm code don't care about the
> rendering history, only that the bo is completely idle.
>
> Note also that when an accelerated bo move is scheduled, the driver
> needs to update this struct.
>
> /Thomas
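
Purely for illustration, a minimal sketch of what the "small api" suggested
above could look like. The helper names (ttm_eu_history_init,
ttm_eu_history_add) are hypothetical and not actual TTM or radeon code;
sync-object reference counting and locking are omitted. The semantics simply
follow the thread: a GPU read updates the read slot, a GPU write the write
slot, and a READWRITE access both, so the two pointers end up equal, as Marek
describes.

#include <linux/string.h>	/* memset() */
#include <linux/types.h>	/* bool */

struct ttm_eu_rendering_history {
	void *last_read_sync_obj;
	void *last_read_sync_obj_arg;
	void *last_write_sync_obj;
	void *last_write_sync_obj_arg;
};

/* Reset the history, e.g. when the radeon_bo embedding it is created. */
static inline void ttm_eu_history_init(struct ttm_eu_rendering_history *hist)
{
	memset(hist, 0, sizeof(*hist));
}

/*
 * Record the sync object attached to the bo at execbuf time or when an
 * accelerated move is scheduled.  'reads'/'writes' describe how the GPU
 * accesses the buffer; a READWRITE access sets both slots to the same
 * sync object.
 */
static inline void ttm_eu_history_add(struct ttm_eu_rendering_history *hist,
				      void *sync_obj, void *sync_obj_arg,
				      bool reads, bool writes)
{
	if (reads) {
		hist->last_read_sync_obj = sync_obj;
		hist->last_read_sync_obj_arg = sync_obj_arg;
	}
	if (writes) {
		hist->last_write_sync_obj = sync_obj;
		hist->last_write_sync_obj_arg = sync_obj_arg;
	}
}

With something like this, a CPU mapping for read would only need to wait on
last_write_sync_obj, a CPU write would wait on both slots, and the bo_util /
bo_vm paths would keep waiting on the single sync object attached to the bo,
i.e. on complete idleness, as Thomas describes.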