Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

Jerome Glisse <j.glisse@xxxxxxxxx> · Fri, 7 Oct 2011 09:38:12 -0400

On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <thomas@xxxxxxxxxxxx> wrote:
> On 10/07/2011 12:42 AM, Marek Olšák wrote:
>>
>> On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom<thomas@xxxxxxxxxxxx>
>>  wrote:
>>
>>>
>>> In any case, I'm not saying fences is the best way to flush but since the
>>> bo
>>> code assumes that signaling a sync object means "make the buffer contents
>>> available for CPU read / write", it's usually a good way to do it;
>>> there's
>>> even a sync_obj_flush() method that gets called when a potential flush
>>> needs
>>> to happen.
>>>
>>
>> I don't think we use it like that. To my knowledge, the purpose of the
>> sync obj (to Radeon Gallium drivers at least) is to be able to wait
>> for the last use of a buffer. Whether the contents can or cannot be
>> available to the CPU is totally irrelevant.
>>
>> Currently (and it's a very important performance optimization),
>> buffers stay mapped and available for CPU read/write during their
>> first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
>> on buffer destruction. We only call bo_wait when we want to wait for
>> the GPU until it's done with the buffer (we don't always want that,
>> sometimes we want to use the unsychronized flag). Otherwise the
>> contents of buffers are available at *any time*.
>>
>> We could probably implement bo_wait privately in the kernel driver and
>> not use ttm_bo_wait. I preferred code sharing though.
>>
>> Textures (especially the tiled ones) are never mapped directly and a
>> temporary staging resource is used instead, so we don't actually
>> pollute address space that much. (in case you would have such a
>> remark) We will use staging resources for buffers too, but it's really
>> the last resort to avoid waiting when direct access can't be used.
>>
>>
>>
>>>>>
>>>>> 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
>>>>> difference between those two? I think we should remove the
>>>>> write_sync_obj
>>>>> bo
>>>>> member.
>>>>>
>>>>>
>>>>
>>>> Okay, but I think we should remove sync_obj instead, and keep write
>>>> and read sync objs. In the case of READWRITE usage, read_sync_obj
>>>> would be equal to write_sync_obj.
>>>>
>>>>
>>>>
>>>
>>> Sure, I'm fine with that.
>>>
>>> One other thing, though, that makes me a little puzzled:
>>>
>>> Let's assume you don't allow readers and writers at the same time, which
>>> is
>>> my perception of how read- and write fences should work; you either have
>>> a
>>> list of read fences or a single write fence (in the same way a read-write
>>> lock works).
>>>
>>> Now, if you only allow a single read fence, like in this patch. That
>>> implies
>>> that you can only have either a single read fence or a single write fence
>>> at
>>> any one time. We'd only need a single fence pointer on the bo, and
>>> sync_obj_arg would tell us whether to signal the fence for read or for
>>> write
>>> (assuming that sync_obj_arg was set up to indicate read / write at
>>> validation time), then the patch really isn't necessary at all, as it
>>> only
>>> allows a single read fence?
>>>
>>> Or is it that you want to allow read- and write fences co-existing? In
>>> that
>>> case, what's the use case?
>>>
>>
>> There are lots of read-write use cases which don't need any barriers
>> or flushing. The obvious ones are color blending and depth-stencil
>> buffering. The OpenGL application is also allowed to use a subrange of
>> a buffer as a vertex buffer (read-only) and another disjoint subrange
>> of the same buffer for transform feedback (write-only), which kinda
>> makes me think about whether we should track subranges instead of
>> treating a whole buffer as "busy". It gets even more funky with
>> ARB_shader_image_load_store, which supports atomic read-modify-write
>> operations on textures, not to mention atomic memory operations in
>> compute shaders (wait, isn't that also exposed in GL as
>> GL_ARB_shader_atomic_counters?).
>>
>> I was thinking whether the two sync objs should be "read" and
>> "readwrite", or "read" and "write". I chose the latter, because it's
>> more fine-grained and we have to keep at least two of them around
>> anyway.
>>
>> So now that you know what we use sync objs for, what are your ideas on
>> re-implementing that patch in a way that is okay with you? Besides
>> removing the third sync objs of course.
>>
>> Marek
>>
>
> OK. First I think we need to make a distinction: bo sync objects vs driver
> fences. The bo sync obj api is there to strictly provide functionality that
> the ttm bo subsystem is using, and that follows a simple set of rules:
>
> 1) the bo subsystem does never assume sync objects are ordered. That means
> the bo subsystem needs to wait on a sync object before removing it from a
> buffer. Any other assumption is buggy and must be fixed. BUT, if that
> assumption takes place in the driver unknowingly from the ttm bo subsystem
> (which is usually the case), it's OK.
>
> 2) When the sync object(s) attached to the bo are signaled the ttm bo
> subsystem is free to copy the bo contents and to unbind the bo.
>
> 3) The ttm bo system allows sync objects to be signaled in different ways
> opaque to the subsystem using sync_obj_arg. The driver is responsible for
> setting up that argument.
>
> 4) Driver fences may be used for or expose other functionality or adaptions
> to APIs as long as the sync obj api exported to the bo sybsystem follows the
> above rules.
>
> This means the following w r t the patch.
>
> A) it violates 1). This is a bug that must be fixed. Assumptions that if one
> sync object is singnaled, another sync object is also signaled must be done
> in the driver and not in the bo subsystem. Hence we need to explicitly wait
> for a fence to remove it from the bo.
>
> B) the sync_obj_arg carries *per-sync-obj* information on how it should be
> signaled. If we need to attach multiple sync objects to a buffer object, we
> also need multiple sync_obj_args. This is a bug and needs to be fixed.
>
> C) There is really only one reason that the ttm bo subsystem should care
> about multiple sync objects, and that is because the driver can't order them
> efficiently. A such example would be hardware with multiple pipes reading
> simultaneously from the same texture buffer. Currently we don't support this
> so only the *last* sync object needs to be know by the bo subsystem. Keeping
> track of multiple fences generates a lot of completely unnecessary code in
> the ttm_bo_util file, the ttm_bo_vm file, and will be a nightmare if / when
> we truly support pipelined moves.
>
> As I understand it from your patches, you want to keep multiple fences
> around only to track rendering history. If we want to do that generically, i
> suggest doing it in the execbuf util code in the following way:
>
> struct ttm_eu_rendering_history {
>    void *last_read_sync_obj;
>    void *last_read_sync_obj_arg;
>    void *last_write_sync_obj;
>    void *last_write_sync_obj_arg;
> }
>
> Embed this structure in the radeon_bo, and build a small api around it,
> including *optionally* passing it to the existing execbuf utilities, and you
> should be done. The bo_util code and bo_vm code doesn't care about the
> rendering history. Only that the bo is completely idle.
>
> Note also that when an accelerated bo move is scheduled, the driver needs to
> update this struct
>
> /Thomas

I should have look at the patch long ago ... anyway i think a better
approach would be to expose fence id as 64bits unsigned to each
userspace client. I was thinking of mapping a page readonly (same page
as the one gpu write back) at somespecific offset in drm file (bit
like sarea but readonly so no lock). Each time userspace submit a
command stream it would get the fence id associated with the command
stream. It would then be up to userspace to track btw read or write
and do appropriate things. The kernel code would be simple (biggest
issue is finding an offset we can use for that), it would be fast as
no round trip to kernel to know the last fence.

Each fence seq would be valid only for a specific ring (only future
gpu impacted here, maybe cayman).

So no change to ttm, just change to radeon to return fence seq and to
move to an unsigned 64. Issue would be when gpu write back is
disabled, then we would either need userspace to call somethings like
bo wait or to other ioctl to get the kernel to update the copy, copy
would be updated in the irq handler too so at least it get updated
anytime something enable irq.

Cheers,
Jerome
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel