On 01/22/2014 04:09 PM, Daniel Vetter wrote:
> On Wed, Jan 22, 2014 at 01:52:51PM +0100, Thomas Hellstrom wrote:
>> On 01/22/2014 01:38 PM, Maarten Lankhorst wrote:
>>> op 22-01-14 13:11, Thomas Hellstrom schreef:
>>>> On 01/22/2014 11:58 AM, Maarten Lankhorst wrote:
>>>>> op 22-01-14 11:27, Thomas Hellstrom schreef:
>>>>>> On 01/22/2014 10:55 AM, Maarten Lankhorst wrote:
>>>>>>> op 22-01-14 10:40, Thomas Hellstrom schreef:
>>>>>>>> On 01/22/2014 09:19 AM, Maarten Lankhorst wrote:
>>>>>>>>> op 21-01-14 18:44, Thomas Hellstrom schreef:
>>>>>>>>>> On 01/21/2014 04:29 PM, Maarten Lankhorst wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> op 21-01-14 16:17, Thomas Hellstrom schreef:
>>>>>>>>>>>> Maarten, for this and the other patches in this series,
>>>>>>>>>>>>
>>>>>>>>>>>> I seem to recall we had this discussion before? IIRC I stated that reservation was too heavy-weight a lock to hold to determine whether a buffer was idle? It's a pretty nasty thing to build in.
>>>>>>>>>>>>
>>>>>>>>>>> I've sent this patch after determining that this already didn't end up being heavyweight. Most places were already using the fence_lock and reservation, I just fixed up the few places that didn't hold a reservation while waiting. Converting those few places ended up being trivial, so I thought I'd submit it.
>>>>>>>>>> Actually the only *valid* reasons for holding a reservation when waiting for idle are
>>>>>>>>>> 1) You want to block further command submission on the buffer.
>>>>>>>>>> 2) You want to switch GPU engine and don't have access to gpu semaphores / barriers.
>>>>>>>>>>
>>>>>>>>>> Reservation has the nasty side effect that it blocks command submission and pins the buffer (and in addition now makes the evict list traversals skip the buffer), which in general is *not* necessary for most wait cases, so we should instead actually convert the wait cases that don't fulfill 1) and 2) above in the other direction, if we have performance and latency reduction in mind. I can't see how a spinlock protecting a fence pointer or fence list is stopping you from using RW fences, as long as the spinlock is held while manipulating the fence list?
>>>>>>>>>>
>>>>>>>>> You wish. Fine, I'll enumerate all cases of ttm_bo_wait (with the patchset, though) and note whether they can be changed to work without reservation or not.
>>>>>>>>>
>>>>>>>>> ttm/ttm_bo.c
>>>>>>>>> ttm_bo_cleanup_refs_or_queue: needs reservation and ttm_bo_wait to finish for the direct destroy fastpath; if either fails it needs to be queued. Cannot work without reservation.
>>>>>>>> Doesn't block, and no significant reservation contention expected.
>>>>>>>>
>>>>>>>>> ttm_bo_cleanup_refs_and_unlock: already drops the reservation to wait, doesn't need to re-acquire. Simply reordering ttm_bo_wait until after re-reserve is enough.
>>>>>>>> Currently follows the above rules.
>>>>>>>>
>>>>>>>>> ttm_bo_evict: already has the reservation, which cannot be dropped since only trylock is allowed. Dropping the reservation would cause badness, so it cannot be converted.
>>>>>>>> Follows rule 2 above.
>>>>>>>> We're about to move the buffer, and if that can't be pipelined using the GPU (which TTM currently doesn't allow), we need to wait. Although eviction should be low priority compared to new command submission, so I can't really see why we couldn't wait before trying to reserve here?
>>>>>>>>
>>>>>>>>> ttm_bo_move_buffer: called from ttm_bo_validate, cannot drop the reservation for the same reason as ttm_bo_evict. It might be part of a ticketed reservation, so really don't drop the lock here.
>>>>>>>> Part of command submission and as such follows rule 2 above. If we can pipeline the move with the GPU, no need to wait (but that needs to be implemented, of course).
>>>>>>>>
>>>>>>>>> ttm_bo_synccpu_write_grab: the wait could be converted to be done afterwards, without fence_lock. But in this case reservation could take the role of fence_lock too, so no separate fence_lock would be needed.
>>>>>>>> With the exception that reservation is more likely to be contended.
>>>>>>> True, but rule 1.
>>>>>>>>> ttm_bo_swapout: see ttm_bo_evict.
>>>>>>>>>
>>>>>>>>> ttm/ttm_bo_util.c:
>>>>>>>>> ttm_bo_move_accel_cleanup: calls ttm_bo_wait, cannot drop the lock, see ttm_bo_move_buffer, which it can be called from.
>>>>>>>> Rule 2.
>>>>>>>>
>>>>>>>>> ttm/ttm_bo_vm.c
>>>>>>>>> ttm_bo_vm_fault_idle: I guess you COULD drop the reservation here, but you already had the reservation, so a similar optimization to ttm_bo_synccpu_write_grab could be done without requiring fence_lock. If you would write it like that, you would end up with a patch similar to drm/nouveau: add reservation to nouveau_gem_ioctl_cpu_prep. I think we should do this, an
>>>>>>>>>
>>>>>>>>> Ok, so the core does NOT need fence_lock, because we can never drop reservations except in synccpu_write_grab and maybe ttm_bo_vm_fault_idle, but even in those cases reservation is done. So that could be used instead of fence_lock.
>>>>>>>>>
>>>>>>>>> nouveau_gem_ioctl_cpu_prep:
>>>>>>>>> Either block on a global spinlock or a local reservation lock. Doesn't matter much which; I don't see the need to keep a global lock for this function...
>>>>>>>>> 2 cases can happen in the trylock reservation failure case: the buffer is not reserved, so it's not in the process of being evicted; or the buffer is reserved, which means it's being used in command submission right now, or in one of the functions described above (e.g. not idle).
>>>>>>>>>
>>>>>>>>> nouveau_gem_pushbuf_reloc_apply:
>>>>>>>>> has to call ttm_bo_wait with reservation, cannot be dropped.
>>>>>>>>>
>>>>>>>>> So for core ttm and nouveau the fence_lock is never needed; radeon has only 1 function that calls ttm_bo_wait, which uses a reservation too. It doesn't need the fence_lock either.
>>>>>>>> And vmwgfx now also has a synccpu IOCTL (see drm-next).
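As a rough illustration of the lightweight idle check being argued for here (reading a spinlock-protected fence pointer, taking a reference, and waiting with no buffer lock held), a minimal sketch follows. The structure layout and helper names (my_bo, my_fence_get(), my_fence_wait()) are illustrative assumptions, not the actual TTM API:

#include <linux/spinlock.h>
#include <linux/types.h>

struct my_fence;                                /* refcounted driver fence */
struct my_fence *my_fence_get(struct my_fence *f);
void my_fence_put(struct my_fence *f);
int my_fence_wait(struct my_fence *f, bool interruptible);

struct my_bo {
        spinlock_t fence_lock;                  /* protects @fence only */
        struct my_fence *fence;                 /* last fence, NULL when idle */
};

static int my_bo_wait_idle(struct my_bo *bo, bool interruptible)
{
        struct my_fence *fence;
        int ret;

        /* Only the pointer read is serialized by the spinlock... */
        spin_lock(&bo->fence_lock);
        fence = bo->fence ? my_fence_get(bo->fence) : NULL;
        spin_unlock(&bo->fence_lock);

        if (!fence)
                return 0;                       /* already idle */

        /*
         * ...while the potentially long GPU wait holds no buffer lock at
         * all, so command submission and eviction of this buffer are
         * never blocked by an idle check.
         */
        ret = my_fence_wait(fence, interruptible);
        my_fence_put(fence);
        return ret;
}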
>>>>>>>>
>>>>>>>> So assuming that we converted the functions that can be converted to wait outside of reservation, the same way you have done with Nouveau, leaving the ones that fall under 1) and 2) above, I would still argue that a spinlock should be used, because taking a reservation may implicitly mean waiting for the gpu, and could give bad performance and latency characteristics. You shouldn't need to wait for the gpu to check for buffer idle.
>>>>>>> Except that without reservation you can't tell if the buffer is really idle, or is currently being used as part of some command submission/eviction before the fence pointer is set.
>>>>>>>
>>>>>> Yes, but when that matters, you're either in case 1 or case 2 again. Otherwise, when you release the reservation, you still don't know. A typical example of this is the vmwgfx synccpu ioctl, where you can either choose to block command submission (not used currently) or not (user-space inter-process synchronization). The former is a case 1 wait and holds the reservation while waiting for idle and then ups cpu_writers. The latter waits without reservation for previously submitted rendering to finish.
>>>>> Yeah you could, but what exactly are you waiting on then? If it's some specific existing rendering, I would argue that you should create an android userspace fence during command submission, or provide your own api to block on a specific fence in userspace.
>>>>>
>>>>> If you don't, then I think taking a reservation is not unreasonable. In the most common case the buffer is idle and not reserved, so it isn't contested. The actual waiting itself can be done without the reservation held, by taking a reference on the fence.
>>>> Yeah, here is where we disagree. I'm afraid people will start getting sloppy with reservations and use them to protect more stuff, and after a while they start wondering why the GPU command queue drains...
>>>>
>>>> Perhaps we could agree on a solution (building on one of your original ideas) where we require reservation to modify the fence pointers and the buffer object moving flag, but the structure holding the fence pointer(s) is RCU safe, so that the pointers can be safely read under an rcu lock.
>>> I think not modifying the fence pointer without reservation would be safest. I also don't think readers need the capability to clear sync_obj; this might simplify the implementation some.
>>>
>>> But my preferred option is getting rid of sync_obj completely, and moving to using reservation_object->fence_shared/exclusive, like the incomplete proof of concept conversion done in nouveau. But then I do need to grab the reservation lock to touch things, because fences may be set by the i915 driver I share the reservation_object with.
>>>
>>> Alternatively, could vmwgfx hold a spinlock when decrementing the fence refcount instead? Then we wouldn't need this in the core, and vmwgfx could use:
>> Maarten,
>> requiring reservation to access the fence pointers really turns my gut! Being able to read them under rcu is a remedy, but one I figure should be the default and recommended thing to do, not a vmware exception. This is about as far as I'm prepared to go.
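For contrast, the alternative Maarten sketches above (the reservation protects the fence pointer, but the wait itself happens after the reservation is dropped, holding only a fence reference) could look roughly like the following; again the names and layout are assumptions rather than the real reservation_object interface:

#include <linux/ww_mutex.h>
#include <linux/types.h>

struct my_fence;                                /* refcounted driver fence */
struct my_fence *my_fence_get(struct my_fence *f);
void my_fence_put(struct my_fence *f);
int my_fence_wait(struct my_fence *f, bool interruptible);

struct my_resv_bo {
        struct ww_mutex resv;                   /* reservation lock */
        struct my_fence *fence;                 /* protected by @resv */
};

static int my_bo_wait_unreserved(struct my_resv_bo *bo, bool interruptible)
{
        struct my_fence *fence;
        int ret;

        /*
         * In the common case the buffer is idle and unreserved, so this
         * lock is uncontended; if the buffer is being submitted or
         * evicted, this may block until that is done.
         */
        ret = ww_mutex_lock_interruptible(&bo->resv, NULL);
        if (ret)
                return ret;

        fence = bo->fence ? my_fence_get(bo->fence) : NULL;
        ww_mutex_unlock(&bo->resv);

        if (!fence)
                return 0;

        /* The GPU wait itself runs with the reservation already dropped. */
        ret = my_fence_wait(fence, interruptible);
        my_fence_put(fence);
        return ret;
}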
> Let me jump into your discussion and have a bit of fun too ;-)
>
> More seriously, I think we should take a step back and look at the larger picture: the overall aim is to allow cross-device shared dma-bufs to get fenced/reserved/whatever. Which means the per-device fence_lock ttm is currently using won't work any more. So we need to change things a bit.
>
> I see a few solutions. Note that I haven't checked the implications for existing drivers (especially ttm) in detail, so please correct me when some of these ideas are horrible to implement:
>
> - Make fence_lock a global thing instead of per-device. Probably not what we want, given that dma-buf (and also all the ttm state) has more fine-grained locking.
>
> - Remove the fence_lock and protect fences with the reservation lock the dma-buf already has. Has the appeal of being the simplest solution, at least if we exclude the One Lock to Rule Them All approach ;-)
>
> - Add a new per-buffer spinlock just to protect the fences. Could end up being rather costly for the non-contended common case where we just want to push tons of buffers through execbuf ioctls.
>
> - Allow fences attached to dma-bufs to be accessed read-only (to grab references of them and wait on them) using rcu as protection. I think we need some trickery with kref_get_unless_zero to make sure the rcu-delayed freeing of fences doesn't race in bad ways with lockless kref_gets. Another approach would be to rcu-delay all kref_puts, but I don't think we want that.
>
> Personally I prefer the 2nd approach since it's the simplest, while not being un-scalable like the first. My experience with the single lock in i915, where any contention and especially any waiting while holding locks is massively exaggerated, is that lock-dropping games around the common gpu wait points are sufficient. Actually, in almost all cases the fence_lock wouldn't be sufficient for us, since we need to check buffer placement, mmaps and similar things anyway.
>
> Now I see that there are valid cases where we want the lowest possible overhead for waiting on, or just checking for, outstanding rendering. OpenCL with its fancy/explicit synchronization model seems to be the prime example that usually pops up. For such uses I think it's better to just directly expose fences to userspace and completely eschew any indirection through buffer objects.
>
> That leaves the issue of stalling unrelated processes when they try to look up fences while another process is heavily thrashing vram and so holds tons of reservations on non-shared objects and blocks waiting for the gpu to complete more batches. But from my cursory understanding, non-UMA platforms currently (at least with the drivers we have and the memory management logic) drop off a rather steep cliff. So I fear that micro-optimizing this case is investing complexity into the wrong place. At least in i915 we've implemented quite a pile of tricks to smooth off the gtt thrashing cliff before even considering improving lock contention and lock holding times.
>
> So overall I'm heavily favouring the simple approach of just reusing the reservation ww mutex to protect fence state, but I'm definitely not rejecting more complex approaches out of hand. I just think that we should have solid data to justify the complexity.
>
> Finally, if we can't reach an agreement here, I guess we could duct-tape something together where ttm objects only used by a single driver are protected by the fence_lock and other, shared buffers are protected by the reservation. It won't be pretty, but the impact should be fairly contained. Especially since many paths in ttm currently grab both locks anyway, so they wouldn't need any special handling.
>
> I hope this helps to move the discussion forward.

First, I think in a situation like this, with radically different opinions, one needs to be prepared to compromise to move things forward. And not just on the TTM side. And I don't think making the dma-buf fence pointer structure rcu-safe is a big step that is in any way complex.

If the sync_object or fence ops don't support get_unless_zero(), we simply don't use the RCU path. Fence object exporters that do care implement those.

/Thomas

>
> Cheers, Daniel
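To make the rcu-based compromise concrete: below is a minimal sketch of the lockless read side Daniel and Thomas refer to, using rcu plus kref_get_unless_zero(), under the assumption of an illustrative fence structure (my_fence) that is freed only after an rcu grace period. Writers would still hold the reservation to install or clear the pointer; none of this is the actual dma-buf or TTM code:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_fence {
        struct kref kref;
        struct rcu_head rcu;
        /* ... exporter-specific state ... */
};

struct my_rcu_bo {
        struct my_fence __rcu *fence;   /* writers must hold the reservation */
};

static void my_fence_release(struct kref *kref)
{
        struct my_fence *f = container_of(kref, struct my_fence, kref);

        /* rcu-delayed free, so lockless readers never touch freed memory */
        kfree_rcu(f, rcu);
}

/* Read side: no reservation, no spinlock. */
static struct my_fence *my_bo_get_fence_rcu(struct my_rcu_bo *bo)
{
        struct my_fence *fence;

        rcu_read_lock();
        fence = rcu_dereference(bo->fence);
        /*
         * The last reference may be dropped concurrently; only a
         * successful kref_get_unless_zero() guarantees the fence is still
         * valid once we leave the rcu read-side section.
         */
        if (fence && !kref_get_unless_zero(&fence->kref))
                fence = NULL;
        rcu_read_unlock();

        return fence;   /* caller drops it with kref_put(..., my_fence_release) */
}

Exporters whose fences cannot support a get_unless_zero()-style operation would simply be looked up under the lock instead, as Thomas notes above.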