On 01/22/2014 04:09 PM, Daniel Vetter wrote:
> On Wed, Jan 22, 2014 at 01:52:51PM +0100, Thomas Hellstrom wrote:
>> On 01/22/2014 01:38 PM, Maarten Lankhorst wrote:
>>> op 22-01-14 13:11, Thomas Hellstrom schreef:
>>>> On 01/22/2014 11:58 AM, Maarten Lankhorst wrote:
>>>>> op 22-01-14 11:27, Thomas Hellstrom schreef:
>>>>>> On 01/22/2014 10:55 AM, Maarten Lankhorst wrote:
>>>>>>> op 22-01-14 10:40, Thomas Hellstrom schreef:
>>>>>>>> On 01/22/2014 09:19 AM, Maarten Lankhorst wrote:
>>>>>>>>> op 21-01-14 18:44, Thomas Hellstrom schreef:
>>>>>>>>>> On 01/21/2014 04:29 PM, Maarten Lankhorst wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>>
>>>>>>>>>>> op 21-01-14 16:17, Thomas Hellstrom schreef:
>>>>>>>>>>>> Maarten, for this and the other patches in this series,
>>>>>>>>>>>>
>>>>>>>>>>>> I seem to recall we had this discussion before? IIRC I stated that reservation was too heavy-weight a lock to hold to determine whether a buffer was idle? It's a pretty nasty thing to build in.
>>>>>>>>>>>>
>>>>>>>>>>> I've sent this patch after determining that this already didn't end up being heavyweight. Most places were already using the fence_lock and reservation, I just fixed up the few places that didn't hold a reservation while waiting. Converting those few places ended up being trivial, so I thought I'd submit it.
>>>>>>>>>> Actually the only *valid* reasons for holding a reservation when waiting for idle are
>>>>>>>>>> 1) You want to block further command submission on the buffer.
>>>>>>>>>> 2) You want to switch GPU engine and don't have access to gpu semaphores / barriers.
>>>>>>>>>>
>>>>>>>>>> Reservation has the nasty side effect that it blocks command submission and pins the buffer (and in addition now makes the evict list traversals skip the buffer), which in general is *not* necessary for most wait cases, so we should instead actually convert the wait cases that don't fulfill 1) and 2) above in the other direction, if we have performance and latency reduction in mind. I can't see how a spinlock protecting a fence pointer or fence list is stopping you from using RW fences, as long as the spinlock is held while manipulating the fence list?
>>>>>>>>>>
>>>>>>>>> You wish. Fine, I'll enumerate all cases of ttm_bo_wait (with the patchset, though) and note whether they can be changed to work without reservation or not.
>>>>>>>>>
>>>>>>>>> ttm/ttm_bo.c
>>>>>>>>> ttm_bo_cleanup_refs_or_queue: needs reservation and ttm_bo_wait to finish for the direct destroy fastpath; if either fails it needs to be queued. Cannot work without reservation.
>>>>>>>> Doesn't block, and no significant reservation contention expected.
>>>>>>>>
>>>>>>>>> ttm_bo_cleanup_refs_and_unlock: already drops the reservation to wait, doesn't need to re-acquire. Simply reordering ttm_bo_wait until after re-reserve is enough.
>>>>>>>> Currently follows the above rules.
>>>>>>>>
>>>>>>>>> ttm_bo_evict: already has the reservation, which cannot be dropped since only trylock is allowed. Dropping the reservation would cause badness, so it cannot be converted.
>>>>>>>> Follows rule 2 above.
>>>>>>>> We're about to move the buffer, and if that can't be pipelined using the GPU (which TTM currently doesn't allow), we need to wait. Although eviction should be low priority compared to new command submission, so I can't really see why we couldn't wait before trying to reserve here?
>>>>>>>>
>>>>>>>>> ttm_bo_move_buffer: called from ttm_bo_validate, cannot drop the reservation for the same reason as ttm_bo_evict. It might be part of a ticketed reservation, so really don't drop the lock here.
>>>>>>>> Part of command submission and as such follows rule 2 above. If we can pipeline the move with the GPU, no need to wait (but that needs to be implemented, of course).
>>>>>>>>
>>>>>>>>> ttm_bo_synccpu_write_grab: the wait could be converted to be done afterwards, without fence_lock. But in this case reservation could take the role of fence_lock too, so no separate fence_lock would be needed.
>>>>>>>> With the exception that reservation is more likely to be contended.
>>>>>>> True, but rule 1.
>>>>>>>>> ttm_bo_swapout: see ttm_bo_evict.
>>>>>>>>>
>>>>>>>>> ttm/ttm_bo_util.c:
>>>>>>>>> ttm_bo_move_accel_cleanup: calls ttm_bo_wait, cannot drop the lock, see ttm_bo_move_buffer, which it can be called from.
>>>>>>>> Rule 2.
>>>>>>>>
>>>>>>>>> ttm/ttm_bo_vm.c
>>>>>>>>> ttm_bo_vm_fault_idle: I guess you COULD drop the reservation here, but you already had the reservation, so a similar optimization to ttm_bo_synccpu_write_grab could be done without requiring fence_lock. If you would write it like that, you would end up with a patch similar to drm/nouveau: add reservation to nouveau_gem_ioctl_cpu_prep. I think we should do this, an
>>>>>>>>>
>>>>>>>>> Ok, so the core does NOT need fence_lock, because we can never drop reservations except in synccpu_write_grab and maybe ttm_bo_vm_fault_idle, but even in those cases reservation is done. So that could be used instead of fence_lock.
>>>>>>>>>
>>>>>>>>> nouveau_gem_ioctl_cpu_prep:
>>>>>>>>> Either block on a global spinlock or a local reservation lock. Doesn't matter much which; I don't see the need to keep a global lock for this function...
>>>>>>>>> 2 cases can happen in the trylock reservation failure case: the buffer is not reserved, so it's not in the process of being evicted; or the buffer is reserved, which means it's being used in command submission right now, or in one of the functions described above (e.g. not idle).
>>>>>>>>>
>>>>>>>>> nouveau_gem_pushbuf_reloc_apply:
>>>>>>>>> has to call ttm_bo_wait with reservation, cannot be dropped.
>>>>>>>>>
>>>>>>>>> So for core ttm and nouveau the fence_lock is never needed; radeon has only 1 function that calls ttm_bo_wait, which uses a reservation too. It doesn't need the fence_lock either.
>>>>>>>> And vmwgfx now also has a synccpu IOCTL (see drm-next).
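As a rough illustration of the lightweight idle check being argued for here (reading a spinlock-protected fence pointer, taking a reference, and waiting with no buffer lock held), a minimal sketch follows. The structure layout and helper names (my_bo, my_fence_get(), my_fence_wait()) are illustrative assumptions, not the actual TTM API:

#include <linux/spinlock.h>
#include <linux/types.h>

struct my_fence;                                /* refcounted driver fence */
struct my_fence *my_fence_get(struct my_fence *f);
void my_fence_put(struct my_fence *f);
int my_fence_wait(struct my_fence *f, bool interruptible);

struct my_bo {
        spinlock_t fence_lock;                  /* protects @fence only */
        struct my_fence *fence;                 /* last fence, NULL when idle */
};

static int my_bo_wait_idle(struct my_bo *bo, bool interruptible)
{
        struct my_fence *fence;
        int ret;

        /* Only the pointer read is serialized by the spinlock... */
        spin_lock(&bo->fence_lock);
        fence = bo->fence ? my_fence_get(bo->fence) : NULL;
        spin_unlock(&bo->fence_lock);

        if (!fence)
                return 0;                       /* already idle */

        /*
         * ...while the potentially long GPU wait holds no buffer lock at
         * all, so command submission and eviction of this buffer are
         * never blocked by an idle check.
         */
        ret = my_fence_wait(fence, interruptible);
        my_fence_put(fence);
        return ret;
}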
>>>>>>>>
>>>>>>>> So assuming that we converted the functions that can be converted to wait outside of reservation, the same way you have done with Nouveau, leaving the ones that fall under 1) and 2) above, I would still argue that a spinlock should be used, because taking a reservation may implicitly mean waiting for the gpu, and could give bad performance and latency characteristics. You shouldn't need to wait for the gpu to check for buffer idle.
>>>>>>> Except that without reservation you can't tell if the buffer is really idle, or is currently being used as part of some command submission/eviction before the fence pointer is set.
>>>>>>>
>>>>>> Yes, but when that matters, you're either in case 1 or case 2 again. Otherwise, when you release the reservation, you still don't know. A typical example of this is the vmwgfx synccpu ioctl, where you can either choose to block command submission (not used currently) or not (user-space inter-process synchronization). The former is a case 1 wait and holds the reservation while waiting for idle and then ups cpu_writers. The latter waits without reservation for previously submitted rendering to finish.
>>>>> Yeah you could, but what exactly are you waiting on then? If it's some specific existing rendering, I would argue that you should create an android userspace fence during command submission, or provide your own api to block on a specific fence in userspace.
>>>>>
>>>>> If you don't, then I think taking a reservation is not unreasonable. In the most common case the buffer is idle and not reserved, so it isn't contested. The actual waiting itself can be done without the reservation held, by taking a reference on the fence.
>>>> Yeah, here is where we disagree. I'm afraid people will start getting sloppy with reservations and use them to protect more stuff, and after a while they start wondering why the GPU command queue drains...
>>>>
>>>> Perhaps we could agree on a solution (building on one of your original ideas) where we require reservation to modify the fence pointers and the buffer object moving flag, but the structure holding the fence pointer(s) is RCU safe, so that the pointers can be safely read under an rcu lock.
>>> I think not modifying the fence pointer without reservation would be safest. I also don't think readers need the capability to clear sync_obj; this might simplify the implementation some.
>>>
>>> But my preferred option is getting rid of sync_obj completely, and moving to using reservation_object->fence_shared/exclusive, like the incomplete proof of concept conversion done in nouveau. But then I do need to grab the reservation lock to touch things, because fences may be set by the i915 driver I share the reservation_object with.
>>>
>>> Alternatively, could vmwgfx hold a spinlock when decrementing the fence refcount instead? Then we wouldn't need this in the core, and vmwgfx could use:
>> Maarten,
>> requiring reservation to access the fence pointers really turns my gut! Being able to read them under rcu is a remedy, but one I figure should be the default and recommended thing to do, not a vmware exception. This is about as far as I'm prepared to go.
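For contrast, the alternative Maarten sketches above (the reservation protects the fence pointer, but the wait itself happens after the reservation is dropped, holding only a fence reference) could look roughly like the following; again the names and layout are assumptions rather than the real reservation_object interface:

#include <linux/ww_mutex.h>
#include <linux/types.h>

struct my_fence;                                /* refcounted driver fence */
struct my_fence *my_fence_get(struct my_fence *f);
void my_fence_put(struct my_fence *f);
int my_fence_wait(struct my_fence *f, bool interruptible);

struct my_resv_bo {
        struct ww_mutex resv;                   /* reservation lock */
        struct my_fence *fence;                 /* protected by @resv */
};

static int my_bo_wait_unreserved(struct my_resv_bo *bo, bool interruptible)
{
        struct my_fence *fence;
        int ret;

        /*
         * In the common case the buffer is idle and unreserved, so this
         * lock is uncontended; if the buffer is being submitted or
         * evicted, this may block until that is done.
         */
        ret = ww_mutex_lock_interruptible(&bo->resv, NULL);
        if (ret)
                return ret;

        fence = bo->fence ? my_fence_get(bo->fence) : NULL;
        ww_mutex_unlock(&bo->resv);

        if (!fence)
                return 0;

        /* The GPU wait itself runs with the reservation already dropped. */
        ret = my_fence_wait(fence, interruptible);
        my_fence_put(fence);
        return ret;
}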
> Let me jump into your discussion and have a bit of fun too ;-)
>
> More seriously, I think we should take a step back and look at the larger picture: the overall aim is to allow cross-device shared dma-bufs to get fenced/reserved/whatever. Which means the per-device fence_lock ttm is currently using won't work any more. So we need to change things a bit.
>
> I see a few solutions. Note that I haven't checked the implications for existing drivers (especially ttm) in detail, so please correct me when some of these ideas are horrible to implement:
>
> - Make fence_lock a global thing instead of per-device. Probably not what we want, given that dma-buf (and also all the ttm state) has more fine-grained locking.
>
> - Remove the fence_lock and protect fences with the reservation lock the dma-buf already has. Has the appeal of being the simplest solution, at least if we exclude the One Lock to Rule Them All approach ;-)
>
> - Add a new per-buffer spinlock just to protect the fences. Could end up being rather costly for the non-contended common case where we just want to push tons of buffers through execbuf ioctls.
>
> - Allow fences attached to dma-bufs to be accessed read-only (to grab references of them and wait on them) using rcu as protection. I think we need some trickery with kref_get_unless_zero to make sure the rcu-delayed freeing of fences doesn't race in bad ways with lockless kref_gets. Another approach would be to rcu-delay all kref_puts, but I don't think we want that.
>
> Personally I prefer the 2nd approach since it's the simplest, while not being un-scalable like the first. My experience with the single lock in i915, where any contention and especially any waiting while holding locks is massively exaggerated, is that lock-dropping games around the common gpu wait points are sufficient. Actually, in almost all cases the fence_lock wouldn't be sufficient for us, since we need to check buffer placement, mmaps and similar things anyway.
>
> Now I see that there are valid cases where we want the lowest possible overhead for waiting on, or just checking for, outstanding rendering. OpenCL with its fancy/explicit synchronization model seems to be the prime example that usually pops up. For such uses I think it's better to just directly expose fences to userspace and completely eschew any indirection through buffer objects.
>
> That leaves the issue of stalling unrelated processes when they try to look up fences while another process is heavily thrashing vram and so holds tons of reservations on non-shared objects and blocks waiting for the gpu to complete more batches. But from my cursory understanding, non-UMA platforms currently (at least with the drivers we have and the memory management logic) drop off a rather steep cliff. So I fear that micro-optimizing this case is investing complexity into the wrong place. At least in i915 we've implemented quite a pile of tricks to smooth off the gtt thrashing cliff before even considering improving lock contention and lock holding times.
>
> So overall I'm heavily favouring the simple approach of just reusing the reservation ww mutex to protect fence state, but I'm definitely not rejecting more complex approaches out of hand. I just think that we should have solid data to justify the complexity.
>
> Finally, if we can't reach an agreement here, I guess we could duct-tape something together where ttm objects only used by a single driver are protected by the fence_lock and other, shared buffers are protected by the reservation. It won't be pretty, but the impact should be fairly contained. Especially since many paths in ttm currently grab both locks anyway, so they wouldn't need any special handling.
>
> I hope this helps to move the discussion forward.

First, I think in a situation like this, with radically different opinions, one needs to be prepared to compromise to move things forward. And not just on the TTM side. And I don't think making the dma-buf fence pointer structure rcu-safe is a big step that is in any way complex.

If the sync_object or fence ops don't support get_unless_zero(), we simply don't use the RCU path. Fence object exporters that do care implement those.

/Thomas

>
> Cheers, Daniel
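To make the rcu-based compromise concrete: below is a minimal sketch of the lockless read side Daniel and Thomas refer to, using rcu plus kref_get_unless_zero(), under the assumption of an illustrative fence structure (my_fence) that is freed only after an rcu grace period. Writers would still hold the reservation to install or clear the pointer; none of this is the actual dma-buf or TTM code:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_fence {
        struct kref kref;
        struct rcu_head rcu;
        /* ... exporter-specific state ... */
};

struct my_rcu_bo {
        struct my_fence __rcu *fence;   /* writers must hold the reservation */
};

static void my_fence_release(struct kref *kref)
{
        struct my_fence *f = container_of(kref, struct my_fence, kref);

        /* rcu-delayed free, so lockless readers never touch freed memory */
        kfree_rcu(f, rcu);
}

/* Read side: no reservation, no spinlock. */
static struct my_fence *my_bo_get_fence_rcu(struct my_rcu_bo *bo)
{
        struct my_fence *fence;

        rcu_read_lock();
        fence = rcu_dereference(bo->fence);
        /*
         * The last reference may be dropped concurrently; only a
         * successful kref_get_unless_zero() guarantees the fence is still
         * valid once we leave the rcu read-side section.
         */
        if (fence && !kref_get_unless_zero(&fence->kref))
                fence = NULL;
        rcu_read_unlock();

        return fence;   /* caller drops it with kref_put(..., my_fence_release) */
}

Exporters whose fences cannot support a get_unless_zero()-style operation would simply be looked up under the lock instead, as Thomas notes above.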