Quoting Christian König (2017-11-15 17:34:07) > Am 15.11.2017 um 17:55 schrieb Chris Wilson: > > Quoting Chris Wilson (2017-11-14 14:34:05) > >> Quoting Christian König (2017-11-14 14:24:44) > >>> Am 06.11.2017 um 17:22 schrieb Chris Wilson: > >>>> Quoting Christian König (2017-10-30 14:59:04) > >>>>> @@ -126,17 +127,28 @@ reservation_object_add_shared_inplace(struct reservation_object *obj, > >>>>> dma_fence_put(old_fence); > >>>>> return; > >>>>> } > >>>>> + > >>>>> + if (!signaled && dma_fence_is_signaled(old_fence)) { > >>>>> + signaled = old_fence; > >>>>> + signaled_idx = i; > >>>>> + } > >>>> How much do we care about only keeping one fence per-ctx here? You could > >>>> rearrange this to break on old_fence->context == fence->context || > >>>> dma_fence_is_signaled(old_fence) then everyone can use the final block. > >>> Yeah, that is what David Zhou suggested as well. > >>> > >>> I've rejected this approach for now cause I think we still have cases > >>> where we rely on one fence per ctx (but I'm not 100% sure). > >>> > >>> I changed patch #1 in this series as you suggest and going to send that > >>> out once more in a minute. > >>> > >>> Can we get this upstream as is for now? I won't have much more time > >>> working on this. > >> Sure, we are only discussing how we might make it look tidier, pure > >> micro-optimisation with the caveat of losing the one-fence-per-ctx > >> guarantee. > > Ah, one thing to note is that extra checking pushed one of our corner > > case tests over its time limit. > > > > If we can completely forgo the one-fence-per-ctx here, what works really > > well in testing is > > > > diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c > > index 5319ac478918..5755e95fab1b 100644 > > --- a/drivers/dma-buf/reservation.c > > +++ b/drivers/dma-buf/reservation.c > > @@ -104,39 +104,19 @@ reservation_object_add_shared_inplace(struct reservation_object *obj, > > struct reservation_object_list *fobj, > > struct dma_fence *fence) > > { > > - struct dma_fence *replace = NULL; > > - u32 ctx = fence->context; > > - u32 i; > > - > > dma_fence_get(fence); > > > > preempt_disable(); > > write_seqcount_begin(&obj->seq); > > > > - for (i = 0; i < fobj->shared_count; ++i) { > > - struct dma_fence *check; > > - > > - check = rcu_dereference_protected(fobj->shared[i], > > - reservation_object_held(obj)); > > - > > - if (check->context == ctx || dma_fence_is_signaled(check)) { > > - replace = old_fence; > > - break; > > - } > > - } > > - > > /* > > * memory barrier is added by write_seqcount_begin, > > * fobj->shared_count is protected by this lock too > > */ > > - RCU_INIT_POINTER(fobj->shared[i], fence); > > - if (!replace) > > - fobj->shared_count++; > > + RCU_INIT_POINTER(fobj->shared[fobj->shared_count++], fence); > > > > write_seqcount_end(&obj->seq); > > preempt_enable(); > > - > > - dma_fence_put(replace); > > } > > > > static void > > > > i.e. don't check when not replacing the shared[], on creating the new > > buffer we then discard all the old fences. > > > > It should work for amdgpu as well since you do a ht to coalesce > > redundant fences before queuing. > > That won't work for all cases. This way the reservation object would > keep growing without a chance to ever shrink. We only keep the active fences when it grows, which is effective enough to keep it in check on the workloads I can find in the hour since noticing the failure in CI ;) And on the workloads where it is being flooded with live fences from many contexts, the order of magnitude throughput improvement is not easy to ignore. -Chris