Re: [PATCH] dma-buf: Replace reservation shared fence array with a compressed radix tree

Hi Chris,

On 14.11.2016 at 09:31, Chris Wilson wrote:
The primary operation on the shared fence arrays is insertion and
retrieval.  Retrieval is reasonably fast, as we just copy the array, but
insertion into the shared fence array is slow as we must iterate over all
current fences to discard an older fence.  (Note also that since the
array is not autopruning, we cannot hook in the fence signal callback due
to locking constraints, thus it can grow very, very large.)  This, coupled
with the atomics to acquire and release the shared fence, causes a severe
performance degradation and scaling bottleneck, as felt by i915.ko in
commit d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common
struct reservation_object").  To speed up insertion, we want a direct
replacement of the per-context slot. Enter the radix tree.
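
For reference, the cost lives in the in-place replacement loop, shown
here heavily simplified from reservation_object_add_shared_inplace() in
drivers/dma-buf/reservation.c (seqcount and locking details omitted):

	for (i = 0; i < fobj->shared_count; ++i) {
		struct dma_fence *old;

		old = rcu_dereference_protected(fobj->shared[i],
						reservation_object_held(obj));
		if (old->context == fence->context) {
			/* Older fence from the same timeline: replace it. */
			RCU_INIT_POINTER(fobj->shared[i], fence);
			dma_fence_put(old);
			return;
		}
	}
	/* No slot for this context yet: append to the array. */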

The kernel already has a couple of options for integer radix trees, but
neither is suitable for us, for a couple of reasons.  First, we need a
per-tree preallocation in order to ensure that adding a fence does not
fail.  Both radixtree and idr only offer preallocation inside a
preempt-disabled critical section, which is unsuitable for our existing
users.  idr has
an internal __idr_pre_get that could be exported - but idr does not yet
support inserting entries at an arbitrary location.  Secondly, neither
tree supports full u64 integers as used by the fence context identifiers
(though radixtree would support that on 64bit platforms).  And finally,
if we ignore the API shortcomings, radixtree still appears too high in
the profiles!

So what are our requirements for the shared fence array?

	- guaranteed insertion
	- fast insertion
	- RCU-safe, fast traversal
	- 64bit identifiers
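
Purely as an illustration of one shape that could satisfy these (the
names, fanout and layout here are made up for the sketch, not taken from
the patch), a compressed layer keyed by u64 context ids might look like:

	#define FENCE_SHIFT	4			/* assumed fanout of 16 */
	#define FENCE_MASK	((1ull << FENCE_SHIFT) - 1)

	struct fence_layer {
		u64 prefix;		/* key bits above this layer's index window */
		unsigned int shift;	/* bit position of the window; 0 at the bottom */
		void __rcu *slots[1 << FENCE_SHIFT];	/* child layers, or fences at shift 0 */
	};

A layer records the prefix of every key beneath it, so runs of levels
with only a single occupied child need not exist at all; that elision is
what makes the tree "compressed" and is what bounds the preallocation
discussed below.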

To guarantee insertion, we need to preallocate enough storage to satisfy
a later insertion.  The reservation object API requires every
add_shared_fence to be preceded by a call to reserve_shared_fence.  For an
uncompressed radix tree we would need to preallocate enough layers to cover
the full 64bit height, but with a compressed radix tree we only require a
maximum of 2 spare layers: in the worst case, inserting one key splits a
single existing layer, needing one new node to hold the diverging prefix
and one more to hold the new key.
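
The calling convention itself stays as it is today, i.e. the familiar
two-step pattern (function names as in the current reservation object
API):

	ret = reservation_object_reserve_shared(obj);	/* may allocate, may fail */
	if (ret)
		return ret;

	/* ... later, with the reservation object locked ... */
	reservation_object_add_shared_fence(obj, fence);	/* must not fail */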

The compressed tree also offers a useful property wrt the pattern of
fences used.  The fence contexts are allocated as a pure cyclic atomic64,
i.e. the id space is sparse and ever incrementing.  However, timelines tend
to cluster (i.e. they will allocate a cluster of fence contexts for
themselves and primarily use those).  So we may see a mixture of low
fence contexts alongside a group of very high contexts (e.g.
a group at id<16 and a group at id>65536).  The compressed tree avoids
having to allocate the intermediate layers (reducing the pointer dancing
on lookup and traversal) - still not as dense as the current compact
array, but under typical conditions just as good.  (However, the current
array requires contiguous allocations - a scarce resource for the ever
expanding array!)  The density could be improved by switching from a
simple, dangerously wrapping atomic64 to an ida for fence context
allocation.
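
To make the clustering effect concrete, here is how a lookup in the
illustrative fence_layer sketch above would descend (again invented for
illustration only; leaves are simplified to fences living at shift 0):

	/* Called under rcu_read_lock(). */
	static struct dma_fence *fence_tree_lookup(struct fence_layer *layer,
						   u64 context)
	{
		while (layer) {
			/* Bits above this layer's window; the double shift
			 * avoids an undefined shift-by-64 for a top layer.
			 */
			u64 mask = (~0ull << layer->shift) << FENCE_SHIFT;
			unsigned int idx;

			/* This one comparison stands in for all the elided
			 * intermediate levels between sparse clusters,
			 * e.g. between the id<16 and id>65536 groups.
			 */
			if ((context & mask) != layer->prefix)
				return NULL;

			idx = (context >> layer->shift) & FENCE_MASK;
			if (!layer->shift)
				/* Bottom layer: slots hold fences directly. */
				return rcu_dereference(layer->slots[idx]);

			layer = rcu_dereference(layer->slots[idx]);
		}
		return NULL;
	}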

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

Yeah, I have been thinking about a similar change for a while now, because we clearly see problems with this on amdgpu as well.

That's also the reason amdgpu internally uses a separate container for fences. That implementation is based on a fixed-size hash table and performs well enough for our cases. It might be a good idea to check that one as well, because it doesn't introduce a new container object just for this use case.
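
For reference, a minimal sketch of that style of container, assuming a
table keyed by fence context along the lines of amdgpu_sync.c (the names
here are invented and error handling is omitted):

	#include <linux/hashtable.h>
	#include <linux/slab.h>
	#include <linux/dma-fence.h>

	struct sync_entry {
		struct hlist_node node;
		struct dma_fence *fence;
	};

	static DEFINE_HASHTABLE(sync_fences, 4);	/* fixed size: 16 buckets */

	/* Replace an older fence from the same context, or insert a new entry. */
	static void sync_add_fence(struct dma_fence *fence)
	{
		struct sync_entry *e;

		hash_for_each_possible(sync_fences, e, node, fence->context) {
			if (e->fence->context != fence->context)
				continue;
			dma_fence_put(e->fence);
			e->fence = dma_fence_get(fence);
			return;
		}

		e = kmalloc(sizeof(*e), GFP_KERNEL);	/* error handling omitted */
		e->fence = dma_fence_get(fence);
		hash_add(sync_fences, &e->node, fence->context);
	}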

In general, for this kind of change I prefer the implementation of the container code to be added separately, with the existing code then changed to use it. That makes reviewing the changes much easier.

Regards,
Christian.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



