Hi Chris,

On 14.11.2016 at 09:31, Chris Wilson wrote:
The primary operations on the shared fence array are insertion and retrieval. Retrieval is reasonably fast, as we just copy the array, but insertion into the shared fence array is slow as we must iterate over all current fences to discard an older fence. (Note also that since the array is not autopruning, we cannot hook in the fence signal callback due to locking constraints, so it can grow very, very large.) This, coupled with the atomics used to acquire and release the shared fences, causes a severe performance degradation and scaling bottleneck, as felt by i915.ko in commit d07f0e59b2c7 ("drm/i915: Move GEM activity tracking into a common struct reservation_object").

To speed up insertion, we want a direct replacement of the per-context slot. Enter the radix tree. The kernel already has a couple of options for integer radix trees, but neither is suitable for us, for a couple of reasons. First, we need a per-tree preallocation in order to ensure that adding a fence does not fail. Both radixtree and idr only offer preallocation under a preempt-disabled critical section, unsuitable for our existing users. idr has an internal __idr_pre_get that could be exported - but idr does not yet support inserting entries at an arbitrary location. Secondly, neither tree supports full u64 integers as used by the fence context identifiers (though radixtree would support that on 64-bit platforms). And finally, even if we ignore the API shortcomings, radixtree still appears too high in the profiles!

So what are our requirements for the shared fence array?

- guaranteed insertion
- fast insertion
- RCU-safe, fast traversal
- 64-bit identifiers

To guarantee insertion, we need to preallocate enough storage to satisfy a later insertion. The reservation object API requires every add_shared_fence to be preceded by a call to reserve_shared_fence.
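The reserve-then-add contract described above can be sketched in userspace C. This is purely illustrative, not the kernel's actual API: the struct layout and the names `shared_slots`, `reserve_shared` and `add_shared` are made up for the example. The key property is that the fallible allocation happens in the reserve step, so the add step can never fail.

```c
/*
 * Hypothetical sketch of the reserve-then-add pattern: reserve_shared()
 * preallocates storage so that a later add_shared() is guaranteed to
 * succeed.  Names and layout are illustrative only.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct shared_slots {
	uint64_t *contexts;	/* fence context id per slot */
	unsigned int count;
	unsigned int reserved;	/* slots preallocated but not yet used */
};

/* May fail: called up front, before any fence is being tracked. */
static int reserve_shared(struct shared_slots *s)
{
	uint64_t *tmp;

	tmp = realloc(s->contexts,
		      (s->count + s->reserved + 1) * sizeof(*tmp));
	if (!tmp)
		return -1;

	s->contexts = tmp;
	s->reserved++;
	return 0;
}

/* Must not fail: either replaces the slot already holding @context, or
 * consumes one previously reserved slot. */
static void add_shared(struct shared_slots *s, uint64_t context)
{
	unsigned int i;

	for (i = 0; i < s->count; i++) {
		if (s->contexts[i] == context)
			return;	/* replace in place: the O(n) scan */
	}

	assert(s->reserved > 0);
	s->contexts[s->count++] = context;
	s->reserved--;
}
```

The O(n) scan in `add_shared` is exactly the cost the compressed radix tree is meant to remove: a per-context slot found by key rather than by iteration.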
For an uncompressed radix tree we would need to preallocate enough layers to cover the full 64-bit height, but with a compressed radix tree we only require a maximum of 2 spare layers.

The compressed tree also offers a useful property with respect to the pattern of fences used. The fence contexts are allocated from a purely cyclic atomic64, i.e. the keyspace is sparse and ever incrementing. However, timelines tend to cluster (i.e. they will allocate a cluster of fence contexts for themselves and primarily use those). So we may see a mixture of low fence contexts and a group of very high contexts (e.g. a group at id < 16 and a group at id > 65536). The compressed tree avoids having to allocate the intermediate layers (reducing the pointer chasing on lookup and traversal) - still not as dense as the current compact array, but under typical conditions just as good. (However, the current array requires contiguous allocations - a scarce resource for an ever-expanding array!) The density could be improved further by switching from a simple, dangerously wrapping atomic64 to an ida for fence context allocation.

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
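The layer-skipping argument above can be made concrete with a little arithmetic. Assuming a fanout of 64 (6 bits per level, as in the kernel's radix tree), an uncompressed tree over a full u64 keyspace needs ceil(64/6) = 11 levels, so 11 pointer dereferences per lookup. A compressed tree only needs an interior node at the shift where two keys first diverge; the helper below (an illustration, not code from the patch) computes that shift for the clustered example from the text.

```c
/*
 * Illustrative only: find the 6-bit "digit" position at which two keys
 * diverge.  A compressed radix tree needs a single interior node at
 * this shift to separate the two keys, however far apart they are,
 * instead of a full chain of intermediate layers.
 */
#include <assert.h>
#include <stdint.h>

#define FANOUT_SHIFT 6	/* 64 children per node */

static unsigned int divergence_shift(uint64_t a, uint64_t b)
{
	unsigned int shift = FANOUT_SHIFT;

	assert(a != b);
	/* Walk up until the remaining high bits agree (cap at 64 to
	 * avoid an undefined over-wide shift). */
	while (shift < 64 && (a >> shift) != (b >> shift))
		shift += FANOUT_SHIFT;

	return shift - FANOUT_SHIFT;
}
```

For the clusters in the example, a key below 16 and a key above 65536 diverge at shift 12, so one node at that level splits the tree into the two clusters and every layer below it is shared with neither.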
Yeah, I have been thinking about a similar change for a while now, because we clearly see problems with that on amdgpu as well.
That's also the reason amdgpu internally uses a separate container for fences. That implementation is based on a fixed-size hash table and performs well enough for our use cases. It might be a good idea to check that one out as well, because it doesn't introduce a new container object just for this use case.
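The fixed-size hash table idea can be sketched as follows. This is a minimal userspace illustration of the approach, not amdgpu's actual container: keying by fence context makes "replace the older fence for this context" an O(1) probe instead of an array scan. All names and the probing scheme here are assumptions for the example.

```c
/*
 * Sketch of a fixed-size hash table keyed by fence context.  Adding a
 * fence for an already-tracked context replaces the old entry in place.
 * Collisions fall back to linear probing; amdgpu's real container
 * differs in detail.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 128	/* power of two, illustrative */

struct fence_entry {
	uint64_t context;
	uint64_t seqno;
	int used;
};

struct fence_table {
	struct fence_entry slots[TABLE_SIZE];
};

static unsigned int fence_hash(uint64_t context)
{
	/* Fibonacci hashing: top 7 bits of a golden-ratio multiply. */
	return (unsigned int)((context * 0x9e3779b97f4a7c15ULL) >> 57)
		& (TABLE_SIZE - 1);
}

static void fence_table_add(struct fence_table *t,
			    uint64_t context, uint64_t seqno)
{
	unsigned int i = fence_hash(context);

	/* Probe until a free slot or the slot already holding @context. */
	while (t->slots[i].used && t->slots[i].context != context)
		i = (i + 1) & (TABLE_SIZE - 1);

	t->slots[i].context = context;
	t->slots[i].seqno = seqno;	/* replaces any older fence */
	t->slots[i].used = 1;
}

static int fence_table_find(const struct fence_table *t,
			    uint64_t context, uint64_t *seqno)
{
	unsigned int i = fence_hash(context);

	while (t->slots[i].used) {
		if (t->slots[i].context == context) {
			*seqno = t->slots[i].seqno;
			return 1;
		}
		i = (i + 1) & (TABLE_SIZE - 1);
	}
	return 0;
}
```

The fixed size is the trade-off: no reallocation or spare-layer bookkeeping is ever needed, at the cost of a bounded number of tracked contexts per table.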
In general, for this kind of change I prefer that the implementation of the container code is added separately first and the existing code is then changed to use it. That makes reviewing the changes much easier.
Regards,
Christian.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx