On 21.08.19 at 19:35, Chris Wilson wrote:
Quoting Chris Wilson (2019-08-21 16:24:22)
Quoting Christian König (2019-08-21 13:31:45)
@@ -117,17 +120,10 @@ i915_gem_busy_ioctl(struct drm_device *dev, void *data,
	args->busy =
		busy_check_writer(rcu_dereference(obj->base.resv->fence_excl));
/* Translate shared fences to READ set of engines */
- list = rcu_dereference(obj->base.resv->fence);
- if (list) {
- unsigned int shared_count = list->shared_count, i;
-
- for (i = 0; i < shared_count; ++i) {
- struct dma_fence *fence =
- rcu_dereference(list->shared[i]);
-
- args->busy |= busy_check_reader(fence);
- }
- }
+ readers = dma_resv_fences_get_rcu(&obj->base.resv->readers);
+ dma_fence_array_for_each(fence, cursor, readers)
+ args->busy |= busy_check_reader(fence);
+ dma_fence_put(readers);
That's underwhelming; the full-mb shows up in scaling tests (I'll test
the impact of this series later). Something like,
To put some numbers to it, adding the full-mb adds 5ns to a single
thread on Kabylake and 20ns under contention.
The question is whether that's the use case we want to optimize for.
Querying a buffer for busyness is something we do extremely rarely on
amdgpu; e.g., IIRC we even grab the full reservation lock for this.
But adding new fences happens with every command submission.
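For illustration only, the rare-path query alluded to above could look
roughly like the sketch below. This is not the actual amdgpu code: the
helpers are the generic dma_resv ones, query_busy_locked() is a made-up
name, and only the exclusive fence is checked since how the shared
fences are stored is exactly what this series changes.

	/*
	 * Hedged sketch of the "rare path" query: just take the
	 * reservation lock and look at the fences directly, no RCU or
	 * barrier tricks needed.
	 */
	static int query_busy_locked(struct dma_resv *resv, bool *busy)
	{
		struct dma_fence *excl;
		int ret;

		ret = dma_resv_lock_interruptible(resv, NULL);
		if (ret)
			return ret;

		excl = dma_resv_get_excl(resv);
		*busy = excl && !dma_fence_is_signaled(excl);

		dma_resv_unlock(resv);
		return 0;
	}

The point is just that a query which runs this rarely can afford the
lock, while the per-submission fence add cannot afford a full barrier.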
What could maybe work is the "do { } while (fence has changed);" loop you
suggested earlier in this mail thread, but I need to double-check whether
that would really work or clash with recycling dma_fence_arrays().
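To make the idea concrete, here is a rough sketch of such a retry loop,
written against the hunk above but assuming, purely for illustration,
that resv->readers is a plain __rcu pointer to a dma_fence_array (the
posted series actually hands it to dma_resv_fences_get_rcu(), so the
real code would differ):

	struct dma_fence_array *readers;
	unsigned int i;

	rcu_read_lock();
	do {
		readers = rcu_dereference(obj->base.resv->readers);

		args->busy =
			busy_check_writer(rcu_dereference(obj->base.resv->fence_excl));

		/* Translate shared fences to READ set of engines */
		if (readers) {
			for (i = 0; i < readers->num_fences; i++)
				args->busy |= busy_check_reader(readers->fences[i]);
		}

		/*
		 * Retry if the readers pointer changed while we were
		 * looking.  This assumes the array is freed under RCU,
		 * and pointer equality stops being a sufficient check
		 * once dma_fence_arrays are recycled (ABA), which is
		 * exactly the open question here.
		 */
	} while (readers != rcu_access_pointer(obj->base.resv->readers));
	rcu_read_unlock();

Whether the occasional retry ends up cheaper than the 5-20ns of the full
barrier is the trade-off to measure.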
Christian.