On Fri, Apr 28, 2017 at 10:32:58AM +0100, Tvrtko Ursulin wrote:
>
> On 28/04/2017 08:41, Chris Wilson wrote:
> > Track the latest fence waited upon on each context, and only add a new
> > asynchronous wait if the new fence is more recent than the recorded
> > fence for that context. This requires us to filter out unordered
> > timelines, which are noted by DMA_FENCE_NO_CONTEXT. However, in the
> > absence of a universal identifier, we have to use our own
> > i915->mm.unordered_timeline token.
> >
> > v2: Throw around the debug crutches
> > v3: Inline the likely case of the pre-allocation cache being full.
> > v4: Drop the pre-allocation support, we can lose the most recent fence
> > in case of allocation failure -- it just means we may emit more awaits
> > than strictly necessary but will not break.
> > v5: Trim allocation size for leaf nodes, they only need an array of u32
> > not pointers.
> > v6: Create mock_timeline to tidy selftest writing
> > v7: s/intel_timeline_sync_get/intel_timeline_sync_is_later/ (Tvrtko)
> > v8: Prune the stale sync points when we idle.
> > v9: Include a small benchmark in the kselftests
> > v10: Separate the idr implementation into its own compartment.
>
> FYI I am reading v11 and commenting here. Hopefully it works out.
:)

> > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> > ---
> >  drivers/gpu/drm/i915/Makefile                      |   1 +
> >  drivers/gpu/drm/i915/i915_gem.c                    |   1 +
> >  drivers/gpu/drm/i915/i915_gem.h                    |   2 +
> >  drivers/gpu/drm/i915/i915_gem_request.c            |   9 +
> >  drivers/gpu/drm/i915/i915_gem_timeline.c           |  92 +++++-
> >  drivers/gpu/drm/i915/i915_gem_timeline.h           |  36 ++
> >  drivers/gpu/drm/i915/i915_syncmap.c                | 362 +++++++++++++++++++++
> >  drivers/gpu/drm/i915/i915_syncmap.h                |  39 +++
> >  drivers/gpu/drm/i915/selftests/i915_gem_timeline.c | 257 +++++++++++++++
> >  .../gpu/drm/i915/selftests/i915_mock_selftests.h   |   1 +
> >  drivers/gpu/drm/i915/selftests/mock_timeline.c     |  45 +++
> >  drivers/gpu/drm/i915/selftests/mock_timeline.h     |  33 ++
> >  12 files changed, 860 insertions(+), 18 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/i915_syncmap.c
> >  create mode 100644 drivers/gpu/drm/i915/i915_syncmap.h
> >  create mode 100644 drivers/gpu/drm/i915/selftests/i915_gem_timeline.c
> >  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.c
> >  create mode 100644 drivers/gpu/drm/i915/selftests/mock_timeline.h
> >
> > diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> > index 2cf04504e494..7b05fb802f4c 100644
> > --- a/drivers/gpu/drm/i915/Makefile
> > +++ b/drivers/gpu/drm/i915/Makefile
> > @@ -16,6 +16,7 @@ i915-y := i915_drv.o \
> >  	  i915_params.o \
> >  	  i915_pci.o \
> >  	  i915_suspend.o \
> > +	  i915_syncmap.o \
> >  	  i915_sw_fence.o \
> >  	  i915_sysfs.o \
> >  	  intel_csr.o \
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index c1fa3c103f38..f886ef492036 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -3214,6 +3214,7 @@ i915_gem_idle_work_handler(struct work_struct *work)
> >  		intel_engine_disarm_breadcrumbs(engine);
> >  		i915_gem_batch_pool_fini(&engine->batch_pool);
> >  	}
> > +	i915_gem_timelines_mark_idle(dev_priv);
> >
> >  	GEM_BUG_ON(!dev_priv->gt.awake);
> >  	dev_priv->gt.awake = false;
> > diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
> > index 5a49487368ca..ee54597465b6 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.h
> > +++ b/drivers/gpu/drm/i915/i915_gem.h
> > @@ -25,6 +25,8 @@
> >  #ifndef __I915_GEM_H__
> >  #define __I915_GEM_H__
> >
> > +#include <linux/bug.h>
> > +
> >  #ifdef CONFIG_DRM_I915_DEBUG_GEM
> >  #define GEM_BUG_ON(expr) BUG_ON(expr)
> >  #define GEM_WARN_ON(expr) WARN_ON(expr)
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index 5fa4e52ded06..807fc1b65dd1 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -772,6 +772,11 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> >  		if (fence->context == req->fence.context)
> >  			continue;
> >
> > +		/* Squash repeated waits to the same timelines */
> > +		if (fence->context != req->i915->mm.unordered_timeline &&
> > +		    intel_timeline_sync_is_later(req->timeline, fence))
> > +			continue;
> > +
> >  		if (dma_fence_is_i915(fence))
> >  			ret = i915_gem_request_await_request(req,
> >  							     to_request(fence));
> > @@ -781,6 +786,10 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> >  						    GFP_KERNEL);
> >  		if (ret < 0)
> >  			return ret;
> > +
> > +		/* Record the latest fence used against each timeline */
> > +		if (fence->context != req->i915->mm.unordered_timeline)
> > +			intel_timeline_sync_set(req->timeline, fence);
> >  	} while (--nchild);
> >
> >  	return 0;
> > diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.c b/drivers/gpu/drm/i915/i915_gem_timeline.c
> > index b596ca7ee058..a28a65db82e9 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_timeline.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_timeline.c
> > @@ -24,6 +24,31 @@
> >
> >  #include "i915_drv.h"
> >  #include "i915_syncmap.h"
> I think the over-reliance on i915_drv.h being an include-all is
> hurting us in some cases, and even though it is probably very hard
> to untangle, perhaps it is worth making new source files cleaner in
> that respect.

It's pulled in via i915_gem_timeline.h currently, which itself appears
to be pulled in via i915_drv.h. Sigh, yes this is a case where
i915_drv.h hid an error.

> >  static int __i915_gem_timeline_init(struct drm_i915_private *i915,
> >  				    struct i915_gem_timeline *timeline,
> >  				    const char *name,
> > @@ -35,6 +60,12 @@ static int __i915_gem_timeline_init(struct drm_i915_private *i915,
> >
> >  	lockdep_assert_held(&i915->drm.struct_mutex);
> >
> > +	/*
> > +	 * Ideally we want a set of engines on a single leaf as we expect
> > +	 * to mostly be tracking synchronisation between engines.
> > +	 */
> > +	BUILD_BUG_ON(KSYNCMAP < I915_NUM_ENGINES);
>
> Maybe also add BUILD_BUG_ON(!is_power_of_2(KSYNCMAP)) just in case.

Can do over in syncmap_init. I figured we would hit a few
BUILD_BUG_ON_NOT_POWER_OF_2() anyway, but being explicit is sensible.

> > +/**
> > + * i915_syncmap_is_later -- compare against the last known sync point
> > + * @root - pointer to the #i915_syncmap
> > + * @id - the context id (other timeline) we are synchronising to
> > + * @seqno - the sequence number along the other timeline
> > + *
> > + * If we have already synchronised this @root with another (@id) then we can
> > + * omit any repeated or earlier synchronisation requests. If the two timelines
> > + * are already coupled, we can also omit the dependency between the two as that
> > + * is already known via the timeline.
> > + *
> > + * Returns true if the two timelines are already synchronised wrt @seqno,
> > + * false if not and the synchronisation must be emitted.
> > + */
> > +bool i915_syncmap_is_later(struct i915_syncmap **root, u64 id, u32 seqno)
> > +{
> > +	struct i915_syncmap *p;
> > +	unsigned int idx;
> > +
> > +	p = *root;
> > +	if (!p)
> > +		return false;
> > +
> > +	if (likely(__sync_leaf(p, id) == p->prefix))
> > +		goto found;
>
> Are you sure likely is appropriate here?

Yes, it is primarily documenting the intent. If it fails the likely()
prediction, the idea of caching is mostly moot.

> > +
> > +	/* First climb the tree back to a parent branch */
> > +	do {
> > +		p = p->parent;
> > +		if (!p)
> > +			return false;
> > +
> > +		if (__sync_prefix(p, id) == p->prefix)
> > +			break;
> > +	} while (1);
> > +
> > +	/* And then descend again until we find our leaf */
> > +	do {
> > +		if (!p->height)
> > +			break;
> > +
> > +		p = __sync_child(p)[__sync_idx(p, id)];
> > +		if (!p)
> > +			return false;
> > +
> > +		if (__sync_prefix(p, id) != p->prefix)
> > +			return false;
> > +	} while (1);
> > +
> > +	*root = p;
> > +found:
> > +	idx = id & MASK;
>
> Would:
>
>   GEM_BUG_ON(p->height);
>   idx = __sync_idx(p, id);
>
> be correct (even if more verbose) here instead of idx = id & MASK?

Yes. But it will incur a shift (gcc will not know p->height is 0 in
!debug builds). So __sync_leaf_idx().

> > +	if (!(p->bitmap & BIT(idx)))
> > +		return false;
>
> I was thinking briefly whether seqno+bitmap get/set helpers would be
> helpful but I think there is no need. With the __sync_*_prefix
> helpers the algorithm is much more readable already.
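(For reference: the body of seqno_later() is not in the quoted hunks, so
the following is an assumption on my part, but it is just the usual
wraparound-safe u32 comparison in the style of i915_seqno_passed(). A
userspace-testable sketch:)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for the patch's seqno_later(): returns true if
 * seqno a is at or after seqno b, treating the u32 space as a circle.
 * Casting the difference to signed makes the comparison robust across
 * wraparound, as long as the two seqnos are within 2^31 of each other.
 */
static inline bool seqno_later(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}
```

The signed subtraction is the whole trick: near the wrap point,
a small seqno is still "later" than a huge one because the difference
wraps back into the positive range.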
> > +
> > +	return seqno_later(__sync_seqno(p)[idx], seqno);
> > +}
> > +
> > +static struct i915_syncmap *
> > +__sync_alloc_leaf(struct i915_syncmap *parent, u64 id)
> > +{
> > +	struct i915_syncmap *p;
> > +
> > +	p = kmalloc(sizeof(*p) + KSYNCMAP * sizeof(u32), GFP_KERNEL);
> > +	if (unlikely(!p))
> > +		return NULL;
> > +
> > +	p->parent = parent;
> > +	p->height = 0;
> > +	p->bitmap = 0;
> > +	p->prefix = __sync_leaf(p, id);
> > +	return p;
> > +}
> > +
> > +static noinline int
> > +__i915_syncmap_set(struct i915_syncmap **root, u64 id, u32 seqno)
> > +{
> > +	struct i915_syncmap *p = *root;
> > +	unsigned int idx;
> > +
> > +	if (!p) {
> > +		p = __sync_alloc_leaf(NULL, id);
> > +		if (unlikely(!p))
> > +			return -ENOMEM;
> > +
> > +		goto found;
> > +	}
> > +
>
> GEM_BUG_ON(p->prefix == __sync_leaf_prefix(p, id))? Or maybe better to
> try to handle it rather than expect the caller will never do that? By
> handling it I mean not immediately start climbing the tree just below,
> but check for the condition first and goto found.

Hmm, no. s/__i915_syncmap_set/__sync_set/, we shouldn't treat this as
being in isolation, it is really a branch of i915_syncmap_set split out
to micro-optimise the likely path of i915_syncmap_set. So a GEM_BUG_ON()
on the preconditions is fine.

> > +	/* Climb back up the tree until we find a common prefix */
> > +	do {
> > +		if (!p->parent)
> > +			break;
> > +
> > +		p = p->parent;
> > +
> > +		if (__sync_prefix(p, id) == p->prefix)
> > +			break;
> > +	} while (1);
> > +
> > +	/*
> > +	 * No shortcut, we have to descend the tree to find the right layer
> > +	 * containing this fence.
> > +	 *
> > +	 * Each layer in the tree holds 16 (KSYNCMAP) pointers, either fences
> > +	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> > +	 * other nodes (height > 0) are internal layers that point to a lower
> > +	 * node. Each internal layer has at least 2 descendents.
> > +	 *
> > +	 * Starting at the top, we check whether the current prefix matches.
If > >+ * it doesn't, we have gone passed our layer and need to insert a join > >+ * into the tree, and a new leaf node as a descendent as well as the > >+ * original layer. > >+ * > >+ * The matching prefix means we are still following the right branch > >+ * of the tree. If it has height 0, we have found our leaf and just > >+ * need to replace the fence slot with ourselves. If the height is > >+ * not zero, our slot contains the next layer in the tree (unless > >+ * it is empty, in which case we can add ourselves as a new leaf). > >+ * As descend the tree the prefix grows (and height decreases). > >+ */ > >+ do { > >+ struct i915_syncmap *next; > >+ > >+ if (__sync_prefix(p, id) != p->prefix) { > >+ unsigned int above; > >+ > >+ /* insert a join above the current layer */ > >+ next = kzalloc(sizeof(*next) + KSYNCMAP * sizeof(next), > >+ GFP_KERNEL); > > Next is above, right? Would common_parent be correct? Or > lowest_common_parent? Possibly not the name because it is too long, > but I'm trying to figure out if I got the fls and xor business > right. Problem is next here is parent, next later on is child. So next for lack of a better name. > >+ if (unlikely(!next)) > >+ return -ENOMEM; > >+ > >+ above = fls64(__sync_prefix(p, id) ^ p->prefix); > >+ above = round_up(above, SHIFT); > >+ next->height = above + p->height; > >+ next->prefix = __sync_prefix(next, id); > >+ > >+ if (p->parent) > >+ __sync_child(p->parent)[__sync_idx(p->parent, id)] = next; > >+ next->parent = p->parent; > >+ > >+ idx = p->prefix >> (above - SHIFT) & MASK; > >+ __sync_child(next)[idx] = p; > >+ next->bitmap |= BIT(idx); > >+ p->parent = next; > >+ > >+ /* ascend to the join */ > >+ p = next; > >+ } else { > >+ if (!p->height) > >+ break; > >+ } > >+ > >+ /* descend into the next layer */ > >+ GEM_BUG_ON(!p->height); > >+ idx = __sync_idx(p, id); > >+ next = __sync_child(p)[idx]; > >+ if (unlikely(!next)) { > > Why is this one unlikely? 
Iirc, I was painting the malloc failure paths at the time. So this is a
silly one.

> > +			next = __sync_alloc_leaf(p, id);
> > +			if (unlikely(!next))
> > +				return -ENOMEM;
> > +
> > +			__sync_child(p)[idx] = next;
> > +			p->bitmap |= BIT(idx);
> > +
> > +			p = next;
> > +			break;
> > +		}
> > +
> > +		p = next;
> > +	} while (1);
> > +
> > +found:
> > +	GEM_BUG_ON(p->prefix != __sync_leaf(p, id));
> > +	idx = id & MASK;
> > +	__sync_seqno(p)[idx] = seqno;
> > +	p->bitmap |= BIT(idx);
>
> Actually __sync_set_seqno(p, id, seqno) might be useful here and in
> i915_syncmap_set below.
>
> The below looks OK for the moment. Still coming to terms with the
> above loop. Postponing drawing the diagram.. :)

If you have a pretty ascii diagram, I'll paste it in!
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
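P.S. For anyone else puzzling over the fls/xor business in the join
insertion: the height of the new join is the position of the most
significant bit at which the two prefixes differ, rounded up to the next
SHIFT-aligned layer boundary. A standalone sketch of just that
computation (assuming SHIFT == 4 and KSYNCMAP == 16 as in the patch;
fls64_ and join_above are illustrative names, with fls64_ a portable
stand-in for the kernel's fls64):

```c
#include <assert.h>
#include <stdint.h>

#define SHIFT 4u /* KSYNCMAP == 1 << SHIFT == 16, as in the patch */

/* 1-based index of the most significant set bit, 0 if none set. */
static unsigned int fls64_(uint64_t x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/*
 * Number of bits above the current layer at which the join node must
 * sit so that it covers both prefixes: the first differing bit between
 * them, rounded up to a multiple of SHIFT (one tree layer per SHIFT
 * bits). Mirrors the 'above' computation in the set loop.
 */
static unsigned int join_above(uint64_t prefix_a, uint64_t prefix_b)
{
	unsigned int above = fls64_(prefix_a ^ prefix_b);

	/* round_up(above, SHIFT), open-coded for a standalone build */
	return (above + SHIFT - 1) / SHIFT * SHIFT;
}
```

Equal prefixes give 0 (no join needed); a difference anywhere in the
low nibble gives one layer (4 bits), in the next nibble two layers
(8 bits), and so on.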