Re: [PATCH v9] drm/i915: Squash repeated awaits on the same fence

On Thu, Apr 27, 2017 at 05:47:32PM +0100, Tvrtko Ursulin wrote:
> 
> On 27/04/2017 12:48, Chris Wilson wrote:
> >diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> >index 5fa4e52ded06..d9f76665bc6b 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_request.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_request.c
> >@@ -772,6 +772,12 @@ i915_gem_request_await_dma_fence(struct drm_i915_gem_request *req,
> > 		if (fence->context == req->fence.context)
> > 			continue;
> >
> >+		/* Squash repeated waits to the same timelines */
> >+		if (fence->context != req->i915->mm.unordered_timeline &&
> >+		    intel_timeline_sync_is_later(req->timeline,
> >+						 fence->context, fence->seqno))
> >+			continue;
> 
> Wrong base?

I haven't moved this patch relative to the others in the series? There
are a few patches to get through before this one.

> >+struct intel_timeline_sync {
> >+	u64 prefix;
> >+	unsigned int height;
> >+	unsigned int bitmap;
> 
> u16 would be enough for the bitmap since NSYNC == 16? To no benefit
> though. Maybe just add a BUILD_BUG_ON(sizeof(p->bitmap) *
> BITS_PER_BYTE >= NSYNC) somewhere?

Indeed compacting these bits has no impact on the allocation size, so I
went with natural sizes. But I didn't check whether the compiler prefers u16.
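
The check itself would look something like this (with the comparison
flipped: BUILD_BUG_ON() breaks the build when its condition is true, so
we want it to trip only if the bitmap is too narrow for NSYNC slots):

	BUILD_BUG_ON(sizeof(p->bitmap) * BITS_PER_BYTE < NSYNC);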
 
> >+	struct intel_timeline_sync *parent;
> >+	/* union {
> >+	 *	u32 seqno;
> >+	 *	struct intel_timeline_sync *child;
> >+	 * } slot[NSYNC];
> >+	 */
> 
> Put a note saying this comment describes what follows after struct
> intel_timeline_sync.
> 
> Would "union { ... } slot[0];" work as a maker and have any benefit
> to the readability of the code below?
> 
> You could save some bytes (64 I think) for the leaf nodes if you did
> something like:

Hmm, where's the saving?

leaves are sizeof(*p) + NSYNC*sizeof(seqno) -> kmalloc-128 slab
branches are sizeof(*p) + NSYNC*sizeof(p) -> kmalloc-256 slab

> 	union {
> 		u32 seqno[NSYNC];
> 		struct intel_timeline_sync *child[NSYNC];
> 	};
> 
> Although I think it conflicts with the slot marker idea. Hm, no
> actually it doesn't. You could have both union members as simply
> markers.
> 
> 	union {
> 		u32 seqno[];
> 		struct intel_timeline_sync *child[];
> 	};
> 
> Again, not sure yet if it would make that much better readability.

Tried it; gcc doesn't like unions of variable-length arrays. Hence
resorting to manually packing the arrays after the struct.
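
As a sketch of what the accessors could look like (reconstructed from
the __sync_seqno()/__sync_child() helpers quoted elsewhere, not the
literal patch text): the array lives immediately after the struct, and
height selects which interpretation of it is valid.

	static u32 *__sync_seqno(struct intel_timeline_sync *p)
	{
		/* leaf (height == 0): u32 seqno[NSYNC] follows the struct */
		return (u32 *)(p + 1);
	}

	static struct intel_timeline_sync **
	__sync_child(struct intel_timeline_sync *p)
	{
		/* branch (height > 0): NSYNC child pointers follow */
		return (struct intel_timeline_sync **)(p + 1);
	}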

> >+static void __sync_free(struct intel_timeline_sync *p)
> >+{
> >+	if (p->height) {
> >+		unsigned int i;
> >+
> >+		while ((i = ffs(p->bitmap))) {
> >+			p->bitmap &= ~0u << i;
> >+			__sync_free(__sync_child(p)[i - 1]);
> 
> Maximum height is 64 for this tree so here there is no danger of
> stack overflow?

Maximum recursion depth is 64 / SHIFT (4) = 16. Stack usage is small,
only a few registers to push/pop, so I didn't feel any danger in
allowing recursion.

The while() loop was chosen as that avoided a stack variable.
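
Annotated, the idiom is (same loop as above, comments added for
clarity):

	while ((i = ffs(p->bitmap))) {	/* 1-based index of lowest set bit */
		p->bitmap &= ~0u << i;	/* clear bits [0, i), i.e. that bit */
		__sync_free(__sync_child(p)[i - 1]);
	}

with ffs() returning 0 to terminate the loop once the bitmap is empty.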

> >+	/* First climb the tree back to a parent branch */
> >+	do {
> >+		p = p->parent;
> >+		if (!p)
> >+			return false;
> >+
> >+		if ((id >> p->height >> SHIFT) == p->prefix)
> 
> Worth having "id >> p->height >> SHIFT" as a macro for better readability?

Yeah, this is the main issue with the code, so many shifts.
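
Something like Tvrtko's suggestion would indeed read better, e.g. (a
hypothetical helper, not in the patch):

	/* the prefix an id would have at node p's height */
	#define __sync_prefix(p, id) ((id) >> (p)->height >> SHIFT)

	if (__sync_prefix(p, id) == p->prefix)
		break;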

> >+			break;
> >+	} while (1);
> >+
> >+	/* And then descend again until we find our leaf */
> >+	do {
> >+		if (!p->height)
> >+			break;
> >+
> >+		p = __sync_child(p)[__sync_idx(p, id)];
> >+		if (!p)
> >+			return false;
> >+
> >+		if ((id >> p->height >> SHIFT) != p->prefix)
> >+			return false;
> 
> Is this possible, or should it be a GEM_BUG_ON? Maybe I am not
> understanding it, but I thought it would mean the __sync_child slot
> had an unexpected prefix in it?

The tree may skip levels, so on descent a child's prefix can
legitimately fail to match the id; that just means the id is not in the
tree, not that the tree is corrupt.
 
> >+	} while (1);
> >+
> >+	tl->sync = p;
> >+found:
> >+	idx = id & MASK;
> >+	if (!(p->bitmap & BIT(idx)))
> >+		return false;
> >+
> >+	return i915_seqno_passed(__sync_seqno(p)[idx], seqno);
> >+}
> >+
> >+static noinline int
> >+__intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> >+{
> >+	struct intel_timeline_sync *p = tl->sync;
> >+	unsigned int idx;
> >+
> >+	if (!p) {
> >+		p = kzalloc(sizeof(*p) + NSYNC * sizeof(seqno), GFP_KERNEL);
> >+		if (unlikely(!p))
> >+			return -ENOMEM;
> >+
> >+		p->prefix = id >> SHIFT;
> >+		goto found;
> >+	}
> >+
> >+	/* Climb back up the tree until we find a common prefix */
> >+	do {
> >+		if (!p->parent)
> >+			break;
> >+
> >+		p = p->parent;
> >+
> >+		if ((id >> p->height >> SHIFT) == p->prefix)
> >+			break;
> >+	} while (1);
> 
> __climb_back_to_prefix(p, id) as a helper since it is used in the
> lookup as well?

The two climbers were subtly different: the lookup bails out entirely
when it runs off the top of the tree, whereas the setter stops at the
root so that it can insert a new join above it. :(

> >+	/* No shortcut, we have to descend the tree to find the right layer
> >+	 * containing this fence.
> >+	 *
> >+	 * Each layer in the tree holds 16 (NSYNC) pointers, either fences
> >+	 * or lower layers. Leaf nodes (height = 0) contain the fences, all
> >+	 * other nodes (height > 0) are internal layers that point to a lower
> >+	 * node. Each internal layer has at least 2 descendants.
> >+	 *
> >+	 * Starting at the top, we check whether the current prefix matches. If
> >+	 * it doesn't, we have gone past our layer and need to insert a join
> >+	 * into the tree, and a new leaf node as a descendent as well as the
> >+	 * original layer.
> >+	 *
> >+	 * The matching prefix means we are still following the right branch
> >+	 * of the tree. If it has height 0, we have found our leaf and just
> >+	 * need to replace the fence slot with ourselves. If the height is
> >+	 * not zero, our slot contains the next layer in the tree (unless
> >+	 * it is empty, in which case we can add ourselves as a new leaf).
> >+	 * As we descend the tree, the prefix grows (and the height decreases).
> >+	 */
> >+	do {
> >+		struct intel_timeline_sync *next;
> >+
> >+		if ((id >> p->height >> SHIFT) != p->prefix) {
> >+			/* insert a join above the current layer */
> >+			next = kzalloc(sizeof(*next) + NSYNC * sizeof(next),
> >+				       GFP_KERNEL);
> >+			if (unlikely(!next))
> >+				return -ENOMEM;
> >+
> >+			next->height = ALIGN(fls64((id >> p->height >> SHIFT) ^ p->prefix),
> >+					    SHIFT) + p->height;
> 
> Got lost here - what's xor-ing accomplishing here? What is height
> then, not just depth relative to the bottom of the tree?

It's working out the height at which these two prefixes differ. That's
the height immediately above which we need to insert the join.
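
A worked example with made-up values, assuming SHIFT == 4: say p is a
leaf (height 0) with prefix 0x7, and the new id has
(id >> 0 >> SHIFT) == 0x5. Then:

	0x7 ^ 0x5	= 0x2	/* the prefixes differ at bit 1 */
	fls64(0x2)	= 2	/* 1-based position of that bit */
	ALIGN(2, SHIFT)	= 4	/* rounded up to a whole level */
	next->height	= 4 + 0	/* join sits one level above the leaf */

The two old prefixes then fall into different slots of the join
(0x7 & MASK vs 0x5 & MASK), while agreeing on everything above it
(0x7 >> 4 == 0x5 >> 4 == 0).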

> >+int intel_timeline_sync_set(struct intel_timeline *tl, u64 id, u32 seqno)
> >+{
> >+	struct intel_timeline_sync *p = tl->sync;
> >+
> >+	/* We expect to be called in sequence following a _get(id), which
> >+	 * should have preloaded the tl->sync hint for us.
> >+	 */
> >+	if (likely(p && (id >> SHIFT) == p->prefix)) {
> >+		unsigned int idx = id & MASK;
> >+
> >+		__sync_seqno(p)[idx] = seqno;
> >+		p->bitmap |= BIT(idx);
> >+		return 0;
> >+	}
> >+
> >+	return __intel_timeline_sync_set(tl, id, seqno);
> 
> Could pass in p and set tl->sync = p at this level. That would
> decouple the algorithm from the timeline better. With equivalent
> treatment for the query, and renaming of struct intel_timeline_sync,
> algorithm would be ready for moving out of drm/i915/ :)

I really did want to keep this as a tail call to keep the fast path neat
and tidy with minimal stack manipulation.
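
For reference, the expected pairing looks roughly like this (condensed
from the await hunk at the top of this mail; the unordered_timeline
check and error handling are elided):

	if (intel_timeline_sync_is_later(tl, fence->context, fence->seqno))
		continue;	/* already ordered after this fence */

	/* ... otherwise actually wait on the fence ... */

	intel_timeline_sync_set(tl, fence->context, fence->seqno);

so the _is_later() lookup has usually just primed tl->sync with the
right leaf for the following _set(), which is the fast path above.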

> >diff --git a/drivers/gpu/drm/i915/i915_gem_timeline.h b/drivers/gpu/drm/i915/i915_gem_timeline.h
> >index 6c53e14cab2a..e16a62bc21e6 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_timeline.h
> >+++ b/drivers/gpu/drm/i915/i915_gem_timeline.h
> >@@ -26,10 +26,13 @@
> > #define I915_GEM_TIMELINE_H
> >
> > #include <linux/list.h>
> >+#include <linux/radix-tree.h>
> 
> What is used from it?

Stray. I started with common idr/radixtree, then plonked in the u64 idr
I had for reservation_object.

> >+static int igt_sync(void *arg)
> >+{
> >+	struct drm_i915_private *i915 = arg;
> >+	const struct {
> >+		const char *name;
> >+		u32 seqno;
> >+		bool expected;
> >+		bool set;
> >+	} pass[] = {
> >+		{ "unset", 0, false, false },
> >+		{ "new", 0, false, true },
> >+		{ "0a", 0, true, true },
> >+		{ "1a", 1, false, true },
> >+		{ "1b", 1, true, true },
> >+		{ "0b", 0, true, false },
> >+		{ "2a", 2, false, true },
> >+		{ "4", 4, false, true },
> >+		{ "INT_MAX", INT_MAX, false, true },
> >+		{ "INT_MAX-1", INT_MAX-1, true, false },
> >+		{ "INT_MAX+1", (u32)INT_MAX+1, false, true },
> >+		{ "INT_MAX", INT_MAX, true, false },
> >+		{ "UINT_MAX", UINT_MAX, false, true },
> >+		{ "wrap", 0, false, true },
> >+		{ "unwrap", UINT_MAX, true, false },
> >+		{},
> >+	}, *p;
> >+	struct i915_gem_timeline *timeline;
> >+	struct intel_timeline *tl;
> >+	int order, offset;
> >+	int ret;
> >+
> >+	timeline = mock_timeline(i915);
> 
> Hey-ho.. when I suggested mock_timeline I did not realize we need
> the mock_device anyway. :( For struct_mutex I guess? Oh well, that
> was a useless suggestion.. :(

It's only used for an assertion, which is annoying. I have been
contemplating making the mock_timeline be an intel_timeline, and then we
can remove the mock_gem_device.

> >+	tl = &timeline->engine[RCS];
> >+	for (p = pass; p->name; p++) {
> >+		for (order = 1; order < 64; order++) {
> >+			for (offset = -1; offset <= (order > 1); offset++) {
> >+				u64 ctx = BIT_ULL(order) + offset;
> >+
> >+				if (intel_timeline_sync_is_later
> >+				    (tl, ctx, p->seqno) != p->expected) {
> 
> Unusual formatting.

I prefer it over having arguments on each line that still go over
80col. Yes, it probably means I should break the loops into functions.

> >+static int bench_sync(void *arg)
> >+{
> >+	struct drm_i915_private *i915 = arg;
> >+	struct rnd_state prng;
> >+	struct i915_gem_timeline *timeline;
> >+	struct intel_timeline *tl;
> >+	unsigned long end_time, count;
> >+	ktime_t kt;
> >+	int ret;
> >+
> >+	timeline = mock_timeline(i915);
> >+	if (!timeline)
> >+		return -ENOMEM;
> >+
> >+	prandom_seed_state(&prng, i915_selftest.random_seed);
> >+	tl = &timeline->engine[RCS];
> >+
> >+	count = 0;
> >+	kt = -ktime_get();
> >+	end_time = jiffies + HZ/10;
> >+	do {
> >+		u64 id = prandom_u64_state(&prng);
> >+
> >+		intel_timeline_sync_set(tl, id, 0);
> >+		count++;
> >+	} while (!time_after(jiffies, end_time));
> >+	kt = ktime_add(ktime_get(), kt);
> 
> Why not ktime_sub? I don't know if ktime_t is signed or not.

It's s64, but I was just using a pattern I'm familiar with.
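
The two spellings are equivalent, since ktime_t is a signed 64-bit
nanosecond count:

	kt = -ktime_get();
	/* ... */
	kt = ktime_add(ktime_get(), kt);

gives the same result as

	ktime_t start = ktime_get();
	/* ... */
	kt = ktime_sub(ktime_get(), start);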

> >+
> >+	pr_info("%s: %lu random insertions, %lluns/insert\n",
> >+		__func__, count, (long long)div64_ul(ktime_to_ns(kt), count));
> >+
> >+	prandom_seed_state(&prng, i915_selftest.random_seed);
> >+
> >+	end_time = count;
> >+	kt = -ktime_get();
> >+	while (end_time--) {
> 
> This is a new pattern for me - why not simply go by time in every
> test? You have to be sure lookups are not a bazillion times faster
> than insertions like this.

It's because I wanted to measure lookups of known ids. After inserting
N ids, we reset the prng and then time how long it takes to look up the
same N.

I don't particularly want to do (insert+lookup) as that is favoured by
the caching, and we measure it separately later.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



