On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
> 
> On 26/04/2017 12:18, Chris Wilson wrote:
> >On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
> >>>>>+/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
> >>>>>+ * context id to the last u32 fence seqno waited upon from that context.
> >>>>>+ * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
> >>>>>+ * the root. This allows us to access the whole tree via a single pointer
> >>>>>+ * to the most recently used layer. We expect fence contexts to be dense
> >>>>>+ * and most reuse to be on the same i915_gem_context but on neighbouring
> >>>>>+ * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
> >>>>>+ * effective lookup cache. If the new lookup is not on the same leaf, we
> >>>>>+ * expect it to be on the neighbouring branch.
> >>>>>+ *
> >>>>>+ * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
> >>>>>+ * allows us to store whether a particular seqno is valid (i.e. allows us
> >>>>>+ * to distinguish unset from 0).
> >>>>>+ *
> >>>>>+ * A branch holds an array of layer pointers, and has height > 0, and always
> >>>>>+ * has at least 2 layers (either branches or leaves) below it.
> >>>>>+ *
> >>>>>+ */
> >>>>
> >>>>@_@ :)
> >>>>
> >>>>Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
> >>>>because of u64 keys. :( How about a hash table? It would be much
> >>>>simpler to review. :) Seriously, if it would perform close enough it
> >>>>would be a much, much simpler implementation.
> >>>
> >>>You want a resizable hashtable. rht is appallingly slow, so you want a
> >>>custom resizable ht. They are not as simple as this codewise ;)
> >>>(Plus a compressed radixtree is part of my plan for scalability
> >>>improvements for struct reservation_object.)
> >>
> >>Why resizable? I was thinking a normal one. If at any given time we
> >>have an active set of contexts, or at least lookups are, as you say
> >>below, to neighbouring contexts, that would mean we are talking
> >>about lookups to different hash buckets. And for the typical
> >>working set we would expect many collisions, so longer lists in each
> >>bucket? So maybe NUM_ENGINES * some typical load constant number of
> >>buckets would not be that bad?
> >
> >Consider a long running display server that will accumulate tens of
> >thousands of clients in its lifetime, each with their own contexts that
> >get shared by passing around fences/framebuffers. (Or, on a shorter
> >scale, any of the context stress tests in igt.) Due to the
> >non-recycling of the context ids, we can grow to very large tables -
> >but we have no knowledge of which contexts are no longer required.
> >
> >To compensate we need to occasionally prune the sync points. For a ht
> >we could just scrap it; for an idr, we could store last use and delete
> >stale leaves.
> 
> Hm, pruning yes... but you don't have pruning in this patch either,
> so that's something which needs to be addressed either way.

I know, review is great for hindsight. The iterator/remove for the
compressed idr is going to be on the uglier side of the insertion.
Ugly enough to be a separate patch.
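For reference, the layer described in the comment quoted above has
roughly the following shape. The field names and the fan-out of 16
(4 bits of the u64 context id consumed per level) are illustrative
assumptions, not necessarily what the patch itself uses:

#include <linux/types.h>

/* One layer of the compressed tree: maps u64 fence contexts to the
 * last u32 seqno waited upon from each context.
 */
struct intel_timeline_sync {
	u64 prefix;		/* bits of the context id above this layer */
	unsigned int height;	/* 0 for a leaf, > 0 for a branch */
	u16 bitmap;		/* leaf: which seqno slots hold a valid value
				 * (distinguishes "unset" from seqno 0);
				 * branch: which children are present
				 */
	struct intel_timeline_sync *parent;
				/* unlike lib/radixtree, lets us walk back to
				 * the root from the cached most-recently-used
				 * layer
				 */
	union {
		u32 seqno[16];				/* height == 0 */
		struct intel_timeline_sync *child[16];	/* height > 0 */
	};
};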
> >But first we have a question of how many buckets do we give the
> >static ht? Most processes will be sharing between 2 contexts (render
> >context, presentation context), except for a display server which may
> >have 10-100 clients - and that is possibly where eliminating repeated
> >syncs is going to be most valuable. That suggests 256 buckets for
> >every timeline (assuming just a pair of engines across shared
> >contexts). Overkill for the majority, and going to be miserable in
> >stress tests.
> >
> >Finally, what do you store in the ht? Fences are the obvious
> >candidate, but need reaping. Or you just create a ht for the request
> >and only squash repeated fences inside a single request - that doesn't
> >benefit from timeline tracking and squashing between requests (but
> >does avoid keeping fences around forever). Hence why I went with
> >tracking seqno. To avoid allocations for the ht nodes, we could create
> >an open-addressed ht with the {context, seqno} embedded in it. It
> >would be efficient, but needs online resizing and is a fair chunk of
> >new code.
> 
> I was thinking of exactly the same thing as this patch does: u64
> context id as key, u32 seqnos (wrapped in a container with
> hlist_node).

Hmm, a hashed radixtree. I did read something about a hybrid approach.

> Ok, so the key is pruning to keep the display server scenario in check.
> 
> Free the key from i915_fence_release?

Too early; it's the timeline (and the syncs along it) that is
interesting. For our own contexts we can hook into context-close, but
we still have some foreign dma-fence contexts to worry about. I was
thinking of walking all timelines from the idle_worker, and possibly
forcibly pruning across suspend.
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre
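For completeness, the fixed-size hash table considered above (u64
fence context as the key, the last seqno kept in a chained entry, and
"NUM_ENGINES * some typical load constant" buckets) might look roughly
like the sketch below. The names, the bucket count and the helpers are
illustrative only, not from any posted patch:

#include <linux/hashtable.h>
#include <linux/slab.h>

struct timeline_sync_entry {
	struct hlist_node node;
	u64 context;	/* fence context id, the key */
	u32 seqno;	/* last seqno waited upon from that context */
};

#define TIMELINE_SYNC_BITS 6	/* 64 buckets, fixed at build time */
static DEFINE_HASHTABLE(timeline_sync, TIMELINE_SYNC_BITS);

static struct timeline_sync_entry *timeline_sync_find(u64 context)
{
	struct timeline_sync_entry *e;

	hash_for_each_possible(timeline_sync, e, node, context)
		if (e->context == context)
			return e;

	return NULL;
}

/* Record that we have waited upon @seqno from @context. */
static int timeline_sync_set(u64 context, u32 seqno)
{
	struct timeline_sync_entry *e;

	e = timeline_sync_find(context);
	if (e) {
		e->seqno = seqno;
		return 0;
	}

	e = kmalloc(sizeof(*e), GFP_KERNEL);
	if (!e)
		return -ENOMEM;

	e->context = context;
	e->seqno = seqno;
	hash_add(timeline_sync, &e->node, context);
	return 0;
}

Note that this sketch only covers the lookup/insert side; pruning stale
entries is exactly the open problem the thread identifies.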