On Mon, Apr 24, 2017 at 01:44:53PM +0100, Tvrtko Ursulin wrote:
> 
> On 24/04/2017 12:07, Chris Wilson wrote:
> >On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 19/04/2017 10:41, Chris Wilson wrote:
> >>>All the requests at the same priority are executed in FIFO order. They
> >>>do not need to be stored in the rbtree themselves, as they are a simple
> >>>list within a level. If we move the requests at one priority into a list,
> >>>we can then reduce the rbtree to the set of priorities. This should keep
> >>>the height of the rbtree small, as the number of active priorities can not
> >>>exceed the number of active requests and should be typically only a few.
> >>>
> >>>Currently, we have ~2k possible different priority levels, that may
> >>>increase to allow even more fine grained selection. Allocating those in
> >>>advance seems a waste (and may be impossible), so we opt for allocating
> >>>upon first use, and freeing after its requests are depleted. To avoid
> >>>the possibility of an allocation failure causing us to lose a request,
> >>>we preallocate the default priority (0) and bump any request to that
> >>>priority if we fail to allocate it the appropriate plist. Having a
> >>>request (that is ready to run, so not leading to corruption) execute
> >>>out-of-order is better than leaking the request (and its dependency
> >>>tree) entirely.
> >>>
> >>>There should be a benefit to reducing execlists_dequeue() to principally
> >>>using a simple list (and reducing the frequency of both rbtree iteration
> >>>and balancing on erase) but for typical workloads, request coalescing
> >>>should be small enough that we don't notice any change. The main gain is
> >>>from improving PI calls to schedule, and the explicit list within a
> >>>level should make request unwinding simpler (we just need to insert at
> >>>the head of the list rather than the tail and not have to make the
> >>>rbtree search more complicated).
> >>
> >>Sounds attractive! What workloads show the benefit and how much?
> >
> >The default will show the best, since everything is priority 0 more or
> >less and so we reduce the rbtree search to a single lookup and list_add.
> >It's hard to measure the impact of the rbtree though. On the dequeue
> >side, the mmio access dominates. On the schedule side, if we have lots
> >of requests, the dfs dominates.
> >
> >I have an idea on how we might stress the rbtree in submit_request - but
> >still it requires long queues untypical of most workloads. Still tbd.
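
To put the idea in concrete terms, each level in the new tree boils down
to roughly the following (a sketch only; the field names follow the hunks
quoted below, and the struct in the actual patch may carry more state):

struct execlist_priolist {
        struct rb_node node;            /* entry in engine->execlist_queue, keyed by priority */
        int priority;                   /* most positive priority is scheduled first */
        struct list_head requests;      /* FIFO of i915_priotree.link at this priority */
};

Each i915_priotree then sits on exactly one such level via pt->link
instead of being linked into the rbtree itself.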
> >
> >>>-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >>>+static bool
> >>>+insert_request(struct intel_engine_cs *engine,
> >>>+               struct i915_priotree *pt,
> >>>+               int prio)
> >>> {
> >>>+        struct execlist_priolist *plist;
> >>>         struct rb_node **p, *rb;
> >>>         bool first = true;
> >>>
> >>>+find_plist:
> >>>         /* most positive priority is scheduled first, equal priorities fifo */
> >>>         rb = NULL;
> >>>-        p = &root->rb_node;
> >>>+        p = &engine->execlist_queue.rb_node;
> >>>         while (*p) {
> >>>-                struct i915_priotree *pos;
> >>>-
> >>>                 rb = *p;
> >>>-                pos = rb_entry(rb, typeof(*pos), node);
> >>>-                if (pt->priority > pos->priority) {
> >>>+                plist = rb_entry(rb, typeof(*plist), node);
> >>>+                if (prio > plist->priority) {
> >>>                         p = &rb->rb_left;
> >>>-                } else {
> >>>+                } else if (prio < plist->priority) {
> >>>                         p = &rb->rb_right;
> >>>                         first = false;
> >>>+                } else {
> >>>+                        list_add_tail(&pt->link, &plist->requests);
> >>>+                        return false;
> >>>                 }
> >>>         }
> >>>-        rb_link_node(&pt->node, rb, p);
> >>>-        rb_insert_color(&pt->node, root);
> >>>+
> >>>+        if (!prio) {
> >>>+                plist = &engine->default_priolist;
> >>
> >>Should be "prio == I915_PRIO_DEFAULT" (give or take).
> >>
> >>But I am not completely happy with special casing the default
> >>priority for two reasons.
> >>
> >>Firstly, userspace can opt to lower its priority and completely
> >>defeat this path.
> >>
> >>Secondly, we already have flip priority which perhaps should have
> >>its own fast path / avoid allocation as well.
> >>
> >>Those two combined make me unsure whether the optimisation is worth
> >>it. What would be the pros and cons of three steps:
> >>
> >>1. No optimisation.
> >>2. prio == default optimisation like above.
> >>3. Better system with caching of frequently used levels.
> >>
> >>Last is definitely complicated, second is not, but is the second
> >>much better than the first?
> >
> >It was not intended as an optimisation. It is for handling the
> >ENOMEM here. We cannot abort the request at such a late stage, so we
> >need somewhere to hold it. That dictated having a preallocated slot. I
> >also didn't like having to preallocate all possible levels as that seems
> >a waste, especially as I like to invent new levels and suspect that we
> >may end up using a full u32 range.
> >
> >Using it for the default priority was then to take advantage of the
> >preallocation.
> >
> >>Perhaps a simplification of 3) where we would defer the freeing of
> >>unused priority levels until the busy to idle transition? That would
> >>also drop the existence and need for special handling of
> >>engine->default_prio.
> >>
> >>>+        } else {
> >>>+                plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >>>+                /* Convert an allocation failure to a priority bump */
> >>
> >>Where is the priority bump? It looks like it can be the opposite for
> >>high prio requests below.
> >
> >Correct. Bump was the best verb I thought of.
> >
> >>I don't think it matters what happens with priorities hugely when
> >>small allocations start to go bad but would like to understand the
> >>comment.
> >>
> >>And perhaps this would be worthy of a dedicated slab cache?
> >
> >Even with a slab cache, we cannot prevent allocation failure. I don't
> >think priority levels will be frequent enough to really justify one.
> >Should be a good match for the common kmalloc-64 slab.
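
To spell out the "bump" being discussed above: on allocation failure the
slow path of insert_request amounts to roughly this (a sketch pieced
together from the commit message and the hunks quoted above, not the
exact patch):

        if (!prio) {
                plist = &engine->default_priolist;
        } else {
                plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
                if (unlikely(!plist)) {
                        /*
                         * Convert an allocation failure to a priority bump:
                         * file the request under the preallocated default
                         * (priority 0) level rather than lose it, accepting
                         * that it may now execute out of order relative to
                         * its nominal priority.
                         */
                        prio = 0;
                        goto find_plist;
                }
        }

        plist->priority = prio;
        INIT_LIST_HEAD(&plist->requests);
        rb_link_node(&plist->node, rb, p);
        rb_insert_color(&plist->node, &engine->execlist_queue);
        list_add_tail(&pt->link, &plist->requests);
        return first;

So a request whose plist cannot be allocated simply joins the priority-0
level, which is why a high priority request can end up moving down rather
than up.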
> 
> We could keep a pre-allocated entry with each engine which would
> transfer ownership with insert_request. It would have to be
> allocated at a point where we can fail, like request_alloc, but the
> downside would be starting to take the engine timeline lock in the
> request alloc path. Only to check and preallocate if needed, but
> still. And it would mean more traffic on the slab API in that path
> as well. Oh well, not very nice. Was just thinking if we can avoid
> GFP_ATOMIC and the default priority fallback. It seems like your
> solution is a better compromise.

No worries. I didn't particularly like the idea of reserving a slot with
each request either, or having a GFP_ATOMIC nestled so deep in request
submission. It is certainly possible for us to do the allocation at
request_alloc and carry it through to schedule - but still that only
covers the first call to schedule. It seems like we would have to resort
to always passing in a slot, and freeing that slot if unused.

> A couple more questions on the patch details then.
> 
> Could you implement the list handling in a more obvious way: instead
> of link.next == link.prev, use a more obvious list_empty on
> plist->requests; why __list_del_entry and not just list_del; and you
> have a list_add_tail as well which could be just list_add, since the
> list is empty at that point and _tail falsely suggests the tail is
> important.

After a few more passes, it is saner now (with just one debatable list
handling tweak from ages ago):

static inline void
__list_del_many(struct list_head *head,
                struct list_head *first)
{
        head->next = first;
        first->prev = head;
}

...

        rb = engine->execlist_first;
        GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
        while (rb) {
                struct execlist_priolist *plist =
                        rb_entry(rb, typeof(*plist), node);
                struct drm_i915_gem_request *rq, *rn;

                list_for_each_entry_safe(rq, rn, &plist->requests, priotree.link) {
                        if (!merge(rq)) {
                                /* blah */
                                __list_del_many(&plist->requests,
                                                &rq->priotree.link);
                                goto done;
                        }

                        INIT_LIST_HEAD(&rq->priotree.link);
                        rq->priotree.priority = INT_MAX;
                        ...
                        /* __i915_gem_request_submit(rq); */
                }

                rb = rb_next(rb);
                rb_erase(&plist->node, &engine->execlist_queue);
                INIT_LIST_HEAD(&plist->requests);
                if (plist->priority)
                        kfree(plist);
        }

You may notice I have a dislike of the cache misses from lists. :|
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx