On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
> 
> On 19/04/2017 10:41, Chris Wilson wrote:
> >All the requests at the same priority are executed in FIFO order. They
> >do not need to be stored in the rbtree themselves, as they are a simple
> >list within a level. If we move the requests at one priority into a list,
> >we can then reduce the rbtree to the set of priorities. This should keep
> >the height of the rbtree small, as the number of active priorities can not
> >exceed the number of active requests and should be typically only a few.
> >
> >Currently, we have ~2k possible different priority levels, that may
> >increase to allow even more fine grained selection. Allocating those in
> >advance seems a waste (and may be impossible), so we opt for allocating
> >upon first use, and freeing after its requests are depleted. To avoid
> >the possibility of an allocation failure causing us to lose a request,
> >we preallocate the default priority (0) and bump any request to that
> >priority if we fail to allocate it the appropriate plist. Having a
> >request (that is ready to run, so not leading to corruption) execute
> >out-of-order is better than leaking the request (and its dependency
> >tree) entirely.
> >
> >There should be a benefit to reducing execlists_dequeue() to principally
> >using a simple list (and reducing the frequency of both rbtree iteration
> >and balancing on erase) but for typical workloads, request coalescing
> >should be small enough that we don't notice any change. The main gain is
> >from improving PI calls to schedule, and the explicit list within a
> >level should make request unwinding simpler (we just need to insert at
> >the head of the list rather than the tail and not have to make the
> >rbtree search more complicated).
> 
> Sounds attractive! What workloads show the benefit and how much?

The default will show the best, since everything is priority 0 more or
less and so we reduce the rbtree search to a single lookup and list_add.
It's hard to measure the impact of the rbtree though. On the dequeue
side, the mmio access dominates. On the schedule side, if we have lots
of requests, the dfs dominates. I have an idea on how we might stress
the rbtree in submit_request - but still it requires long queues
untypical of most workloads. Still tbd.
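
For reference, a priority level is deliberately tiny - one rbtree node
plus the FIFO of requests at that priority. As a sketch (field names as
in the hunk below, final naming/width tbd):

	struct execlist_priolist {
		struct rb_node node;       /* in engine->execlist_queue */
		struct list_head requests; /* FIFO of i915_priotree.link */
		int priority;
	};

With everything at one priority the tree holds a single node, so
insertion degenerates to one lookup plus a list_add_tail().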
> >-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >+static bool
> >+insert_request(struct intel_engine_cs *engine,
> >+	       struct i915_priotree *pt,
> >+	       int prio)
> > {
> >+	struct execlist_priolist *plist;
> > 	struct rb_node **p, *rb;
> > 	bool first = true;
> > 
> >+find_plist:
> > 	/* most positive priority is scheduled first, equal priorities fifo */
> > 	rb = NULL;
> >-	p = &root->rb_node;
> >+	p = &engine->execlist_queue.rb_node;
> > 	while (*p) {
> >-		struct i915_priotree *pos;
> >-
> > 		rb = *p;
> >-		pos = rb_entry(rb, typeof(*pos), node);
> >-		if (pt->priority > pos->priority) {
> >+		plist = rb_entry(rb, typeof(*plist), node);
> >+		if (prio > plist->priority) {
> > 			p = &rb->rb_left;
> >-		} else {
> >+		} else if (prio < plist->priority) {
> > 			p = &rb->rb_right;
> > 			first = false;
> >+		} else {
> >+			list_add_tail(&pt->link, &plist->requests);
> >+			return false;
> > 		}
> > 	}
> >-	rb_link_node(&pt->node, rb, p);
> >-	rb_insert_color(&pt->node, root);
> >+
> >+	if (!prio) {
> >+		plist = &engine->default_priolist;
> 
> Should be "prio == I915_PRIO_DEFAULT" (give or take).
> 
> But I am not completely happy with special casing the default
> priority for two reasons.
> 
> Firstly, userspace can opt to lower its priority and completely
> defeat this path.
> 
> Secondly, we already have flip priority which perhaps should have
> its own fast path / avoid allocation as well.
> 
> Those two combined make me unsure whether the optimisation is worth
> it. What would be the pros and cons of three steps:
> 
> 1. No optimisation.
> 2. prio == default optimisation like above.
> 3. Better system with caching of frequently used levels.
> 
> Last is definitely complicated, second is not, but is the second
> much better than the first?

It was not intended as an optimisation. It is for handling the ENOMEM
here. We cannot abort the request at such a late stage, so we need
somewhere to hold it. That dictated having a preallocated slot. I also
didn't like having to preallocate all possible levels as that seems a
waste, especially as I like to invent new levels and suspect that we
may end up using a full u32 range. Using it for the default priority
was then to take advantage of the preallocation.

> Perhaps a simplification of 3) where we would defer the freeing of
> unused priority levels until the busy to idle transition? That would
> also drop the existence and need for special handling of
> engine->default_prio.
> 
> >+	} else {
> >+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >+		/* Convert an allocation failure to a priority bump */
> 
> Where is the priority bump? It looks like it can be the opposite for
> high prio requests below.

Correct. Bump was the best verb I thought of.

> I don't think it matters what happens with priorities hugely when
> small allocations start to go bad but would like to understand the
> comment.
> 
> And perhaps this would be worthy of a dedicated slab cache?

Even with a slab cache, we cannot prevent allocation failure. I don't
think priority levels will be frequent enough to really justify one.
Should be a good match for the common kmalloc-64 slab.
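
To spell the bump out, the tail of insert_request() is intended to work
roughly like this (a sketch following the hunks above, not verbatim from
the patch):

	} else {
		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
		/* Convert an allocation failure to a priority bump */
		if (unlikely(!plist)) {
			prio = 0; /* use the preallocated default level */
			goto find_plist;
		}
	}

	plist->priority = prio;
	INIT_LIST_HEAD(&plist->requests);
	rb_link_node(&plist->node, rb, p);
	rb_insert_color(&plist->node, &engine->execlist_queue);

	list_add_tail(&pt->link, &plist->requests);
	return first;

So on ENOMEM a high priority request may indeed be bumped down to the
default level - executed out of order, but not leaked.
-Chris
-- 
Chris Wilson, Intel Open Source Technology Centre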