Re: [PATCH v2] drm/i915: Execlist irq handler micro optimisations

On Fri, Feb 12, 2016 at 03:54:27PM +0000, Tvrtko Ursulin wrote:
> 
> On 12/02/16 14:42, Chris Wilson wrote:
> >On Fri, Feb 12, 2016 at 12:00:40PM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> >>
> >>Assorted changes most likely without any practical effect
> >>apart from a tiny reduction in generated code for the interrupt
> >>handler and request submission.
> >>
> >>  * Remove needless initialization.
> >>  * Improve cache locality by reorganizing code and/or using
> >>    branch hints to keep unexpected or error conditions out
> >>    of line.
> >>  * Favor busy submit path vs. empty queue.
> >>  * Less branching in hot-paths.
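
(As an aside, the branch hints referred to above are the standard
likely()/unlikely() annotations. A minimal illustration, with a
condition picked purely for the example:

        /* Keep the rare preemption case out of the hot path */
        if (unlikely(status & GEN8_CTX_STATUS_PREEMPTED))
                WARN(1, "Preemption without Lite Restore\n");

The compiler lays the annotated branch out of line, which is what
keeps the expected path dense in the instruction cache.)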
> >>
> >>v2:
> >>
> >>  * Avoid mmio reads when possible. (Chris Wilson)
> >>  * Use natural integer size for csb indices.
> >>  * Remove useless return value from execlists_update_context.
> >>  * Extract the 32-bit ppgtt PDP update so it is out of line and
> >>    shared between two callers.
> >>  * Grab forcewake across all mmio operations to ease the
> >>    load on the uncore lock and use cheaper mmio ops.
> >>
> >>Version 2 now makes the irq handling code path ~20% smaller on
> >>48-bit PPGTT hardware, and a little smaller elsewhere. Hot paths
> >>are mostly in-line now, contention on the uncore spinlock is
> >>greatly reduced, and mmio traffic is down to some extent.
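
(For reference, the ELSP write with the v2 forcewake handling looks
roughly like this; a sketch from memory rather than the literal patch,
so treat the names and details as assumptions:

        /* Take the uncore lock and forcewake once (irqs assumed
         * already off in the caller), then use the raw _FW accessors
         * which skip the per-register forcewake handling.
         */
        spin_lock(&dev_priv->uncore.lock);
        intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);

        I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[1]));
        I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[1]));
        I915_WRITE_FW(RING_ELSP(ring), upper_32_bits(desc[0]));
        /* The context is automatically loaded after this write */
        I915_WRITE_FW(RING_ELSP(ring), lower_32_bits(desc[0]));

        /* ELSP is a write-only fifo; flush the writes with a read */
        I915_READ_FW(RING_EXECLIST_STATUS_LO(ring));

        intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
        spin_unlock(&dev_priv->uncore.lock);

One forcewake get/put per submission instead of one per register
access is where the uncore lock relief comes from.)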
> >
> >Did you notice that ring->next_context_status_buffer is redundant as we
> >also have that information to hand in status_pointer?
> 
> I didn't, and I don't know that part that well. There might also be
> some future-proofing issues around it.

Unlikely :-p
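
To illustrate: both CSB pointers can be re-read from the status
pointer register itself, along the lines of (sketch only; the bit
layout here is from memory, so double check against bspec):

        u32 ptr = I915_READ_FW(RING_CONTEXT_STATUS_PTR(ring));
        u32 write_pointer = ptr & GEN8_CSB_PTR_MASK;        /* bits 2:0 */
        u32 read_pointer = (ptr >> 8) & GEN8_CSB_PTR_MASK;  /* bits 10:8 */

so there should be no need to mirror the read pointer in
ring->next_context_status_buffer.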

> >What's your thinking for
> >
> >	if (req->elsp_submitted & ring->gen8_9)
> >
> >vs a plain
> >
> >	if (req->elsp_submitted)
> >?
> 
> Again, I don't know this part that well. Isn't it useful to avoid
> submitting the two noops when they are not needed? Do they still end
> up being submitted to the GPU somehow?

The command streamer always has to execute them, since they lie between
the last dispatched TAIL and the next TAIL (in the lrc). All we do here
is tweak the request->tail value, which may or may not be used the next
time we write the ELSP (depending on whether we are submitting the same
live request again). (The next request's tail will include the noops
before its dispatch.)
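
Concretely, the fragment under discussion is roughly this (simplified
from the WaIdleLiteRestore handling in execlists_context_unqueue, not
the exact code):

        if (req0->elsp_submitted) {
                /*
                 * WaIdleLiteRestore: never lite restore a context
                 * with HEAD == TAIL; advance the tail past the two
                 * MI_NOOPs emitted after the request.
                 */
                req0->tail += 8;        /* two dwords of MI_NOOP */
                req0->tail &= req0->ringbuf->size - 1;
        }

The noops themselves are written into the ring at request emission
time either way; only the tail bookkeeping is conditional.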

> >The tidies look good. It would be useful to double check whether
> >gem_latency is behaving as a canary; it's a bit of a puzzle why that
> >first dispatch latency would grow.
> 
> Yes, a puzzle; no idea how or why. But "gem_latency -n 100" does not
> show this regression. I've done a hundred runs and these are the
> results:
> 
>  * Throughput up 4.04%
>  * Dispatch latency down 0.37%
>  * Consumer and producer latencies down 22.53%
>  * CPU time down 2.25%
> 
> So it all looks good.

Yup.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



