Re: [PATCH v2] drm/i915: Execlist irq handler micro optimisations

On 12/02/16 14:42, Chris Wilson wrote:
On Fri, Feb 12, 2016 at 12:00:40PM +0000, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

Assorted changes most likely without any practical effect
apart from a tiny reduction in generated code for the interrupt
handler and request submission.

  * Remove needless initialization.
  * Improve cache locality by reorganizing code and/or using
    branch hints to keep unexpected or error conditions out
    of line. (See the sketch after this list.)
  * Favor busy submit path vs. empty queue.
  * Less branching in hot paths.
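
(To make the branch-hint item above concrete, here is a minimal,
self-contained sketch, not the i915 code itself. The kernel-style
likely()/unlikely() macros are thin wrappers around GCC's
__builtin_expect; the function, its arguments and the paths below
are purely illustrative.)

	/*
	 * Branch hints keep the cold (error/empty) paths out of the
	 * straight-line code so the hot submit path stays dense in
	 * the instruction cache.
	 */
	#include <stdio.h>

	#define likely(x)   __builtin_expect(!!(x), 1)
	#define unlikely(x) __builtin_expect(!!(x), 0)

	static int submit_request(int queue_empty, int error)
	{
		if (unlikely(error)) {
			/* Cold path: the compiler lays this out of line. */
			fprintf(stderr, "submission error\n");
			return -1;
		}

		if (likely(!queue_empty)) {
			/* Hot path: the busy-submit case falls through. */
			return 0;
		}

		/* Rare case: nothing queued. */
		return 1;
	}

	int main(void)
	{
		return submit_request(0, 0);
	}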

v2:

  * Avoid mmio reads when possible. (Chris Wilson)
  * Use natural integer size for csb indices.
  * Remove useless return value from execlists_update_context.
  * Extract 32-bit ppgtt PDPs update so it is out of line and
    shared with two callers.
  * Grab forcewake across all mmio operations to ease the
    load on uncore lock and use cheaper mmio ops. (See the
    sketch after this list.)
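
(A sketch of the forcewake item above: the idea, which I believe
mirrors what the intel_uncore_forcewake_get()/put() helpers and the
raw *_FW register accessors are for, is to pay the forcewake and
uncore-lock cost once around the whole ELSP write sequence and use
cheap raw accesses inside it. Everything below is a stubbed,
compile-and-run illustration, not the real driver API.)

	/*
	 * Stubbed illustration: take the expensive forcewake reference
	 * once per submission and use raw accessors for the individual
	 * register writes, instead of letting every mmio helper take
	 * the lock and a forcewake reference internally.
	 */
	#include <stdint.h>
	#include <stdio.h>

	static unsigned int forcewake_refcount;	/* stand-in for uncore state */
	static uint32_t fake_regs[16];		/* stand-in for the mmio BAR */

	static void forcewake_get(void) { forcewake_refcount++; }
	static void forcewake_put(void) { forcewake_refcount--; }

	/* Raw accessors: assume forcewake and the lock are already held. */
	static void write_fw(unsigned int reg, uint32_t val) { fake_regs[reg] = val; }
	static uint32_t read_fw(unsigned int reg) { return fake_regs[reg]; }

	static void elsp_submit(uint64_t desc0, uint64_t desc1)
	{
		forcewake_get();			/* one get/put per submission... */

		write_fw(0, (uint32_t)(desc1 >> 32));	/* ...and several cheap raw */
		write_fw(0, (uint32_t)desc1);		/* accesses inside, rather  */
		write_fw(0, (uint32_t)(desc0 >> 32));	/* than full-fat mmio ops   */
		write_fw(0, (uint32_t)desc0);		/* each taking the lock.    */
		(void)read_fw(1);			/* posting read, also raw   */

		forcewake_put();
	}

	int main(void)
	{
		elsp_submit(0x1234, 0x5678);
		printf("forcewake refcount after submit: %u\n", forcewake_refcount);
		return 0;
	}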

Version 2 now makes the irq handling code path ~20% smaller on
48-bit PPGTT hardware, and a little less elsewhere. Hot paths are
mostly inline now, hammering on the uncore spinlock is greatly
reduced, and mmio traffic is down as well.

Did you notice that ring->next_context_status_buffer is redundant as we
also have that information to hand in status_pointer?

I didn't, and I don't know that part that well. There might be some future-proofing issues around it as well.
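
(For context, a rough, self-contained sketch of the redundancy being
pointed at; the bit layout, macro names and values below are made up
for illustration, not the documented register format. Once the irq
handler has read the status pointer register it already has both the
read and write pointers in hand, so a separately tracked
next_context_status_buffer index adds nothing.)

	/*
	 * Illustrative only: derive the next CSB entry to process from
	 * the status pointer value read in the irq handler, instead of
	 * keeping a separate software copy of the index.
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define CSB_ENTRIES		6			/* CSB ring size on gen8 */
	#define CSB_READ_PTR(p)		(((p) >> 8) & 0x7)	/* made-up layout */
	#define CSB_WRITE_PTR(p)	(((p) >> 0) & 0x7)	/* made-up layout */

	static void process_csb(uint32_t status_pointer)
	{
		/* Hardware reports pointers that are always < CSB_ENTRIES. */
		unsigned int read = CSB_READ_PTR(status_pointer);
		unsigned int write = CSB_WRITE_PTR(status_pointer);

		/* Walk the entries the hardware produced since the last irq. */
		while (read != write) {
			read = (read + 1) % CSB_ENTRIES;
			printf("process CSB entry %u\n", read);
		}
	}

	int main(void)
	{
		/* Pretend the register reported read=1, write=4. */
		process_csb((1u << 8) | 4u);
		return 0;
	}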

What's your thinking for

	if (req->elsp_submitted & ring->gen8_9)

vs a plain

	if (req->elsp_submitted)
?

That's another part I don't know that well. Isn't it useful to avoid submitting the two noops when they are not needed? Or do they still end up being submitted to the GPU somehow?

The tidies look good. It would be useful to double-check whether
gem_latency is behaving as a canary; it's a bit of a puzzle why that
first dispatch latency would grow.

Yes, a puzzle; no idea how or why. But "gem_latency -n 100" does not show this regression. I've done a hundred runs and these are the results:

 * Throughput up 4.04%
 * Dispatch latency down 0.37%
 * Consumer and producer latencies down 22.53%
 * CPU time down 2.25%

So it all looks good.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



