Re: [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible

Dave Gordon <david.s.gordon@xxxxxxxxx> · Tue, 26 Apr 2016 14:36:21 +0100

On 26/04/16 11:35, Chris Wilson wrote:
On Tue, Apr 26, 2016 at 10:52:41AM +0100, Dave Gordon wrote:
On 26/04/16 09:49, Dave Gordon wrote:
On 25/04/16 11:39, Chris Wilson wrote:
On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
On 22/04/16 19:45, Chris Wilson wrote:

[snip]

And what exactly is that atomic64_cmpxchg() serialising with? There are
no other CPUs contending with the write, and neither does the GuC
(and I
doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
a single WRITE_ONCE() of a 32bit value wins the perf prize for hotest
instruction and function in the kernel.

The doorbell controller hardware, I should think. The BSpec
describes using LOCK_CMPXCHG8B to update doorbells, so I think this
code is just based on what it says there. If the CPU hardware
doesn't implement it efficiently, surely the GPU h/w designers
wouldn't have mandated it in this way?

Wow, I'm surprised that they would put into the same domain. Still,
unless you are actually serialising with another writer, what is the
point of using lock cmpxchg? E.g. an xchg would be enough to enforce
ordering, and you should ask them again if this is not a little overkill
for one-way signaling.
-Chris

As for performance, while LOCK_CMPXCHG8B might be an expensive
instruction, we're only executing ONE per request. I suspect that
the cumulative cost of all the extra memory accesses caused by extra
indirections and poor structure layout cost far more than any single
instruction ever can.

Top things in this area might be:

* all the macros taking "dev" instead of "dev_priv"
* pointer dances in general (a->b->c.d->e) where we could add a
shortcut pointer from a to c (or c.d), or from a or b to e.
* way too much repetition of a->b->c, a->b->d, a->b->e in the same
function, which the compiler *may* optimise, but probably won't if
there are any function calls around. Adding a local for a->b will
almost certainly help, or at least incur no penalty and be easier to
read.
* awkwardly sized or misaligned structure members, and bitfield
bools rather than 1-byte flags

So let's nibble away at these before we worry about the cost of a
single x86 instruction!

You can either assume that I applied the patches I sent to the ml for
the above points, or just look at the delta between execlists and guc
and worry about the regressions.
-Chris

With the latest version of this patchset (not yet posted), I see GuC 
submission just slightly *faster* than execlists on the render ring :)
However it's still slower on the others, where the minimum execlist time 
is less.

Just running noops, execlist submission on render shows a downward trend 
as more noop batches are submitted, flattening to a limit of just over 
3us/batch. GuC mode shows the same trend, and bottoms out at just UNDER 
3us/batch. The crossover seems to be 512-1024 batches/cycle, where the 
reduced interrupt overhead outweighs the added latency.

It probably helps that we're no longer mapping & unmapping the doorbell 
page on each request!

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx