On Tue, Apr 26, 2016 at 10:52:41AM +0100, Dave Gordon wrote: > On 26/04/16 09:49, Dave Gordon wrote: > >On 25/04/16 11:39, Chris Wilson wrote: > >>On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote: > >>>On 22/04/16 19:45, Chris Wilson wrote: > > [snip] > > >>>>And what exactly is that atomic64_cmpxchg() serialising with? There are > >>>>no other CPUs contending with the write, and neither does the GuC > >>>>(and I > >>>>doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where > >>>>a single WRITE_ONCE() of a 32bit value wins the perf prize for hotest > >>>>instruction and function in the kernel. > >>> > >>>The doorbell controller hardware, I should think. The BSpec > >>>describes using LOCK_CMPXCHG8B to update doorbells, so I think this > >>>code is just based on what it says there. If the CPU hardware > >>>doesn't implement it efficiently, surely the GPU h/w designers > >>>wouldn't have mandated it in this way? > >> > >>Wow, I'm surprised that they would put into the same domain. Still, > >>unless you are actually serialising with another writer, what is the > >>point of using lock cmpxchg? E.g. an xchg would be enough to enforce > >>ordering, and you should ask them again if this is not a little overkill > >>for one-way signaling. > >>-Chris > > As for performance, while LOCK_CMPXCHG8B might be an expensive > instruction, we're only executing ONE per request. I suspect that > the cumulative cost of all the extra memory accesses caused by extra > indirections and poor structure layout cost far more than any single > instruction ever can. > > Top things in this area might be: > > * all the macros taking "dev" instead of "dev_priv" > * pointer dances in general (a->b->c.d->e) where we could add a > shortcut pointer from a to c (or c.d), or from a or b to e. > * way too much repetition of a->b->c, a->b->d, a->b->e in the same > function, which the compiler *may* optimise, but probably won't if > there are any function calls around. Adding a local for a->b will > almost certainly help, or at least incur no penalty and be easier to > read. > * awkwardly sized or misaligned structure members, and bitfield > bools rather than 1-byte flags > > So let's nibble away at these before we worry about the cost of a > single x86 instruction! You can either assume that I applied the patches I sent to the ml for the above points, or just look at the delta between execlists and guc and worry about the regressions. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx