On Mon, Mar 23, 2015 at 09:31:38AM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > > Window, but rendering internally at 1080p] clients we can run in
> > > parallel whilst hitting 60fps. And then whether allowing multiple
> > > spinners helps or hinders.
> >
> > I was thinking of a nice easy test that could demonstrate any advantage
> > for spinning over waiting, and realised we already had such an igt. The
> > trick is that it has to generate sufficient GPU load to actually require
> > a wait, but not so high a GPU load that we see the impact from slow
> > completion.
> >
> > I present igt/gem_exec_blt (modified to repeat the measurement and do an
> > average over several runs):
> >
> > Time to blt 16384 bytes x    1: 21.000µs -> 5.800µs
> > Time to blt 16384 bytes x    2: 11.500µs -> 4.500µs
> > Time to blt 16384 bytes x    4:  6.750µs -> 3.750µs
> > Time to blt 16384 bytes x    8:  4.950µs -> 3.375µs
> > Time to blt 16384 bytes x   16:  3.825µs -> 3.175µs
> > Time to blt 16384 bytes x   32:  3.356µs -> 3.000µs
> > Time to blt 16384 bytes x   64:  3.259µs -> 2.909µs
> > Time to blt 16384 bytes x  128:  3.083µs -> 3.095µs
> > Time to blt 16384 bytes x  256:  3.104µs -> 2.979µs
> > Time to blt 16384 bytes x  512:  3.080µs -> 3.089µs
> > Time to blt 16384 bytes x 1024:  3.077µs -> 3.040µs
> > Time to blt 16384 bytes x 2048:  3.127µs -> 3.304µs
> > Time to blt 16384 bytes x 4096:  3.279µs -> 3.265µs
>
> We probably need to revisit this when the scheduler lands - that one will
> want to keep a short queue and generally will block for some request to
> complete.

Speaking of which, execlists! You may have noticed that I surreptitiously
chose hsw to avoid the execlists overhead... I was messing around over the
weekend looking at the submission overhead on bdw-u:

           -nightly      +spin       +hax   execlists=0
x1:        23.600µs   18.400µs   15.200µs       6.800µs
x2:        19.700µs   16.500µs   15.900µs       5.000µs
x4:        15.600µs   12.250µs   12.500µs       4.450µs
x8:        13.575µs   11.000µs   11.650µs       4.050µs
x16:       10.812µs    9.738µs    9.875µs       3.900µs
x32:        9.281µs    8.613µs    9.406µs       3.750µs
x64:        8.088µs    7.988µs    8.806µs       3.703µs
x128:       7.683µs    7.838µs    8.617µs       3.647µs
x256:       9.481µs    7.301µs    8.091µs       3.409µs
x512:       5.579µs    5.992µs    6.177µs       3.561µs
x1024:     10.093µs    3.963µs    4.187µs       3.531µs
x2048:     11.497µs    3.794µs    3.873µs       3.477µs
x4096:      8.926µs    5.269µs    3.813µs       3.461µs

The hax are to remove the extra atomic ops and spinlocks imposed by
execlists. Steady state seems to be roughly on a par, with the difference
appearing to be interrupt latency + extra register writes.

What's interesting is the latency the ELSP submission mechanism adds when
submitting to an idle GPU - a hard floor for us. It may even be worth
papering over it by starting execlists submission from a tasklet.

I do feel this sort of information is missing from the execlists merge...
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
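For context on how the numbers were gathered: the gem_exec_blt modification
is nothing more than a repeat-and-average loop around the existing
submit-and-sync step. A minimal userspace sketch of that shape follows; it
is not the actual igt code, and fake_submit_blts()/fake_wait_idle() and the
REPEATS count are invented stand-ins for what gem_exec_blt already does.

/* Sketch of "repeat the measurement and average over several runs".
 * The two fake_* declarations are hypothetical placeholders, not igt API.
 */
#include <stdio.h>
#include <time.h>

#define REPEATS 100	/* assumption: enough runs to smooth out noise */

void fake_submit_blts(int fd, int count);	/* hypothetical: queue 'count' 16k blts */
void fake_wait_idle(int fd);			/* hypothetical: wait for the last blt */

static double elapsed_us(const struct timespec *a, const struct timespec *b)
{
	return 1e6 * (b->tv_sec - a->tv_sec) + 1e-3 * (b->tv_nsec - a->tv_nsec);
}

static void measure(int fd, int count)
{
	struct timespec start, end;
	double total = 0;

	for (int n = 0; n < REPEATS; n++) {
		clock_gettime(CLOCK_MONOTONIC, &start);
		fake_submit_blts(fd, count);	/* back-to-back batches */
		fake_wait_idle(fd);		/* the wait (or spin) being measured */
		clock_gettime(CLOCK_MONOTONIC, &end);
		total += elapsed_us(&start, &end);
	}

	/* report the per-blt cost, averaged over all repeats */
	printf("Time to blt 16384 bytes x %4d: %7.3fµs\n",
	       count, total / REPEATS / count);
}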
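The spin-vs-wait comparison comes down to optimistically busy-waiting on the
request's breadcrumb for a few microseconds before falling back to the
interrupt-driven sleep. In outline it looks something like the sketch below;
the fake_* names, the struct layout and the 5µs budget are assumptions for
illustration, not the real i915 wait-request path.

/* Minimal sketch of spin-before-sleep. */
#include <linux/ktime.h>
#include <linux/sched.h>
#include <linux/types.h>
#include <linux/wait.h>

#define FAKE_SPIN_TIMEOUT_US 5	/* assumption: roughly the cost of one short blt */

struct fake_request {
	u32 *hwsp;			/* status-page slot the GPU writes */
	u32 seqno;			/* value written when the request completes */
	wait_queue_head_t wait;		/* woken from the user-interrupt handler */
};

static bool fake_request_completed(const struct fake_request *rq)
{
	return READ_ONCE(*rq->hwsp) >= rq->seqno; /* wraparound ignored for the sketch */
}

static int fake_wait_request(struct fake_request *rq)
{
	ktime_t timeout = ktime_add_us(ktime_get_raw(), FAKE_SPIN_TIMEOUT_US);

	/*
	 * Busy-wait first: a short blt can complete in less time than it
	 * takes to arm the interrupt and be woken, which is where the
	 * improvement at low batch counts comes from.
	 */
	do {
		if (fake_request_completed(rq))
			return 0;
		cpu_relax();
	} while (ktime_before(ktime_get_raw(), timeout));

	/* Otherwise fall back to the usual interrupt-driven sleep. */
	return wait_event_interruptible(rq->wait, fake_request_completed(rq));
}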
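As for starting execlists from a tasklet: the idea is only to move the ELSP
port writes out of the submission and interrupt paths, so the spinlock hold
time and the idle-GPU submission latency no longer sit directly on top of
execbuf. Roughly as below; again every fake_* name is an invented
placeholder, not a patch against the real engine structures.

/* Sketch of deferring ELSP writes to a tasklet. */
#include <linux/interrupt.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct fake_engine {
	spinlock_t lock;		/* guards the pending request list */
	struct list_head pending;	/* requests not yet written to ELSP */
	struct tasklet_struct submit;	/* does the actual port writes */
};

static void fake_write_elsp(struct fake_engine *engine)
{
	/* dequeue up to a pair of requests from engine->pending and
	 * write their descriptors to the ELSP register (MMIO) */
}

static void fake_submit_func(unsigned long data)
{
	struct fake_engine *engine = (struct fake_engine *)data;
	unsigned long flags;

	spin_lock_irqsave(&engine->lock, flags);
	if (!list_empty(&engine->pending))
		fake_write_elsp(engine);
	spin_unlock_irqrestore(&engine->lock, flags);
}

static void fake_engine_init(struct fake_engine *engine)
{
	spin_lock_init(&engine->lock);
	INIT_LIST_HEAD(&engine->pending);
	tasklet_init(&engine->submit, fake_submit_func, (unsigned long)engine);
}

/* Both the execbuf path and the context-switch interrupt then only queue
 * work and kick the tasklet, instead of writing ELSP directly.
 */
static void fake_submit_request(struct fake_engine *engine)
{
	tasklet_schedule(&engine->submit);
}

Whether deferring the writes actually hides the idle-submit floor or merely
moves it is exactly the sort of question the bdw-u table is meant to answer.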