On 18/08/16 16:27, Dave Gordon wrote:
[snip]
Note that SKL GuC firmware 6.1 didn't support dual submission or lite
restore, whereas the next version (8.11) does. Therefore, with that
firmware we don't see the same slowdown when going to 1-at-a-time
round-robin. I have a different (new) test that shows this more clearly.
This is with GuC version 6.1:
skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS
Time to exec 8-byte batch: 3.428µs (ring=render)
Time to exec 8-byte batch: 2.444µs (ring=bsd)
Time to exec 8-byte batch: 2.394µs (ring=blt)
Time to exec 8-byte batch: 2.615µs (ring=vebox)
Time to exec 8-byte batch: 2.625µs (ring=all, sequential)
Time to exec 8-byte batch: 12.701µs (ring=all, parallel/1) ***
Time to exec 8-byte batch: 7.259µs (ring=all, parallel/2)
Time to exec 8-byte batch: 4.336µs (ring=all, parallel/4)
Time to exec 8-byte batch: 2.937µs (ring=all, parallel/8)
Time to exec 8-byte batch: 2.661µs (ring=all, parallel/16)
Time to exec 8-byte batch: 2.245µs (ring=all, parallel/32)
Time to exec 8-byte batch: 1.626µs (ring=all, parallel/64)
Time to exec 8-byte batch: 2.170µs (ring=all, parallel/128)
Time to exec 8-byte batch: 1.804µs (ring=all, parallel/256)
Time to exec 8-byte batch: 2.602µs (ring=all, parallel/512)
Time to exec 8-byte batch: 2.602µs (ring=all, parallel/1024)
Time to exec 8-byte batch: 2.607µs (ring=all, parallel/2048)
Time to exec 4Kbyte batch: 14.835µs (ring=render)
Time to exec 4Kbyte batch: 11.787µs (ring=bsd)
Time to exec 4Kbyte batch: 11.533µs (ring=blt)
Time to exec 4Kbyte batch: 11.991µs (ring=vebox)
Time to exec 4Kbyte batch: 12.444µs (ring=all, sequential)
Time to exec 4Kbyte batch: 16.211µs (ring=all, parallel/1)
Time to exec 4Kbyte batch: 13.943µs (ring=all, parallel/2)
Time to exec 4Kbyte batch: 13.878µs (ring=all, parallel/4)
Time to exec 4Kbyte batch: 13.841µs (ring=all, parallel/8)
Time to exec 4Kbyte batch: 14.188µs (ring=all, parallel/16)
Time to exec 4Kbyte batch: 13.747µs (ring=all, parallel/32)
Time to exec 4Kbyte batch: 13.734µs (ring=all, parallel/64)
Time to exec 4Kbyte batch: 13.727µs (ring=all, parallel/128)
Time to exec 4Kbyte batch: 13.947µs (ring=all, parallel/256)
Time to exec 4Kbyte batch: 12.230µs (ring=all, parallel/512)
Time to exec 4Kbyte batch: 12.147µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch: 12.617µs (ring=all, parallel/2048)
What this shows is that the submission overhead is ~3µs, which is
comparable with the execution time of a trivial (8-byte) batch, but
insignificant compared with the time to execute the 4Kbyte batch. The
burst size therefore makes very little difference to the larger batches.
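One way to check that last point against the numbers above is a minimal sketch like the following. It is illustrative only and not part of gem_exec_paranop. It takes the "ring=all, parallel/N" rows for each batch size, assumes (as the text suggests) that N is the burst size, and compares the slowest and fastest burst sizes. The helper name burst_spread is hypothetical.

```python
# Per-batch execution times in microseconds for "ring=all, parallel/N",
# copied from the GuC 6.1 output above. Key = N, assumed to be the burst size.
PARALLEL_US = {
    "8-byte": {1: 12.701, 2: 7.259, 4: 4.336, 8: 2.937, 16: 2.661, 32: 2.245,
               64: 1.626, 128: 2.170, 256: 1.804, 512: 2.602, 1024: 2.602,
               2048: 2.607},
    "4Kbyte": {1: 16.211, 2: 13.943, 4: 13.878, 8: 13.841, 16: 14.188,
               32: 13.747, 64: 13.734, 128: 13.727, 256: 13.947, 512: 12.230,
               1024: 12.147, 2048: 12.617},
}


def burst_spread(times_by_burst):
    """Hypothetical helper: return (slowest burst, fastest burst, slowest/fastest)."""
    slowest = max(times_by_burst, key=times_by_burst.get)
    fastest = min(times_by_burst, key=times_by_burst.get)
    return slowest, fastest, times_by_burst[slowest] / times_by_burst[fastest]


for size, times in PARALLEL_US.items():
    slowest, fastest, ratio = burst_spread(times)
    print(f"{size}: slowest at parallel/{slowest}, "
          f"fastest at parallel/{fastest}, spread {ratio:.2f}x")
```

With the figures above, both batch sizes are slowest at parallel/1, but the spread across burst sizes is roughly 7.8x for the 8-byte batch and only about 1.3x for the 4Kbyte batch, consistent with a fixed per-submission cost mattering much less once each batch takes longer to execute.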
.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx