On 03/08/16 16:45, Chris Wilson wrote:
On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
The parallel execution test in gem_exec_nop chooses a pessimal
distribution of work to multiple engines; specifically, it
round-robins one batch to each engine in turn. As the workloads
are trivial (NOPs), this results in each engine becoming idle
between batches. Hence parallel submission is seen to take LONGER
than the same number of batches executed sequentially.
If on the other hand we send enough work to each engine to keep
it busy until the next time we add to its queue (i.e. round-robin
some larger number of batches to each engine in turn), then we can
get true parallel execution and should find that it is FASTER than
sequential execution.
By experiment, burst sizes of between 8 and 256 are sufficient to
keep multiple engines loaded, with the optimum (for this trivial
workload) being around 64. This is expected to be lower (possibly
as low as one) for more realistic (heavier) workloads.
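
For illustration, here is a minimal sketch of the two submission
orders. The submit_nop_batch() helper is hypothetical, standing in
for the test's execbuffer call; this is not the actual gem_exec_nop
code:

/* Hypothetical helper: wraps an EXECBUFFER2 ioctl submitting one
 * NOP batch to the given engine.
 */
static void submit_nop_batch(int fd, unsigned engine);

#define BURST 64	/* batches handed to one engine before moving on */

/* Pessimal: one trivial batch per engine per pass. Each engine
 * drains its single NOP long before we come back to it, so it idles.
 */
static void round_robin_single(int fd, const unsigned *engines,
			       int nengines, int total)
{
	for (int i = 0; i < total; i++)
		submit_nop_batch(fd, engines[i % nengines]);
}

/* Better: a burst of batches per engine keeps each queue deep enough
 * that the engine stays busy until its next turn in the rotation.
 */
static void round_robin_burst(int fd, const unsigned *engines,
			      int nengines, int total)
{
	for (int i = 0; i < total / BURST; i++)
		for (int j = 0; j < BURST; j++)
			submit_nop_batch(fd, engines[i % nengines]);
}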
Quite funny. The driver submission overhead of A...A vs ABAB... engines
is nearly identical, at least as far as the analysis presented here shows.
-Chris
Correct; but because the workloads are so trivial, if we hand out jobs
one at a time to each engine, the first will have finished the one batch
it's been given before we get round to giving it a second one (even in
execlist mode). If there are N engines, submitting a single batch takes
S seconds, and each workload takes W seconds to execute, then whenever
W < N*S each engine will be idle between batches: it is revisited only
every N*S seconds but has only W seconds of work. For example, if N is 4,
W is 2us, and S is 1us, then each engine will be idle some 50% of the time.
This wouldn't be an issue for more realistic workloads, where W >> S.
It only looks problematic because of the trivial nature of the work.
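
As a quick sanity check of those figures, a trivial program computing
the idle fraction implied by the argument above (the formula is
inferred from the reasoning here, not taken from the test itself):

#include <stdio.h>

int main(void)
{
	double N = 4;		/* engines in the round-robin */
	double S = 1e-6;	/* seconds to submit one batch */
	double W = 2e-6;	/* seconds one NOP batch takes to execute */

	/* Each engine is revisited every N*S seconds but has only
	 * W seconds of work, so when W < N*S it idles the remainder.
	 */
	double idle = 1.0 - W / (N * S);
	printf("idle fraction: %.0f%%\n", idle * 100.0);	/* -> 50% */
	return 0;
}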
.Dave.