On 01/09/16 21:00, Chris Wilson wrote:
On Thu, Sep 01, 2016 at 05:51:09PM +0100, Dave Gordon wrote:
The gem_exec_nop test generally works by submitting batches to an
engine as fast as possible for a fixed time, then finally calling
gem_sync() to wait for the last submitted batch to complete. The
time-per-batch is then calculated as the total elapsed time, divided
by the total number of batches submitted.
The problem with this approach as a measurement of driver overhead
or latency (or anything else) is that the amount of work involved
in submitting a batch is not a simple constant; in particular, it
depends on the state of the various queues in the execution path.
And it has the rather strange characteristic that if the GPU runs
slightly faster, the driver may go much slower!
The main reason here is the lite-restore mechanism, although it
interacts with dual-submission and the details of handling the
completion interrupt. In particular, lite-restore means that it can
be much cheaper to add a request to an engine that's already (or
still) busy with a previous request than to send a new request to an
idle engine.
For example, imagine that it takes the (test/CPU/driver) 2us to
prepare a request up to the point of submission, but another 4us to
push it into the submission port. Also assume that once started,
this batch takes 3us to execute on the GPU, and handling the
completion takes the driver another 2us of CPU time. Then the stream
of requests will produce a pattern like this:
t0: batch 1: 6us from user to h/w (idle->busy)
t0+6us: GPU now running batch 1
t0+8us: batch 2: 2us from user to queue (not submitted)
t0+9us: GPU finished; IRQ handler samples queue (batch 2)
t0+10us: batch 3: 2us from user to queue (not submitted)
t0+11us: IRQ handler submits tail of batch 2
t0+12us: batch 4: 2us from user to queue (not submitted)
t0+14us: batch 5: 2us from user to queue (not submitted)
t0+15us: GPU now running batch 2
t0+16us: batch 6: 2us from user to queue (not submitted)
t0+18us: GPU finished; IRQ handler samples queue (batch 6)
t0+18us: batch 7: 2us from user to queue (not submitted)
t0+20us: batch 8: 2us from user to queue (not submitted)
t0+20us: IRQ handler coalesces requests, submits tail of batch 6
t0+22us: batch 9: 2us from user to queue (not submitted)
t0+24us: GPU now running batches 3-6
t0+24us: batch 10: 2us from user to queue (not submitted)
t0+26us: batch 11: 2us from user to queue (not submitted)
t0+28us: batch 12: 2us from user to queue (not submitted)
t0+30us: batch 13: 2us from user to queue (not submitted)
t0+32us: batch 14: 2us from user to queue (not submitted)
t0+34us: batch 15: 2us from user to queue (not submitted)
t0+36us: GPU finished; IRQ handler samples queue (batch 15)
t0+36us: batch 16: 2us from user to queue (not submitted)
t0+38us: batch 17: 2us from user to queue (not submitted)
t0+38us: IRQ handler coalesces requests, submits tail of batch 15
t0+40us: batch 18: 2us from user to queue (not submitted)
t0+42us: batch 19: 2us from user to queue (not submitted)
t0+42us: GPU now running batches 7-15
Thus, after the first few, *all* requests will be coalesced, and
only a few of them will incur the overhead of writing to the ELSP or
handling a context-complete interrupt. With the CPU generating a new
batch every 2us and the GPU taking 3us/batch to execute them, the
queue of outstanding requests will get longer and longer until the
ringbuffer is nearly full, but the write to the ELSP will happen
ever more rarely.
When we measure the overall time for the process, we will find the
result is 3us/batch, i.e. the GPU batch execution time. The
coalescing means that all the driver *and hardware* overheads are
*completely* hidden.
Now consider what happens if the batches are generated and submitted
slightly slower, only one every 4us:
t1: batch 1: 6us from user to h/w (idle->busy)
t1+6us: GPU now running batch 1
t1+9us: GPU finished; IRQ handler samples queue (empty)
t1+10us: batch 2: 6us from user to h/w (idle->busy)
t1+16us: GPU now running batch 2
t1+19us: GPU finished; IRQ handler samples queue (empty)
t1+20us: batch 3: 6us from user to h/w (idle->busy)
etc
This hits the worst case, where *every* batch submission needs to go
through the most expensive path (and in doing so, delays the
creation of the next workload, so we will never get out of this
pattern). Our measurement will therefore show 10us/batch.
*IF* we didn't have a BKL, it would be reasonable to expect that a
suitable multi-threaded program on a CPU with more h/w threads than
GPU engines could submit batches on any set of engines in parallel,
and for each thread and engine, the execution time would be
essentially independent of which engines were running concurrently.
Unfortunately, though, that lock-free scenario is not what we have
today. The BKL means that only one thread can submit at a time (and
in any case, the test program isn't multi-threaded). Therefore, if
the test can generate and submit batches at a rate of one every 2us
(as in the first "GOOD" scenario above), but those batches are being
split across two different engines, it results in an effective
submission rate of one per 4us, and flips into the second "BAD"
scenario as a result.
The conclusion, then, is that the parallel execution part of this
test as written today isn't really measuring a meaningful quantity,
and the pass-fail criterion in particular isn't telling us anything
useful about the overhead (or latency) of various parts of the
submission path.
I've written another test variant, which explores the NO-OP
execution time as a function of both batch buffer size and the
number of consecutive submissions to the same engine before
switching to the next (burst size). Typical results look something
like this:
They already exist as well.
I expect so, but they are not being used to gate upstreaming of patches
to the submission paths. I wanted a test that would show how the
positive feedback loop in submission timing causes the driver to
abruptly flip between a best-case pattern (when workloads are generated
faster than they are completed) and a worst-case pattern (when it takes
longer to submit one batch to *each* engine sequentially than it takes
*one* engine to complete one batch).
Do please look again at what test you are complaining about.
-Chris
The one that contains this unjustified assertion:
/* The rate limiting step is how fast the slowest engine can process
 * its queue of requests; if we wait upon a full ring, all dispatch
 * is frozen. So in general we cannot go faster than the slowest
 * engine, but we should equally not go any slower.
 */
igt_assert_f(time < max + 10*min/9, /* ensure parallel execution */
             "Average time (%.3fus) exceeds expectation for parallel execution"
             " (min %.3fus, max %.3fus; limit set at %.3fus)\n",
             1e6*time, 1e6*min, 1e6*max, 1e6*(max + 10*min/9));
because as explained above, there is no reasonable expectation that
dispatching batches to multiple engines in parallel will result in more
batches being executed in the same time, and with a purely serial test
process, every expectation that the average time per batch will increase.
The rate-limiting step *would* be how fast the slowest engine could
process its queue of requests *if* the batches took long enough that
the CPU could always queue more work for every engine before the
previous workload completed; but with tiny workloads the CPU does not
keep up, and submission overhead increases because the driver must then
do *more* work to *restart* the engine if it has become idle.
(And not even mentioning how the engine may have decided, after a
certain period of idleness, to initiate a context save which must be
completed before a new command can be accepted, even if the new command
uses the same context. At least it doesn't then reload the same context,
AFAICT).
.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx