On Thu, Sep 01, 2016 at 05:51:09PM +0100, Dave Gordon wrote:
> The gem_exec_nop test generally works by submitting batches to an
> engine as fast as possible for a fixed time, then finally calling
> gem_sync() to wait for the last submitted batch to complete. The
> time-per-batch is then calculated as the total elapsed time, divided
> by the total number of batches submitted.
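
(For concreteness, that measurement amounts to a loop of roughly the
following shape -- a sketch only, not the actual IGT code;
submit_noop_batch() and sync_last_batch() are hypothetical stand-ins
for the test's real execbuf and gem_sync plumbing.)

/*
 * Sketch of the measurement described above: submit no-op batches as
 * fast as possible for a fixed time, wait for the last one, divide.
 * submit_noop_batch() and sync_last_batch() are hypothetical stand-ins
 * for the test's execbuf/gem_sync calls, not actual IGT functions.
 */
#include <stdint.h>
#include <time.h>

extern void submit_noop_batch(int fd);	/* queue one no-op batch */
extern void sync_last_batch(int fd);	/* wait for the last batch to retire */

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * UINT64_C(1000000000) + ts.tv_nsec;
}

static double nop_us_per_batch(int fd, uint64_t runtime_ns)
{
	uint64_t start = now_ns();
	unsigned long count = 0;

	do {	/* submit as fast as possible for a fixed time */
		submit_noop_batch(fd);
		count++;
	} while (now_ns() - start < runtime_ns);

	sync_last_batch(fd);	/* the final gem_sync() */

	return (now_ns() - start) / 1000.0 / count;
}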
> 
> The problem with this approach as a measurement of driver overhead,
> or latency (or anything else) is that the amount of work involved in
> submitting a batch is not a simple constant; in particular, it
> depends on the state of the various queues in the execution path.
> And it has the rather strange characteristic that if the GPU runs
> slightly faster, the driver may go much slower!
> 
> The main reason here is the lite-restore mechanism, although it
> interacts with dual-submission and the details of handling the
> completion interrupt. In particular, lite-restore means that it can
> be much cheaper to add a request to an engine that's already (or
> still) busy with a previous request than to send a new request to an
> idle engine.
> 
> For example, imagine that it takes the (test/CPU/driver) 2us to
> prepare a request up to the point of submission, but another 4us to
> push it into the submission port. Also assume that once started,
> this batch takes 3us to execute on the GPU, and handling the
> completion takes the driver another 2us of CPU time. Then the stream
> of requests will produce a pattern like this:
> 
> t0: batch 1: 6us from user to h/w (idle->busy)
> t0+6us: GPU now running batch 1
> t0+8us: batch 2: 2us from user to queue (not submitted)
> t0+9us: GPU finished; IRQ handler samples queue (batch 2)
> t0+10us: batch 3: 2us from user to queue (not submitted)
> t0+11us: IRQ handler submits tail of batch 2
> t0+12us: batch 4: 2us from user to queue (not submitted)
> t0+14us: batch 5: 2us from user to queue (not submitted)
> t0+15us: GPU now running batch 2
> t0+16us: batch 6: 2us from user to queue (not submitted)
> t0+18us: GPU finished; IRQ handler samples queue (batch 6)
> t0+18us: batch 7: 2us from user to queue (not submitted)
> t0+20us: batch 8: 2us from user to queue (not submitted)
> t0+20us: IRQ handler coalesces requests, submits tail of batch 6
> t0+20us: batch 9: 2us from user to queue (not submitted)
> t0+22us: batch 10: 2us from user to queue (not submitted)
> t0+24us: GPU now running batches 3-6
> t0+24us: batch 11: 2us from user to queue (not submitted)
> t0+26us: batch 12: 2us from user to queue (not submitted)
> t0+28us: batch 13: 2us from user to queue (not submitted)
> t0+30us: batch 14: 2us from user to queue (not submitted)
> t0+32us: batch 15: 2us from user to queue (not submitted)
> t0+34us: batch 16: 2us from user to queue (not submitted)
> t0+36us: GPU finished; IRQ handler samples queue (batch 16)
> t0+36us: batch 17: 2us from user to queue (not submitted)
> t0+38us: batch 18: 2us from user to queue (not submitted)
> t0+38us: IRQ handler coalesces requests, submits tail of batch 16
> t0+40us: batch 19: 2us from user to queue (not submitted)
> t0+42us: batch 20: 2us from user to queue (not submitted)
> t0+42us: GPU now running batches 7-16
> 
> Thus, after the first few, *all* requests will be coalesced, and
> only a few of them will incur the overhead of writing to the ELSP or
> handling a context-complete interrupt. With the CPU generating a new
> batch every 2us and the GPU taking 3us/batch to execute them, the
> queue of outstanding requests will get longer and longer until the
> ringbuffer is nearly full, but the write to the ELSP will happen
> ever more rarely.
> 
> When we measure the overall time for the process, we will find the
> result is 3us/batch, i.e. the GPU batch execution time. The
> coalescing means that all the driver *and hardware* overheads are
> *completely* hidden.
> 
> Now consider what happens if the batches are generated and submitted
> slightly slower, only one every 4us:
> 
> t1: batch 1: 6us from user to h/w (idle->busy)
> t1+6us: GPU now running batch 1
> t1+9us: GPU finished; IRQ handler samples queue (empty)
> t1+10us: batch 2: 6us from user to h/w (idle->busy)
> t1+16us: GPU now running batch 2
> t1+19us: GPU finished; IRQ handler samples queue (empty)
> t1+20us: batch 3: 6us from user to h/w (idle->busy)
> etc
> 
> This hits the worst case, where *every* batch submission needs to go
> through the most expensive path (and in doing so, delays the
> creation of the next workload, so we will never get out of this
> pattern). Our measurement will therefore show 10us/batch.
> 
> *IF* we didn't have a BKL, it would be reasonable to expect that a
> suitable multi-threaded program on a CPU with more h/w threads than
> GPU engines could submit batches on any set of engines in parallel,
> and for each thread and engine, the execution time would be
> essentially independent of which engines were running concurrently.
> 
> Unfortunately, though, that lock-free scenario is not what we have
> today. The BKL means that only one thread can submit at a time (and
> in any case, the test program isn't multi-threaded). Therefore, if
> the test can generate and submit batches at a rate of one every 2us
> (as in the first "GOOD" scenario above), but those batches are being
> split across two different engines, it results in an effective
> submission rate of one per 4us, and flips into the second "BAD"
> scenario as a result.
> 
> The conclusion, then, is that the parallel execution part of this
> test as written today isn't really measuring a meaningful quantity,
> and the pass-fail criterion in particular isn't telling us anything
> useful about the overhead (or latency) of various parts of the
> submission path.
> 
> I've written another test variant, which explores the NO-OP
> execution time as a function of both batch buffer size and the
> number of consecutive submissions to the same engine before
> switching to the next (burst size). Typical results look something
> like this:

They already exist as well. Do please look again at what test you are
complaining about.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
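
The timing model in the quoted analysis can be played out with a small
toy simulation -- illustrative only, not driver or IGT code. It
hard-codes the assumed costs from the example (2us to build a request,
4us to write the ELSP when the engine is idle, 3us of GPU execution per
no-op batch, 2us of completion handling), lets the queue grow without
bound, and takes an extra per-batch delay to stand in for the slower,
one-every-4us case:

/*
 * Toy model of the submission pattern in the example above.  Costs are
 * the example's assumptions: 2us to build a request, 4us to write the
 * ELSP when the engine is idle, 3us of GPU time per batch, 2us of
 * completion (IRQ) handling.  The queue is allowed to grow without
 * bound (a real ring would eventually fill and throttle).  "think_us"
 * is extra time spent between submissions (e.g. feeding a second
 * engine while holding the BKL): 0 for the one-every-2us case, 4 for
 * the one-every-4us case.
 */
#include <math.h>
#include <stdio.h>

#define T_PREPARE 2.0	/* us: build a request and add it to the queue */
#define T_ELSP    4.0	/* us: write the submission port               */
#define T_EXEC    3.0	/* us: GPU execution per no-op batch           */
#define T_IRQ     2.0	/* us: handle the context-complete interrupt   */

static double us_per_batch(int nbatches, double think_us)
{
	double free_at = 0;	/* submitter may start its next request     */
	double gpu_done = 0;	/* current ELSP contents drain at this time */
	double end = 0;
	int inflight = 0;	/* batches submitted to the ELSP            */
	int queued = 0;		/* batches built but not yet submitted      */
	int built = 0, retired = 0;

	while (retired < nbatches) {
		double ready = built < nbatches ? free_at + T_PREPARE
						: INFINITY;

		if (inflight && gpu_done <= ready) {
			/* Context-complete: the IRQ handler samples the
			 * queue and coalesces everything waiting into a
			 * single new ELSP write (lite-restore). */
			retired += inflight;
			end = gpu_done;
			inflight = queued;
			queued = 0;
			if (inflight)
				gpu_done += T_IRQ + T_ELSP +
					    inflight * T_EXEC;
		} else {
			/* The submitter finishes building a request. */
			built++;
			if (!inflight) {
				/* Idle engine: the submitter itself pays
				 * for the ELSP write before moving on. */
				inflight = 1;
				free_at = ready + T_ELSP;
				gpu_done = free_at + T_EXEC;
			} else {
				/* Busy engine: just queue it. */
				queued++;
				free_at = ready;
			}
			free_at += think_us;
		}
	}

	return end / nbatches;	/* what the test reports */
}

int main(void)
{
	printf("one every 2us: %.2f us/batch\n", us_per_batch(100000, 0));
	printf("one every 4us: %.2f us/batch\n", us_per_batch(100000, 4));
	return 0;
}

With a long enough run, the first case reports roughly 3us/batch (the
GPU execution time, with the driver and hardware overheads hidden by
coalescing) and the second roughly 10us/batch, matching the two
scenarios described in the quoted analysis.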