Quoting Tvrtko Ursulin (2019-04-30 09:55:59)
>
> On 29/04/2019 19:00, Chris Wilson wrote:
> > Asking the GPU to busywait on a memory address, perhaps not unexpectedly
> > in hindsight for a shared system, leads to bus contention that affects
> > CPU programs trying to concurrently access memory. This can manifest as
> > a drop in transcode throughput on highly over-saturated workloads.
> >
> > The only clue offered by perf is that the bus-cycles (perf stat -e
> > bus-cycles) jumped by 50% when enabling semaphores. This corresponds
> > with extra CPU active cycles being attributed to intel_idle's mwait.
> >
> > This patch introduces a heuristic to try and detect when more than one
> > client is submitting to the GPU, pushing it into an oversaturated state.
> > As we already keep track of when the semaphores are signalled, we can
> > inspect their state on submitting the busywait batch and, if we planned
> > to use a semaphore but were too late, conclude that the GPU is
> > overloaded and not try to use semaphores in future requests. In
> > practice, this means we optimistically try to use semaphores for the
> > first frame of a transcode job split over multiple engines, and fail if
> > there are multiple clients active and continue not to use semaphores for
> > the subsequent frames in the sequence, periodically trying to
> > optimistically switch semaphores back on whenever the client waits to
> > catch up with the transcode results.
> >
> > [snipped long benchmark results]
> >
> > Indicating that we've recovered the regression from enabling semaphores
> > on this saturated setup, with a hint towards an overall improvement.
> >
> > Very similar, but of smaller magnitude, results are observed on both
> > Skylake (gt2) and Kabylake (gt4). This may be due to the reduced impact of
> > bus-cycles: where we see a 50% hit on Broxton, it is only 10% on the big
> > core in this particular test.
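The heuristic described in the quoted commit message can be sketched in plain C. This is a minimal illustration, not the actual i915 implementation: the names (`engine_ctx`, `note_submit`, `note_client_wait`) and the state encoding are hypothetical, standing in for the driver's real semaphore-signalling bookkeeping.

```c
/* Illustrative sketch of the oversaturation heuristic; all names are
 * invented for this example and do not match the i915 driver code. */
#include <stdbool.h>

enum sema_state { SEMA_IDLE, SEMA_WAIT, SEMA_SIGNALED };

struct engine_ctx {
	bool use_semaphores;	  /* optimistically start with semaphores on */
	enum sema_state last_sema; /* state of the semaphore we meant to spin on */
};

/*
 * On submitting the busywait batch: if the semaphore we planned to spin
 * on has already been signalled, we were too late -- the busywait would
 * only burn bus cycles, so conclude the GPU is oversaturated and stop
 * emitting semaphores for subsequent requests.
 */
static void note_submit(struct engine_ctx *e)
{
	if (e->use_semaphores && e->last_sema == SEMA_SIGNALED)
		e->use_semaphores = false;
}

/*
 * When the client stalls waiting to catch up with the transcode results,
 * optimistically switch semaphores back on for the next frame.
 */
static void note_client_wait(struct engine_ctx *e)
{
	e->use_semaphores = true;
}
```

The effect is the behaviour described above: semaphores are tried for the first frame, abandoned while multiple clients keep the GPU saturated, and retried whenever the client waits.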
> >
> > One observation to make here is that for a greedy client trying to
> > maximise its own throughput, using semaphores is the right choice. It is
> > only in the holistic system-wide view, where the semaphores of one client
> > impact another and reduce the overall throughput, that we would choose
> > to disable semaphores.
>
> Since we acknowledge the problem is the shared nature of the iGPU, my
> concern is that we still cannot account for both partners here when
> deciding to omit semaphore emission. In other words, we trade bus
> throughput for submission latency.
>
> Assuming a light GPU task (in the sense of not oversubscribing, but with
> ping-pong inter-engine dependencies), simultaneous to a heavier CPU
> task, our latency improvement still imposes a performance penalty on the
> latter.

Maybe, maybe not. I think you have to be in the position where there is
no GPU latency to be gained for the increased bus traffic to lose.

> For instance, a consumer-level single-stream transcoding session with a
> CPU-heavy part of the pipeline, or a CPU-intensive game.
>
> (Ideally we would need a bus saturation signal to feed into our logic,
> not just engine saturation. Which I don't think is possible.)
>
> So I am still leaning towards being cautious and just abandoning
> semaphores for now.

Being greedy, the single-consumer case is compelling. The same benchmarks
see a 5-10% throughput improvement for the single client (depending on
machine).
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx