On Mon, 13 May 2019 13:48:08 +0100 Steven Price <steven.price@xxxxxxx> wrote: > On 12/05/2019 14:38, Boris Brezillon wrote: > > On Sat, 11 May 2019 15:32:20 -0700 > > Alyssa Rosenzweig <alyssa@xxxxxxxxxxxxx> wrote: > > > >> Hi all, > >> > >> As Steven Price explained, the "GPU top" kbase approach is often more > >> useful and accurate than per-draw timing. > >> > >> For a 3D game inside a GPU-accelerated desktop, the games' counters > >> *should* include desktop overhead. This external overhead does affect > >> the game's performance, especially if the contexts are competing for > >> resources like memory bandwidth. An isolated sample is easy to achieve > >> running only the app of interest; in ideal conditions (zero-copy > >> fullscreen), desktop interference is negligible. > >> > >> For driver developers, the system-wide measurements are preferable, > >> painting a complete system picture and avoiding disruptions. There is no > >> risk of confusion, as the driver developers understand how the counters > >> are exposed. Further, benchmarks rendering direct to a GBM surface are > >> available (glmark2-es2-drm), eliminating interference even with poor > >> desktop performance. > >> > >> For app developers, the confusion of multi-context interference is > >> unfortunate. Nevertheless, if enabling counters were to slow down an > >> app, the confusion could be worse. Consider second-order changes in the > >> app's performance characteristics due to slowdown: if techniques like > >> dynamic resolution scaling are employed, the counters' results can be > >> invalid. Likewise, even if the lower-performance counters are > >> appropriate for purely graphical workloads, complex apps with variable > >> CPU overhead (e.g. from an FPS-dependent physics engine) can further > >> confound counters. Low-overhead system-wide measurements mitigate these > >> concerns. > > > > I'd just like to point out that dumping counters the way > > mali_kbase/gator does likely has an impact on perfs (at least on some > > GPUs) because of the clean+invalidate-cache that happens before (or > > after, I don't remember) each dump. IIUC and this cache is actually > > The clean is required after the dump to ensure that the CPU can see the > dumped counters (otherwise they could be stuck in the GPU's cache). Just to make it clear, I was not questioning the need for a cache clean here, I was simply pointing out that dumping counters is not free even when we don't add those serialization points. Note that dumping counters also has an impact on the cache-miss counter of future dumps, but there's nothing we can do about that no matter the solution we go for. > Note > that if you don't immediately need to observe the values the clean can > be deferred. E.g. if each dump you target a different memory region then > you could perform a flush only after several dumps. Correct. Having a pool of dump buffers might be an option if we need to minimize the impact of dumps at some point. > > > global and not a per address space thing (which would be possible if the > > cache lines contain a tag attaching them to a specific address space), > > that means all jobs running when the clean+invalidate happens will have > > extra cache misses after each dump. Of course, that's not as invasive as > > the full serialization that happens with my solution, but it's not free > > either. > > Cache cleans (at the L2 cache level) are global. There are L1 caches > (and TLBs) but they are not relevant to counter dumping. Okay. > > In practice cache cleans are not that expensive most of the time. As you > note - mali_kbase doesn't bother to avoid them when doing counter capture. > > It should also be unnecessary to "invalidate" - a clean should be > sufficient. The invalidate is only required if e.g. MMU page tables have > been modified as well. I already dropped the invalidate in my patch as we only have a global address space right now and the buffer used to store perfcnt values is always the same. We'll have to rework that when support for per-context address space is added. > That would reduce the affect on currently running > jobs. But I notice that mali_kbase doesn't appear to use the "clean > only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being > broken though. 'clean-only' seems to work fine on T860 at least. > > >> > >> As Rob Clark suggested, system-wide counters could be exposed via a > >> semi-standardized interface, perhaps within debugfs/sysfs. The interface > >> could not be completely standard, as the list of counters exposed varies > >> substantially by vendor and model. Nevertheless, the mechanics of > >> discovering, enabling, reading, and disabling counters can be > >> standardized, as can a small set of universally meaningful counters like > >> total GPU utilization. This would permit a vendor-independent GPU top > >> app as suggested, as is I believe currently possible with > >> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali) > >> > >> It looks like this discussion is dormant. Could we try to get this > >> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable > >> to diagnose without access to the counters, so I'm eager for a mainline > >> solution to be implemented. > > > > I spent a bit of time thinking about it and looking at different > > solutions. > > > > debugfs/sysfs might not be the best solution, especially if we think > > about the multi-user case (several instances of GPU perfmon tool > > running in parallel), if we want it to work properly we need a way to > > instantiate several perf monitors and let the driver add values to all > > active perfmons everytime a dump happens (no matter who triggered the > > dump). That's exactly what mali_kbase/gator does BTW. That's achievable > > through debugs if we consider exposing a knob to instantiate such > > perfmon instances, but that also means risking perfmon leaks if the > > user does not take care of killing the perfmon it created when it's done > > with it (or when it crashes). It might also prove hard to expose that to > > non-root users in a secure way. > > Note that mali_kbase only has to maintain multiple sets of values > because it provides backwards compatibility to an old version of the > driver. You could maintain one set of counter values and simply ensure > they are big enough not to wrap (or handle wrapping, but that is ugly > and unlikely to happen). Very early versions of kbase simply exposed the > hardware functionality (dump counters and reset) and therefore faced the > issue that there could only ever be one user of counters at a time. This > was "fixed" by emulating the existence of multiple interfaces (vinstr - > "virtual" instrumentation). If I was redesigning it from scratch then > I'd definitely remove the "reset" part of the interface and leave it for > the clients to latch the first value (of each counter) and subtract that > from subsequent counter reads. Noted. > > Sadly the hardware doesn't help here and some (most?) GPU revisions will > saturate the counters rather than wrapping. Hence the need to frequently > poll and reset the counters, even if you only want the 'start'/'end' values. Not sure wrapping would be better, as you can't tell how many times you've reached the max val. BTW, counters are reset every time we do a dump, right? If that's the case, there's no risk to have end_val < start_val and the only risk is to get inaccurate values when counter_val > U32_MAX, but reaching U32_MAX already means that this event occurred a lot during the monitoring period. > > Having only one set of counters might simplify the potential for "leaks" > - although you still want to ensure that if nobody cares you stop > collecting counters... Handling that with a debugfs-based iface implies implementing some heuristic like 'counters haven't been dumped for a long time, let's stop monitoring them', and I'm not a big fan of such things. This being said, I think I'll go for a simple debugfs-based iface to unblock Alyssa. debugfs is not part of the stable-ABI and I guess we can agree on explicitly marking it as "unstable" so that when we settle on a generic interface to expose such counters we can get rid of the old one. _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel