On Sat, 11 May 2019 15:32:20 -0700
Alyssa Rosenzweig <alyssa@xxxxxxxxxxxxx> wrote:

> Hi all,
>
> As Steven Price explained, the "GPU top" kbase approach is often more
> useful and accurate than per-draw timing.
>
> For a 3D game inside a GPU-accelerated desktop, the game's counters
> *should* include desktop overhead. This external overhead does affect
> the game's performance, especially if the contexts are competing for
> resources like memory bandwidth. An isolated sample is easy to achieve
> by running only the app of interest; in ideal conditions (zero-copy
> fullscreen), desktop interference is negligible.
>
> For driver developers, the system-wide measurements are preferable,
> painting a complete system picture and avoiding disruptions. There is
> no risk of confusion, as driver developers understand how the counters
> are exposed. Further, benchmarks rendering directly to a GBM surface
> are available (glmark2-es2-drm), eliminating interference even with
> poor desktop performance.
>
> For app developers, the confusion of multi-context interference is
> unfortunate. Nevertheless, if enabling counters were to slow down an
> app, the confusion could be worse. Consider second-order changes in
> the app's performance characteristics due to slowdown: if techniques
> like dynamic resolution scaling are employed, the counters' results
> can be invalid. Likewise, even if the lower-performance counters are
> appropriate for purely graphical workloads, complex apps with variable
> CPU overhead (e.g. from an FPS-dependent physics engine) can further
> confound counters. Low-overhead system-wide measurements mitigate
> these concerns.

I'd just like to point out that dumping counters the way
mali_kbase/gator does likely has an impact on performance (at least on
some GPUs) because of the cache clean+invalidate that happens before
(or after, I don't remember) each dump. IIUC, and assuming this cache
is actually global and not a per-address-space thing (which would be
possible if the cache lines contain a tag attaching them to a specific
address space), all jobs running when the clean+invalidate happens will
see extra cache misses after each dump. Of course, that's not as
invasive as the full serialization that happens with my solution, but
it's not free either (see the rough sketch of the dump sequence further
down).

> As Rob Clark suggested, system-wide counters could be exposed via a
> semi-standardized interface, perhaps within debugfs/sysfs. The
> interface could not be completely standard, as the list of counters
> exposed varies substantially by vendor and model. Nevertheless, the
> mechanics of discovering, enabling, reading, and disabling counters
> can be standardized, as can a small set of universally meaningful
> counters like total GPU utilization. This would permit a
> vendor-independent GPU top app as suggested, as is I believe currently
> possible with vendor-specific downstream kernels (e.g. via
> Gator/Streamline for Mali).
>
> It looks like this discussion is dormant. Could we try to get this
> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
> to diagnose without access to the counters, so I'm eager for a
> mainline solution to be implemented.

I spent a bit of time thinking about it and looking at different
solutions.
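To make the dump overhead mentioned above a bit more concrete before
going through the options: a kbase-style counter dump roughly boils
down to the sequence below. This is only an illustrative sketch, all
register, command and helper names are made up (not the real
kbase/Panfrost symbols), and the clean+invalidate may well happen
before the sample instead of after it, but the effect on other running
jobs is the same either way.

/*
 * Illustrative sketch only: gpu_write(), wait_for_irq() and the
 * register/command names below are placeholders.
 */
void gpu_write(unsigned int reg, unsigned int val);  /* MMIO write (stub) */
void wait_for_irq(unsigned int mask);                /* wait for a GPU IRQ (stub) */

#define GPU_COMMAND                  0x30        /* made-up register offset */
#define   GPU_CMD_SAMPLE_COUNTERS    0x01        /* made-up command encodings */
#define   GPU_CMD_CLEAN_INV_CACHES   0x02
#define GPU_IRQ_SAMPLE_DONE          (1 << 0)    /* made-up IRQ bits */
#define GPU_IRQ_CLEAN_CACHES_DONE    (1 << 1)

static void perfcnt_dump(void)
{
        /* Ask the GPU to write the current counter values to the dump buffer. */
        gpu_write(GPU_COMMAND, GPU_CMD_SAMPLE_COUNTERS);
        wait_for_irq(GPU_IRQ_SAMPLE_DONE);

        /*
         * Clean+invalidate the GPU caches so the counter values actually
         * land in memory where the CPU can read them. This is global, so
         * every job running at that point loses its cached lines and takes
         * extra misses afterwards; that's the hidden cost.
         */
        gpu_write(GPU_COMMAND, GPU_CMD_CLEAN_INV_CACHES);
        wait_for_irq(GPU_IRQ_CLEAN_CACHES_DONE);

        /* The dump buffer can now be handed to whoever requested the sample. */
}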
debugfs/sysfs might not be the best solution, especially if we think
about the multi-user case (several instances of a GPU perfmon tool
running in parallel). If we want it to work properly, we need a way to
instantiate several perf monitors and let the driver add values to all
active perfmons every time a dump happens (no matter who triggered the
dump). That's exactly what mali_kbase/gator does BTW. That's achievable
through debugfs if we consider exposing a knob to instantiate such
perfmon instances, but that also means risking perfmon leaks if the
user does not take care of killing the perfmon it created when it's
done with it (or when it crashes). It might also prove hard to expose
that to non-root users in a secure way.

I also had a quick look at the perf_event interface to see if we could
extend it to support monitoring GPU events. I might be wrong, as I
didn't spend much time investigating how it works, but it seems that
perf counters are saved/dumped/restored at each thread context switch,
which is not what we want here (it might add extra perfcnt dump points,
thus impacting GPU performance more than we expect).

So maybe the best option is a pseudo-generic ioctl-based interface to
expose those perf counters, something along the lines of the sketch
below.
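This is only a rough uapi sketch to make the idea concrete: the
DRM_XXX_* names, ioctl numbers and struct layouts are placeholders, not
an existing interface. The idea is that each perfmon selects the
counters it cares about at creation time, the driver adds the counter
deltas to every active perfmon each time a hardware dump happens (no
matter who triggered it), and perfmons are tied to the DRM file they
were created on, so they are automatically released when that file is
closed, which avoids the leak problem mentioned above and lets anyone
with access to the render node use them.

#include <linux/types.h>
#include "drm.h"

struct drm_xxx_perfmon_create {
        __u64 counters_ptr;     /* userspace array of counter IDs to monitor */
        __u32 counters_count;   /* number of entries in that array */
        __u32 handle;           /* perfmon handle returned by the kernel */
};

struct drm_xxx_perfmon_destroy {
        __u32 handle;
        __u32 pad;
};

struct drm_xxx_perfmon_get_values {
        __u32 handle;
        __u32 pad;
        __u64 values_ptr;       /* userspace array of __u64, one per counter,
                                 * filled with the values accumulated since
                                 * the perfmon was created */
};

#define DRM_XXX_PERFMON_CREATE          0x10
#define DRM_XXX_PERFMON_DESTROY         0x11
#define DRM_XXX_PERFMON_GET_VALUES      0x12

#define DRM_IOCTL_XXX_PERFMON_CREATE \
        DRM_IOWR(DRM_COMMAND_BASE + DRM_XXX_PERFMON_CREATE, \
                 struct drm_xxx_perfmon_create)
#define DRM_IOCTL_XXX_PERFMON_DESTROY \
        DRM_IOW(DRM_COMMAND_BASE + DRM_XXX_PERFMON_DESTROY, \
                struct drm_xxx_perfmon_destroy)
#define DRM_IOCTL_XXX_PERFMON_GET_VALUES \
        DRM_IOWR(DRM_COMMAND_BASE + DRM_XXX_PERFMON_GET_VALUES, \
                 struct drm_xxx_perfmon_get_values)

A GPU-top style tool would then create one perfmon for the counters it
is interested in, call GET_VALUES periodically, and destroy it on exit;
several such tools can run in parallel without stepping on each other
since every dump feeds all active perfmons.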