On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote:
> On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
> <boris.brezillon@xxxxxxxxxxxxx> wrote:
> >
> > +Rob, Eric, Mark and more
> >
> > Hi,
> >
> > On Fri, 5 Apr 2019 16:20:45 +0100
> > Steven Price <steven.price@xxxxxxx> wrote:
> >
> > > On 04/04/2019 16:20, Boris Brezillon wrote:
> > > > Hello,
> > > >
> > > > This patch adds new ioctls to expose GPU counters to userspace.
> > > > These will be used by the mesa driver (should be posted soon).
> > > >
> > > > A few words about the implementation: I followed the VC4/Etnaviv
> > > > model where perf counters are retrieved on a per-job basis. This
> > > > allows one to get accurate results when several users are using
> > > > the GPU concurrently.
> > > > AFAICT, the mali kbase is using a different approach where several
> > > > users can register a performance monitor but with no way to have
> > > > fine-grained control over which job/GPU-context to track.
> > >
> > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1
> > > can be from different contexts (address spaces), and mali_kbase also
> > > fully uses the _NEXT registers. So there can be a job from one
> > > context executing on slot 0 and a job from a different context
> > > waiting in the _NEXT registers. (And the same for slot 1.) This
> > > means that there's no (visible) gap between the first job finishing
> > > and the second job starting. Early versions of the driver even had a
> > > throttle to avoid interrupt storms (see JOB_IRQ_THROTTLE) which
> > > would further delay the IRQ - but thankfully that's gone.
> > >
> > > The upshot is that it's basically impossible to measure "per-job"
> > > counters when running at full speed, because multiple jobs are
> > > running and the driver doesn't actually know when one ends and the
> > > next starts.
> > >
> > > Since one of the primary use cases is to draw pretty graphs of the
> > > system load [1], this "per-job" information isn't all that relevant
> > > (and minimal performance overhead is important). And if you want to
> > > monitor just one application, it is usually easiest to ensure that
> > > it is the only thing running.
> > >
> > > [1] https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> > >
> > > > This design choice comes at a cost: every time the perfmon context
> > > > changes (the perfmon context is the list of currently active
> > > > perfmons), the driver has to add a fence to prevent new jobs from
> > > > corrupting counters that will be dumped by previous jobs.
> > > >
> > > > Let me know if that's an issue and if you think we should approach
> > > > things differently.
> > >
> > > It depends what you expect to do with the counters. Per-job counters
> > > are certainly useful sometimes, but serialising all jobs can mess up
> > > the very thing you are trying to measure the performance of.
> >
> > I finally found some time to work on v2 this morning, and it turns
> > out that implementing global perf monitors as done in mali_kbase
> > means rewriting almost everything (apart from the perfcnt layout
> > stuff). I'm not against doing that, but I'd like to be sure this is
> > really what we want.
> >
> > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > through the pipe_query/AMD_perfmon interface if we don't have this
> > job (or at least draw-call) granularity? If not, should we keep the
> > solution I'm proposing here to make sure counter values are accurate,
> > or should we expose perf counters through a non-standard API?
>
> I think if you can't do per-draw granularity, then you should not try
> to implement AMD_perfmon; instead the use case is more for a sort of
> "gpu top" app (as opposed to something like frameretrace, which takes
> per-draw-call level measurements from within the app). Things that use
> AMD_perfmon are going to, I think, expect to query values between
> individual glDraw calls, and you probably don't want to flush tile
> passes 500 times per frame.
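For context, the usage pattern Rob is describing looks roughly like the
sketch below using GL_AMD_performance_monitor. It assumes a current GL
context with the extension loaded, and the group/counter IDs are
placeholders for values a real client would enumerate at runtime:

    /* Minimal AMD_perfmon sketch: wrap a single draw and read the
     * counters back.  "group" and "counter" stand in for IDs that a
     * real client would enumerate with glGetPerfMonitorGroupsAMD() /
     * glGetPerfMonitorCountersAMD(). */
    GLuint group = 0, counter = 0;      /* placeholder IDs */
    GLuint mon;

    glGenPerfMonitorsAMD(1, &mon);
    glSelectPerfMonitorCountersAMD(mon, GL_TRUE, group, 1, &counter);

    glBeginPerfMonitorAMD(mon);
    glDrawArrays(GL_TRIANGLES, 0, 3);   /* one draw call */
    glEndPerfMonitorAMD(mon);

    /* Getting a result for just that draw is what forces the driver
     * to flush the tile pass before the next draw can be counted. */
    GLuint avail = 0;
    while (!avail)
        glGetPerfMonitorCounterDataAMD(mon, GL_PERFMON_RESULT_AVAILABLE_AMD,
                                       sizeof(avail), &avail, NULL);

With a per-draw loop like this, a tiler would have to flush once per
monitored draw, which is exactly the "500 times per frame" cost above.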
>
> (Although, I suppose if there are multiple GPUs where perfcntrs work
> this way, it might be an interesting exercise to think about coming up
> with a standardized API (debugfs perhaps?) to monitor the counters, so
> you could have a single userspace tool that works across several
> different drivers.)

I agree. We've been pondering a lot of the same issues for Adreno, and
I would be greatly interested in seeing if we could come up with a
standard solution we can use.

Jordan

> >
> > BTW, I'd like to remind you that serialization (waiting on the
> > perfcnt fence) only happens if we have a perfmon context change
> > between 2 consecutive jobs, which only happens when
> >
> >  * 2 applications are running in parallel and at least one of them
> >    is monitored,
> >  * or when userspace decides to stop monitoring things and dump
> >    counter values.
> >
> > That means that, for the usual case (all perfmons disabled), there's
> > almost zero overhead (just a few more checks in the submit-job code).
> > That also means that, if we ever decide to support global perfmons
> > (perf monitors that track things globally) on top of the current
> > approach, and only global perfmons are enabled, things won't be
> > serialized as with the per-job approach, because everyone will share
> > the same perfmon ctx (the same set of perfmons).
> >
> > I'd appreciate any feedback from people that have used perf counters
> > (or implemented a way to dump them) on their platform.
> >
> > Thanks,
> >
> > Boris
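As a footnote to Boris's point about overhead: the submit-path check he
describes could look something like the sketch below. All names here
(struct fields, helpers) are invented for illustration; this is not the
actual panfrost code, just the shape of the idea.

    /* Illustrative sketch only: invented names, not the real panfrost
     * submit path. */
    static void job_add_perfcnt_dep(struct panfrost_device *pfdev,
                                    struct panfrost_job *job)
    {
            /* Usual case: the perfmon ctx is unchanged (e.g. no
             * perfmon is enabled), so no fence is added and jobs keep
             * overlapping as before -- near-zero overhead. */
            if (job->perfmon_ctx == pfdev->last_perfmon_ctx)
                    return;

            /* Perfmon ctx changed: make this job wait on the perfcnt
             * fence so earlier jobs can dump their counters before new
             * jobs overwrite them. */
            job_add_in_fence(job,
                             perfcnt_fence_get(pfdev->last_perfmon_ctx));
            pfdev->last_perfmon_ctx = job->perfmon_ctx;
    }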