On Fri, 8 Mar 2024 17:02:15 +0000 Liviu Dudau <liviu.dudau@xxxxxxx> wrote: > On Fri, Mar 08, 2024 at 04:15:15PM +0100, Boris Brezillon wrote: > > On Fri, 8 Mar 2024 13:52:18 +0000 > > Liviu Dudau <liviu.dudau@xxxxxxx> wrote: > > > > > Hi Adrián, > > > > > > Thanks for the patch and appologies for taking a bit longer to respond, > > > I was trying to gather some internal Arm feedback before replying. > > > > > > On Tue, Mar 05, 2024 at 04:58:16PM +0000, Adrián Larumbe wrote: > > > > This brings in support for Panthor's HW performance counters and querying > > > > them from UM through a specific ioctl(). The code is inspired by existing > > > > functionality for the Panfrost driver, with some noteworthy differences: > > > > > > > > - Sample size is now reported by the firmware rather than having to reckon > > > > it by hand > > > > - Counter samples are chained in a ring buffer that can be accessed > > > > concurrently, but only from threads within a single context (this is > > > > because of a HW limitation). > > > > - List of enabled counters must be explicitly told from UM > > > > - Rather than allocating the BO that will contain the perfcounter values > > > > in the render context's address space, the samples ring buffer is mapped > > > > onto the MCU's VM. > > > > - If more than one thread within the same context tries to dump a sample, > > > > then the kernel will copy the same frame to every single thread that was > > > > able to join the dump queue right before the FW finished processing the > > > > sample request. > > > > - UM must provide a BO handle for retrieval of perfcnt values rather > > > > than passing a user virtual address. > > > > > > > > The reason multicontext access to the driver's perfcnt ioctl interface > > > > isn't tolerated is because toggling a different set of counters than the > > > > current one implies a counter reset, which also messes up with the ring > > > > buffer's extraction and insertion pointers. This is an unfortunate > > > > hardware limitation. > > > > > > There are a few issues that we would like to improve with this patch. > > > > > > I'm not sure what user space app has been used for testing this (I'm guessing > > > gputop from igt?) but whatever is used it needs to understand the counters > > > being exposed. > > > > We are using perfetto to expose perfcounters. > > > > > In your patch there is no information given to the user space > > > about the layout of the counters and / or their order. Where is this being > > > planned to be defined? > > > > That's done on purpose. We want to keep the kernel side as dumb as > > possible so we don't have to maintain a perfcounter database there. All > > the kernel needs to know is how much data it should transfer pass to > > userspace, and that's pretty much it. > > I was not thinking about a perfcounter database but hints about counter value > size. In the future we might have 64bits counters and you will not be able to > tell only from the sample size if you're talking with an old firmware or not. No, but the FW should tell us through its versioning scheme :P. I mean, if the counter semantics for a given GPU changes with the FW version, we should expose the FW version to userspace, and let userspace maintain proper <gpu_id,fw_version> => perfcnt_defs mappings. This being said, I'm a bit concerned by this kind of non-backward compatible ABI changes. If we're not careful, we will end up in the same situation the proprietary PowerVR driver was, where userspace and FW have to match for things to work properly, and that's not great... > (Read: future GPUs are likely to go bigger on number of perf counters before > they gain higher definition). > > > > > > Long term, I think it would be good to have details > > > about the counters available. > > > > The perfcounter definitions are currently declared in mesa (see the G57 > > perfcounter definitions for instance [1]). Mesa also contains a perfetto > > datasource for Panfrost [2]. > > Sorry, I've missed that perfetto got updated. > > > > > > > > > The other issue we can see is with the perfcnt_process_sample() and its > > > handling of threshold event and overflows. If the userspace doesn't sample > > > quick enough and we cross the threshold we are going to lose samples and > > > there is no mechanism to communicate to user space that the values they > > > are getting have gaps. I believe the default mode for the firmware is to > > > give you counter values relative to the last read value, so if you drop > > > samples you're not going to make any sense of the data. > > > > If we get relative values, that's indeed a problem. I thought we were > > getting absolute values though, in which case, if you miss two 32-bit > > wraparounds, either your sampling rate is very slow, or events occur at > > a high rate. > > First CSF GPUs have some erratas around perf counters that firmware tries to > hide. I need to dig a bit deeper into what firmware does for each GPU before > confirming the counting mode. > > The main issue with the code is that we should not be dropping samples at > all, even if they are absolute values, as there will be gaps in the analysis. The only events that might trigger automatic sampling are power state transitions and protection mode entry/exit. With PM being forced to always on at the moment, I was assuming we wouldn't care about the intermediate results if we get absolute values. Of course, that's completely different if we're passed relative values. > Looking at perfcnt_process_sample(), it does not increase the sampling rate > if we reach the threshold, nor does it tell the user space to do so. From our > DDK experience, if you've reached the threshold with the app sampling > then you're likely to overflow as well, missing samples. Again, if counters are absolute, overflowing is okay, it's overflowing twice that's not okay, because then you lose a full 2^32 offset (or more if you missed more than two overflows). Anyway, if we want to make sure perfcnt users get all the samples, we're probably better off providing a uAPI that mimics the FW interface. That is, allow userspace to manage the ringbuf directly (with a new ioctl to control the ringbuf head/tail and lets the kernel report low threshold/sample lost events) instead of keeping this ringbuf-slot -> user-provided-buffer copy we have right now.