Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Robert Bragg <robert@xxxxxxxxxxxxx> · Tue, 3 May 2016 20:34:46 +0100

Sorry for the delay replying to this, I missed it.

On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@xxxxxxx> wrote:
On 20/04/16 17:23, Robert Bragg wrote:

Gen graphics hardware can be set up to periodically write snapshots of

performance counters into a circular buffer via its Observation

Architecture and this patch exposes that capability to userspace via the

i915 perf interface.

Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx>

Signed-off-by: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>

---

  drivers/gpu/drm/i915/i915_drv.h         |  56 +-

  drivers/gpu/drm/i915/i915_gem_context.c |  24 +-

  drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-

  drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++

  include/uapi/drm/i915_drm.h             |  70 ++-

  5 files changed, 1408 insertions(+), 20 deletions(-)

+

+

+       /* It takes a fairly long time for a new MUX configuration to

+        * be applied after these register writes. This delay

+        * duration was derived empirically based on the render_basic

+        * config but hopefully it covers the maximum configuration

+        * latency...

+        */

+       mdelay(100);

With such a HW and SW design, how can we ever hope to get any

kind of performance when we are trying to monitor different metrics on each

draw call? This may be acceptable for system monitoring, but it is problematic

for the GL extensions :s

Since it seems like we are going for a perf API, it means that for every change

of metrics, we need to flush the commands, wait for the GPU to be done, then

program the new set of metrics via an IOCTL, wait 100 ms, and then we may

resume rendering ... until the next change. We are talking about a latency of

6-7 frames at 60 Hz here... this is non-negligible...
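
As a rough check of the frame arithmetic above, here is a minimal sketch. The helper name is made up for illustration, and it counts only the fixed delay, not the flush and idle wait that would come on top of it:

```c
#include <assert.h>

/*
 * Whole frames that elapse during a stall of delay_ms at refresh_hz.
 * Hypothetical helper, for illustration only.
 */
unsigned frames_stalled(unsigned delay_ms, unsigned refresh_hz)
{
	assert(refresh_hz > 0);
	/* One frame lasts 1000 / refresh_hz ms, so divide the delay by that. */
	return (delay_ms * refresh_hz) / 1000;
}
```

For the 100 ms delay at 60 Hz this gives 6 whole frames, before any time spent flushing and waiting for the GPU to go idle.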

I understand that we have a ton of counters and we may hide latency by not

allowing using more than half of the counters for every draw call or frame, but

even then, this 100ms delay is killing this approach altogether.

Although I'm also really unhappy about having recently introduced this delay, its impact is typically amortized somewhat by keeping a configuration open as long as possible.

Even without this explicit delay here, the OA unit isn't suited to being reconfigured on a per draw call basis, though it is able to support per draw call queries with the same config.

The above assessment assumes wanting to change config between draw calls which is not something this driver aims to support - as the HW isn't really designed for that model.

E.g. in the case of INTEL_performance_query, the backend keeps the i915 perf stream open until all OA based query objects are deleted - so you have to be pretty explicit if you want to change config.
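
The lifetime rule described here could be modelled roughly as follows. This is only a sketch of the idea, not Mesa's code: the struct, the function names and the integer config IDs are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a backend's OA state; not Mesa's actual structs. */
struct oa_backend {
	bool stream_open;    /* is an i915 perf stream currently open? */
	int  current_config; /* metric set the open stream was opened with */
	int  n_queries;      /* live OA based query objects */
};

/* Returns true on success, false if a different config is still in use. */
bool oa_query_create(struct oa_backend *b, int config)
{
	if (b->stream_open && b->current_config != config)
		return false; /* existing queries must be deleted first */

	if (!b->stream_open) {
		/* The stream open, and so the MUX config delay, would go here. */
		b->stream_open = true;
		b->current_config = config;
	}
	b->n_queries++;
	return true;
}

void oa_query_delete(struct oa_backend *b)
{
	assert(b->n_queries > 0);
	if (--b->n_queries == 0)
		b->stream_open = false; /* last query gone, so the stream closes */
}
```

With this shape the config cost is paid once per stream open, and any number of queries against the same config reuse the open stream.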

Considering the sets available on Haswell:
* Render Metrics Basic
* Compute Metrics Basic
* Compute Metrics Extended
* Memory Reads Distribution
* Memory Writes Distribution
* Metric set SamplerBalance

Each of these configs can expose around 50 counters as a set.

A GL application is most likely just going to use the render basic set, and in the case of a tool like gputop/GPA, changing config would usually be driven by some user interaction to select a set of metrics, where even a 100ms delay will go unnoticed.

In case you aren't familiar with how the GL_INTEL_performance_query side of things works for OA counters: one thing to be aware of is that Mesa emits a separate MI_REPORT_PERF_COUNT command either side of a query, and each one writes all the counters for the current OA config (as configured via this i915 perf interface) to a buffer. In addition to collecting reports via MI_REPORT_PERF_COUNT, Mesa also configures the unit for periodic sampling to be able to account for potential counter overflow.
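
To illustrate why the periodic samples matter for overflow, here is a minimal sketch. It assumes 32-bit counter values that wrap at most once between consecutive samples; the function name and the flat sample array are invented for the example and are not how Mesa lays out its reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the per-interval deltas of one counter across a begin report,
 * any periodic reports in between, and an end report.
 * Assumption: the counter is 32 bits wide and wraps at most once
 * between two consecutive samples.
 */
uint64_t accumulate_counter(const uint32_t *samples, size_t n)
{
	uint64_t total = 0;

	assert(n >= 1);
	for (size_t i = 1; i < n; i++) {
		/* Unsigned subtraction is modulo 2^32, so one wrap is harmless. */
		total += (uint32_t)(samples[i] - samples[i - 1]);
	}
	return total;
}
```

With only the begin and end values, a counter that advanced by more than 2^32 over the query would be under-reported. Summing the deltas between intermediate samples recovers the full count as long as the sampling period is short enough.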

It also might be worth keeping in mind that per draw queries will anyway trash the pipelining of work, since it's necessary to put stalls between the draw calls to avoid conflated metrics (not to do with the details of this driver), so use cases will probably be limited to those that just want the draw call numbers but don't mind ruining overall frame/application performance. Periodic sampling or swap-to-swap queries would be better suited to cases that should minimize their impact.

To be honest, if it indeed is an HW bug, then the approach that Samuel Pitoiset

and I used for Nouveau, which involves pushing a handle representing a

pre-computed configuration to the command buffer so that a software method

can ask the kernel to reprogram the counters with as little idle time as

possible, would be useless as waiting for the GPU to be idle would usually not

take more than a few ms... which is nothing compared to waiting 100ms.

Yeah, I think this is a really quite different programming model to what the OA unit is geared for, even if we can somehow knock out this 100ms MUX config delay.

So, now, the elephant in the room, how can it take that long to apply the

change? Are the OA registers double buffered (NVIDIA's are, so that we can

reconfigure and start monitoring multiple counters at the same time)?

Based on my understanding of how the HW works internally I can see how some delay would be expected, but I can't currently fathom why it would need to be of this order of magnitude. The delay is currently based simply on experimentation: I was getting unit test failures at 10ms, due to invalid looking reports, but the tests ran reliably at 100ms.

OA configuration state isn't double buffered to allow configuration while in use.

Maybe this 100ms is the polling period and the HW does not allow changing

the configuration in the middle of a polling session. In this case, this delay

should be dependent on the polling frequency. But even then, I would really

hope that the HW would allow us to tear down everything, reconfigure and

start polling again without waiting for the next tick. If not possible, maybe we

can change the frequency for the polling clock to make the polling event happen

sooner.

The tests currently cover periods from 160ns to 168 milliseconds, while the delay required falls somewhere between 10 and 100 milliseconds. If this were the link, I think I'd expect the required delay to be greater than all the periods tested.
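
For reference, the two endpoints quoted here are consistent with a power-of-two timer. The sketch below assumes an 80 ns base tick with the period doubling per exponent step; that encoding is an assumption picked because it reproduces the 160 ns and roughly 168 ms figures, and the macro and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 80 ns base tick (a 12.5 MHz timestamp clock). */
#define OA_BASE_TICK_NS 80ULL

/* Assumed encoding: period = base tick * 2^(exponent + 1). */
uint64_t oa_exponent_to_period_ns(unsigned exponent)
{
	assert(exponent < 32);
	return OA_BASE_TICK_NS << (exponent + 1);
}
```

Under that assumption exponent 0 gives 160 ns and exponent 20 gives 167,772,160 ns, so the longest tested period is already longer than the 100 ms delay that makes the tests pass.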

Generally this seems unlikely to me, in part considering how the MUX isn't really part of the OA unit that handles periodic sampling. I wouldn't rule out some interaction though so some experimenting along these lines could be interesting.

HW delays are usually a few microseconds, not milliseconds, that really suggests

that something funny is happening and the HW design is not understood properly.

Yup.

Although I understand more about the HW than I can write up here, I can't currently see why the HW should ever really take this long to apply a MUX config - although I can see why some delay would be required.

It's on my list of things to try and get feedback/ideas on from the OA architect/HW engineers. I brought this up briefly some time ago but we didn't have time to go into details.

If the documentation has nothing on this and the HW teams cannot help, then I

suggest a little REing session 

There's no precisely documented delay requirement. Insofar as REing is the process of inferring how black box HW works through poking it with a stick and seeing how it reacts, then yep, more of that may be necessary. At least in this case the HW isn't really a black box (maybe stained glass), where I hopefully have a fairly good sense of how the HW is designed and can prod folks closer to the HW for feedback/ideas.

So far I haven't spent too long investigating this besides recently homing in on needing a delay here when my unit tests were failing.

I really want to see this work land, but the way I see

it right now is that we cannot rely on it because of this bug. Maybe fixing this bug

would require changing the architecture, so better address it before landing the

patches.

I think it's unlikely to change the architecture; rather we might find some other things to frob that make the MUX config apply faster (e.g. a clock gating issue), find a way to get explicit feedback of completion so we can minimize the delay, or gain a better understanding that lets us choose a shorter delay in most cases.
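
If some explicit completion signal did turn up, the fixed delay could in principle become a bounded poll. The sketch below is purely hypothetical: no such readiness check is known to exist for the MUX, so the ready callback, the fake hardware and the sleep hook are all stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*ready_fn)(void *ctx);   /* hypothetical "config applied?" check */
typedef void (*sleep_fn)(unsigned ms); /* sleep hook, so callers pick the wait */

/*
 * Poll ready() every step_ms until it reports true or timeout_ms passes.
 * Returns the milliseconds waited, or -1 on timeout.
 */
int wait_for_config(ready_fn ready, void *ctx, sleep_fn sleep_ms,
		    unsigned step_ms, unsigned timeout_ms)
{
	unsigned waited = 0;

	assert(step_ms > 0);
	for (;;) {
		if (ready(ctx))
			return (int)waited;
		if (waited >= timeout_ms)
			return -1;
		sleep_ms(step_ms);
		waited += step_ms;
	}
}

/* Fake hardware for trying the loop out: ready after a fixed number of polls. */
struct fake_hw {
	unsigned polls_until_ready;
};

bool fake_ready(void *ctx)
{
	struct fake_hw *hw = ctx;

	if (hw->polls_until_ready == 0)
		return true;
	hw->polls_until_ready--;
	return false;
}

/* No-op sleep so the example runs instantly. */
void fake_sleep(unsigned ms)
{
	(void)ms;
}
```

The timeout would keep today's 100 ms as the worst case while letting the common case finish as soon as the hardware reports completion.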

The driver is already usable with gputop with this delay, and considering how config changes are typically associated with user interaction I wouldn't see this as a show stopper - even though it's not ideal. I think the assertions about it being unusable with GL were a little overstated, being based on making frequent OA config changes, which is not really how the interface is intended to be used.

Thanks for starting to take a look through the code.

Kind Regards,
- Robert

Worst case scenario, do not hesitate to contact me if none of the proposed

explanations pan out, I will take the time to read through the OA material and try my

REing skills on it. As I said, I really want to see this upstream! 

Sorry...

Martin
