Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Robert Bragg <robert@xxxxxxxxxxxxx> · Wed, 4 May 2016 12:15:22 +0100

On Wed, May 4, 2016 at 10:04 AM, Martin Peres <martin.peres@xxxxxxxxxxxxxxx> wrote:
On 03/05/16 22:34, Robert Bragg wrote:

Sorry for the delay replying to this, I missed it.

No worries!

On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@xxxxxxx> wrote:

    On 20/04/16 17:23, Robert Bragg wrote:

        Gen graphics hardware can be set up to periodically write snapshots of
        performance counters into a circular buffer via its Observation
        Architecture and this patch exposes that capability to userspace via the
        i915 perf interface.

        Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
        Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx>
        Signed-off-by: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>

        ---
          drivers/gpu/drm/i915/i915_drv.h         |  56 +-
          drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
          drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
          drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
          include/uapi/drm/i915_drm.h             |  70 ++-
          5 files changed, 1408 insertions(+), 20 deletions(-)

        +
        +
        +       /* It takes a fairly long time for a new MUX configuration to
        +        * be applied after these register writes. This delay
        +        * duration was derived empirically based on the render_basic
        +        * config but hopefully it covers the maximum configuration
        +        * latency...
        +        */
        +       mdelay(100);

    With such a HW and SW design, how can we ever hope to get any kind of
    performance when we are trying to monitor different metrics on each
    draw call? This may be acceptable for system monitoring, but it is
    problematic for the GL extensions :s

    Since it seems like we are going for a perf API, it means that for
    every change of metrics, we need to flush the commands, wait for the
    GPU to be done, then program the new set of metrics via an IOCTL, wait
    100 ms, and then we may resume rendering ... until the next change. We
    are talking about a latency of 6-7 frames at 60 Hz here... this is
    non-negligible...

    I understand that we have a ton of counters and we may hide latency by
    not allowing the use of more than half of the counters for every draw
    call or frame, but even then, this 100 ms delay is killing this
    approach altogether.
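
The 6-7 frame figure quoted above can be checked with a line of arithmetic. The sketch below is only illustrative: `frames_stalled` is a hypothetical helper, not part of the patch, and it counts the mdelay(100) alone. The flush, GPU idle wait and ioctl presumably account for the remainder of the estimate.

```c
/*
 * Number of display frames that elapse while rendering is stalled.
 * Hypothetical helper for illustration; not part of the i915 patch.
 *
 * stall_ms:   length of the stall in milliseconds
 * refresh_hz: display refresh rate in frames per second
 */
double frames_stalled(double stall_ms, double refresh_hz)
{
	return stall_ms * refresh_hz / 1000.0;
}
```

Calling `frames_stalled(100.0, 60.0)` gives 6 frames for the delay by itself.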

Although I'm also really unhappy about introducing this delay recently,
the impact of the delay is typically amortized somewhat by keeping a
configuration open as long as possible.

Even without this explicit delay here the OA unit isn't suited to being
reconfigured on a per draw call basis, though it is able to support per
draw call queries with the same config.

The above assessment assumes wanting to change config between draw calls,
which is not something this driver aims to support - as the HW isn't
really designed for that model.

E.g. in the case of INTEL_performance_query, the backend keeps the i915
perf stream open until all OA based query objects are deleted - so you
have to be pretty explicit if you want to change config.

OK, I get your point. However, I still want to state that applications changing the set would see a disastrous effect, as 100 ms is enough to downclock both the CPU and GPU, and that would dramatically alter the metrics. Should we make it clear somewhere, either in INTEL_performance_query or as a warning in mesa_performance if changing the set while running? I would think the latter would be preferable as it could also cover the case of the AMD extension which, IIRC, does not talk about the performance cost of changing the metrics. With this caveat made clear, it seems reasonable.

Yeah a KHR_debug performance warning sounds like a good idea.

In case you aren't familiar with how the GL_INTEL_performance_query side
of things works for OA counters: one thing to be aware of is that
there's a separate MI_REPORT_PERF_COUNT command that Mesa writes either
side of a query, which writes all the counters for the current OA config
(as configured via this i915 perf interface) to a buffer. In addition to
collecting reports via MI_REPORT_PERF_COUNT, Mesa also configures the
unit for periodic sampling to be able to account for potential counter
overflow.
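
To illustrate why periodic samples help with overflow, here is a minimal sketch of folding successive raw 32-bit counter values into a 64-bit total. The struct and function names are hypothetical and this is not Mesa's actual code. It only shows the general technique, and assumes fewer than 2^32 increments happen between any two consecutive reports.

```c
#include <stdint.h>

/* Hypothetical accumulator for a single 32-bit counter. */
struct counter_accum {
	uint32_t last_raw;	/* raw value from the previous report */
	uint64_t total;		/* running 64-bit total of all deltas */
};

/* Begin accumulating from the report taken at the start of the query. */
void accum_start(struct counter_accum *acc, uint32_t raw)
{
	acc->last_raw = raw;
	acc->total = 0;
}

/*
 * Fold in the next report (periodic or end-of-query).
 *
 * Unsigned 32-bit subtraction is modulo 2^32, so the delta is still
 * correct if the raw counter wrapped once since the previous report.
 */
void accum_update(struct counter_accum *acc, uint32_t raw)
{
	acc->total += (uint32_t)(raw - acc->last_raw);
	acc->last_raw = raw;
}
```

With periodic reports arriving more often than the counter can wrap, each delta stays unambiguous even when the query as a whole spans several wraps.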

Oh, the overflow case is mean. Doesn't the spec mandate the application to read at least every second? This is the case for the timestamp queries.

For a Haswell GT3 system with 40 EUs @ 1 GHz, some aggregate EU counters may overflow their 32 bits in approximately 40 milliseconds. It should be pretty unusual to see a draw call last that long, but not unimaginable. Might also be a good draw call to focus on profiling too :-)

For Gen8+ a bunch of the A counters can be reported with 40 bits to mitigate this issue.
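
For a sense of scale, the sketch below relates counter width, increment rate and time to overflow. Both helpers are hypothetical. The rate is back-derived from the ~40 ms figure above rather than taken from hardware documentation, and applying the same rate to a 40-bit counter is an assumption made purely for comparison.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds until a counter of the given width wraps when it is
 * incremented at a steady rate (increments per second).
 * Hypothetical helper; bits must be below 64.
 */
double seconds_to_overflow(unsigned int bits, double rate_hz)
{
	assert(bits < 64);
	return (double)(UINT64_C(1) << bits) / rate_hz;
}

/*
 * Increment rate implied by a counter of the given width wrapping
 * after the given number of seconds. Hypothetical helper.
 */
double implied_rate_hz(unsigned int bits, double seconds)
{
	assert(bits < 64);
	return (double)(UINT64_C(1) << bits) / seconds;
}
```

Feeding `implied_rate_hz(32, 0.040)` into `seconds_to_overflow(40, ...)` gives roughly 10 seconds, since the 8 extra bits buy a factor of 256.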

It also might be worth keeping in mind that per draw queries will anyway
trash the pipelining of work, since it's necessary to put stalls between
the draw calls to avoid conflated metrics (not to do with the details of
this driver), so use cases will probably be limited to those that just
want the draw call numbers but don't mind ruining overall
frame/application performance. Periodic sampling or swap-to-swap queries
would be better suited to cases that should minimize their impact.

Yes, I agree that there will always be a cost, but with the design implemented in nouveau (which barely involves the CPU at all), the pipelining is almost unaffected. As in, monitoring every draw call with a different metric would lower the performance of glxgears (worst case I could think of) but still keep thousands of FPS.

I guess it just has different trade offs.

While it sounds like we typically have a higher cost to reconfigure OA (at least if touching the MUX), once the config is fixed (which can be done before measuring anything) I guess the pipelining for queries might be slightly better with MI_REPORT_PERF_COUNT commands than with something that requires interrupting and executing work on the CPU to switch config (even if that's cheaper than an OA re-config). I guess nouveau would have the same need to insert GPU pipeline stalls (just GPU syncing with GPU) to avoid conflating neighbouring draw call metrics, and maybe the bubbles from those can swallow the latency of the software methods.

glxgears might not really exaggerate draw call pipeline stall issues with only 6 cheap primitives per gear. glxgears hammers context switching more so than drawing anything. I think a pessimal case would be an app that depends on large numbers of draw calls per frame that each do enough real work that stalling for their completion is also measurable.

Funnily enough, enabling the OA unit with glxgears can be kind of problematic on Gen8+, which automatically writes reports on context switch, due to the spam of all of those context switch reports.

The driver is already usable with gputop with this delay and, considering
how config changes are typically associated with user interaction, I
wouldn't see this as a show stopper - even though it's not ideal. I
think the assertions about it being unusable with GL were a little
overstated, as they were based on making frequent OA config changes,
which is not really how the interface is intended to be used.

Yeah, with a performance warning in mesa, I would be OK with this change. Thanks for taking the time to explain!

A performance warning sounds like a sensible idea yup.

Regards,
- Robert

Thanks for starting to take a look through the code.

Kind Regards,

- Robert

Martin

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx