Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

On Wed, May 4, 2016 at 10:04 AM, Martin Peres <martin.peres@xxxxxxxxxxxxxxx> wrote:
On 03/05/16 22:34, Robert Bragg wrote:
Sorry for the delay replying to this, I missed it.

No worries!


On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@xxxxxxx> wrote:

    On 20/04/16 17:23, Robert Bragg wrote:

        Gen graphics hardware can be set up to periodically write snapshots of
        performance counters into a circular buffer via its Observation
        Architecture and this patch exposes that capability to userspace via the
        i915 perf interface.

        Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
        Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx>
        Signed-off-by: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>

        ---
          drivers/gpu/drm/i915/i915_drv.h         |  56 +-
          drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
          drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
          drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
          include/uapi/drm/i915_drm.h             |  70 ++-
          5 files changed, 1408 insertions(+), 20 deletions(-)

        +
        +
        +       /* It takes a fairly long time for a new MUX configuration to
        +        * be applied after these register writes. This delay
        +        * duration was derived empirically based on the render_basic
        +        * config but hopefully it covers the maximum configuration
        +        * latency...
        +        */
        +       mdelay(100);


    With such a HW and SW design, how can we ever hope to get any kind of
    performance when we are trying to monitor different metrics on each
    draw call? This may be acceptable for system monitoring, but it is
    problematic for the GL extensions :s


    Since it seems like we are going for a perf API, it means that for
    every change of metrics, we need to flush the commands, wait for the
    GPU to be done, then program the new set of metrics via an IOCTL,
    wait 100 ms, and then we may resume rendering ... until the next
    change. We are talking about a latency of 6-7 frames at 60 Hz here...
    this is non-negligible...


    I understand that we have a ton of counters and we may hide latency
    by not allowing more than half of the counters to be used for every
    draw call or frame, but even then, this 100 ms delay is killing this
    approach altogether.
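
As a rough check on the frame figure above, the stall can be converted to frames with integer arithmetic. This is only an illustrative sketch: `frames_elapsed` is a hypothetical helper, not something from the patch, and it counts only the fixed delay, ignoring the flush and ioctl overhead that the 6-7 frame estimate also includes.

```c
#include <assert.h>

/*
 * Whole display frames that elapse during a stall of stall_ms
 * milliseconds at a refresh rate of refresh_hz (hypothetical helper,
 * for illustration only).
 */
unsigned int frames_elapsed(unsigned int stall_ms, unsigned int refresh_hz)
{
    assert(refresh_hz > 0);
    /* frames = seconds * Hz, kept in integers by dividing last */
    return (stall_ms * refresh_hz) / 1000;
}
```

With the `mdelay(100)` from the patch, `frames_elapsed(100, 60)` gives 6 whole frames for the delay alone.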



Although I'm also really unhappy about introducing this delay recently,
the impact of the delay is typically amortized somewhat by keeping a
configuration open as long as possible.

Even without this explicit delay here the OA unit isn't suited to being
reconfigured on a per draw call basis, though it is able to support per
draw call queries with the same config.

The above assessment assumes wanting to change config between draw calls
which is not something this driver aims to support - as the HW isn't
really designed for that model.

E.g. in the case of INTEL_performance_query, the backend keeps the i915
perf stream open until all OA based query objects are deleted - so you
have to be pretty explicit if you want to change config.

OK, I get your point. However, I still want to state that applications changing the set would see a disastrous effect, as 100 ms is enough to downclock both the CPU and GPU and that would dramatically alter the metrics. Should we make it clear somewhere, either in the INTEL_performance_query or as a warning in mesa_performance if changing the set while running? I would think the latter would be preferable as it could also cover the case of the AMD extension which, IIRC, does not talk about the performance cost of changing the metrics. With this caveat made clear, it seems reasonable.

Yeah a KHR_debug performance warning sounds like a good idea.
 


In case you aren't familiar with how the GL_INTEL_performance_query side
of things works for OA counters; one thing to be aware of is that
there's a separate MI_REPORT_PERF_COUNT command that Mesa writes either
side of a query which writes all the counters for the current OA config
(as configured via this i915 perf interface) to a buffer. In addition to
collecting reports via MI_REPORT_PERF_COUNT Mesa also configures the
unit for periodic sampling to be able to account for potential counter
overflow.
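
To illustrate why intermediate periodic reports help with overflow, here is a minimal sketch of accumulating a free-running 32-bit counter across an ordered series of snapshots. It is not Mesa's code: `accumulate_counter` and the flat array of raw values are assumptions made for the example, and it relies on the counter wrapping at most once between neighbouring snapshots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the increase of a free-running 32-bit counter over n ordered
 * snapshots (begin report, any periodic reports, end report).
 * Hypothetical helper for illustration; assumes at most one wrap
 * between neighbouring snapshots.
 */
uint64_t accumulate_counter(const uint32_t *snapshots, size_t n)
{
    uint64_t total = 0;

    assert(snapshots != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        /* Unsigned 32-bit subtraction is modulo 2^32, so a single
         * wrap between two snapshots still yields the right delta. */
        uint32_t delta = snapshots[i] - snapshots[i - 1];

        total += delta;
    }
    return total;
}
```

With only a begin and an end snapshot, any increase of 2^32 or more is silently lost; adding periodic snapshots in between keeps each individual delta below the wrap limit, so the 64-bit total stays correct.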

Oh, the overflow case is mean. Doesn't the spec mandate the application to read at least every second? This is the case for the timestamp queries.

For a Haswell GT3 system with 40 EUs @ 1 GHz some aggregate EU counters may overflow their 32 bits in approximately 40 milliseconds. It should be pretty unusual to see a draw call last that long, but not unimaginable. Might also be a good draw call to focus on profiling too :-)

For Gen8+ a bunch of the A counters can be reported with 40 bits to mitigate this issue.
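
The overflow interval depends on how fast a given aggregate counter advances, which the thread does not spell out, so the sketch below leaves that rate as a parameter. `overflow_ms` and `events_per_eu_clock` are hypothetical names for illustration and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Milliseconds until a counter of counter_bits bits wraps, assuming it
 * aggregates events_per_eu_clock events per EU per clock across all EUs.
 * Hypothetical helper; the per-EU event rate is an assumption.
 */
uint64_t overflow_ms(unsigned int counter_bits, uint64_t eus,
                     uint64_t clock_hz, uint64_t events_per_eu_clock)
{
    assert(counter_bits > 0 && counter_bits <= 40);
    assert(eus > 0 && clock_hz > 0 && events_per_eu_clock > 0);

    uint64_t range = (uint64_t)1 << counter_bits;          /* counts before wrap */
    uint64_t rate = eus * clock_hz * events_per_eu_clock;  /* events per second */

    /* range * 1000 fits in 64 bits for counter_bits <= 40 */
    return range * 1000 / rate;
}
```

At one event per EU per clock, 40 EUs at 1 GHz wrap a 32-bit counter after about 107 ms, so the roughly 40 ms quoted above would correspond to a counter advancing faster than that. With 40 bits the same one-event rate takes around 27 seconds to wrap.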
 



It also might be worth keeping in mind that per draw queries will anyway
trash the pipelining of work, since it's necessary to put stalls between
the draw calls to avoid conflated metrics (not to do with the details of
this driver) so use cases will probably be limited to those that just
want the draw call numbers but don't mind ruining overall
frame/application performance. Periodic sampling or swap-to-swap queries
would be better suited to cases that should minimize their impact.

Yes, I agree that there will always be a cost, but with the design implemented in nouveau (which barely involves the CPU at all), the pipelining is almost unaffected. As in, monitoring every draw call with a different metric would lower the performance of glxgears (worst case I could think of) but still keep thousands of FPS.

I guess it just has different trade offs.

While it sounds like we typically have a higher cost to reconfigure OA (at least if touching the MUX), once the config is fixed (which can be done before measuring anything) I guess the pipelining for queries might be slightly better with MI_REPORT_PERF_COUNT commands than with something that requires interrupting and executing work on the CPU to switch config (even if that's cheaper than an OA re-config). I guess nouveau would have the same need to insert GPU pipeline stalls (just GPU syncing with GPU) to avoid conflating neighbouring draw call metrics, and maybe the bubbles from those can swallow the latency of the software methods.

glxgears might not really exaggerate draw call pipeline stall issues with only 6 cheap primitives per gear. glxgears hammers context switching more so than drawing anything. I think a pessimal case would be an app that depends on large numbers of draw calls per frame that each do enough real work that stalling for their completion is also measurable.

Funnily enough, enabling the OA unit with glxgears can be kind of problematic on Gen8+, which automatically writes reports on context switch, due to the spam of all those context switch reports.



The driver is already usable with gputop with this delay and considering
how config changes are typically associated with user interaction I
wouldn't see this as a show stopper - even though it's not ideal. I
think the assertions about it being unusable with GL, were a little
overstated based on making frequent OA config changes which is not
really how the interface is intended to be used.

Yeah, with a performance warning in mesa, I would be OK with this change. Thanks for taking the time to explain!

A performance warning sounds like a sensible idea yup.

Regards,
- Robert
 



Thanks for starting to take a look through the code.

Kind Regards,
- Robert

Martin

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
