Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Martin Peres <martin.peres@xxxxxxxxxxxxxxx> · Wed, 4 May 2016 12:04:21 +0300

On 03/05/16 22:34, Robert Bragg wrote:
Sorry for the delay replying to this, I missed it.

No worries!

On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@xxxxxxx> wrote:

    On 20/04/16 17:23, Robert Bragg wrote:

        Gen graphics hardware can be set up to periodically write snapshots of
        performance counters into a circular buffer via its Observation
        Architecture and this patch exposes that capability to userspace via the
        i915 perf interface.

        Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
        Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx>
        Signed-off-by: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>
        ---
          drivers/gpu/drm/i915/i915_drv.h         |  56 +-
          drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
          drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
          drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
          include/uapi/drm/i915_drm.h             |  70 ++-
          5 files changed, 1408 insertions(+), 20 deletions(-)

        +
        +
        +       /* It takes a fairly long time for a new MUX configuration to
        +        * be applied after these register writes. This delay
        +        * duration was derived empirically based on the render_basic
        +        * config but hopefully it covers the maximum configuration
        +        * latency...
        +        */
        +       mdelay(100);

    With such a HW and SW design, how can we ever hope to get any kind of
    performance when we are trying to monitor different metrics on each
    draw call? This may be acceptable for system monitoring, but it is
    problematic for the GL extensions :s

    Since it seems like we are going for a perf API, it means that for every
    change of metrics, we need to flush the commands, wait for the GPU to be
    done, then program the new set of metrics via an IOCTL, wait 100 ms, and
    then we may resume rendering ... until the next change. We are talking
    about a latency of 6-7 frames at 60 Hz here... this is non-negligible...
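
For reference, the frame arithmetic above can be sketched as below. This is
only an illustration: the helper name frames_in_stall is made up here and is
not part of the patch, and it counts only the fixed delay itself, not the
flush and wait-for-idle time that precedes it.

```c
#include <assert.h>

/* Hypothetical helper (not from the patch): number of whole frames that
 * fit into a stall of stall_ms milliseconds at a refresh rate of
 * refresh_hz. Equivalent to stall_ms / (1000 / refresh_hz), but kept in
 * integer arithmetic to avoid rounding the frame time first. */
unsigned int frames_in_stall(unsigned int stall_ms, unsigned int refresh_hz)
{
    assert(refresh_hz > 0);
    return (stall_ms * refresh_hz) / 1000;
}
```

With stall_ms = 100 and refresh_hz = 60 this gives 6 whole frames; a few more
milliseconds spent flushing and idling is enough to push that to 7.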

    I understand that we have a ton of counters and we may hide latency by
    not allowing using more than half of the counters for every draw call or
    frame, but even then, this 100ms delay is killing this approach
    altogether.

Although I'm also really unhappy about introducing this delay recently,
the impact of the delay is typically amortized somewhat by keeping a
configuration open as long as possible.

Even without this explicit delay here the OA unit isn't suited to being
reconfigured on a per draw call basis, though it is able to support per
draw call queries with the same config.

The above assessment assumes wanting to change config between draw calls
which is not something this driver aims to support - as the HW isn't
really designed for that model.

E.g. in the case of INTEL_performance_query, the backend keeps the i915
perf stream open until all OA based query objects are deleted - so you
have to be pretty explicit if you want to change config.

OK, I get your point. However, I still want to state that applications
changing the set would see a disastrous effect, as 100 ms is enough to
downclock both the CPU and GPU and that would dramatically alter the
metrics. Should we make it clear somewhere, either in the
INTEL_performance_query spec or as a warning in mesa_performance when the
set is changed while running? I would think the latter would be preferable
as it could also cover the case of the AMD extension which, IIRC, does not
talk about the performance cost of changing the metrics. With this
caveat made clear, it seems reasonable.

Considering the sets available on Haswell:
* Render Metrics Basic
* Compute Metrics Basic
* Compute Metrics Extended
* Memory Reads Distribution
* Memory Writes Distribution
* Metric set SamplerBalance

Each of these configs can expose around 50 counters as a set.

A GL application is most likely just going to use the render basic set,
and in the case of a tool like gputop/GPA, changing config would
usually be driven by some user interaction to select a set of metrics,
where even a 100ms delay will go unnoticed.

100 ms is becoming visible, but I agree, it would not be a show stopper 
for sure.

On the APITRACE side, this should not be an issue, because we do not 
change the set of metrics while running.

In case you aren't familiar with how the GL_INTEL_performance_query side
of things works for OA counters; one thing to be aware of is that
there's a separate MI_REPORT_PERF_COUNT command that Mesa writes either
side of a query which writes all the counters for the current OA config
(as configured via this i915 perf interface) to a buffer. In addition to
collecting reports via MI_REPORT_PERF_COUNT Mesa also configures the
unit for periodic sampling to be able to account for potential counter
overflow.
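
A minimal sketch of why the periodic samples help with overflow, assuming
32-bit raw counters purely for illustration. The function names are
hypothetical and this is not Mesa's actual code: the idea is only that
summing deltas between consecutive snapshots (begin report, periodic
samples, end report) stays correct as long as a counter wraps at most once
between any two neighbouring snapshots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delta between two raw 32-bit snapshots. Unsigned subtraction is
 * modulo 2^32, so a single wrap between prev and cur is handled. */
uint32_t counter_delta(uint32_t prev, uint32_t cur)
{
    return cur - prev;
}

/* Accumulate a 64-bit total over an ordered series of snapshots of one
 * counter: the begin report, any periodic samples, then the end report. */
uint64_t accumulate_counter(const uint32_t *snapshots, size_t n)
{
    uint64_t total = 0;

    assert(snapshots != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        total += counter_delta(snapshots[i - 1], snapshots[i]);
    return total;
}
```

Using only the begin and end reports would lose a multiple of 2^32 whenever
the counter wrapped more than once during the query; the intermediate
samples are what bound the interval between snapshots.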

Oh, the overflow case is mean. Doesn't the spec mandate the application 
to read at least every second? This is the case for the timestamp queries.

It also might be worth keeping in mind that per draw queries will anyway
trash the pipelining of work, since it's necessary to put stalls between
the draw calls to avoid conflated metrics (not to do with the details of
this driver) so use cases will probably be limited to those that just
want the draw call numbers but don't mind ruining overall
frame/application performance. Periodic sampling or swap-to-swap queries
would be better suited to cases that should minimize their impact.

Yes, I agree that there will always be a cost, but with the design
implemented in nouveau (which barely involves the CPU at all), the
pipelining is almost unaffected. As in, monitoring every draw call with
a different metric would lower the performance of glxgears (worst case I
could think of) but still keep thousands of FPS.

    To be honest, if it indeed is an HW bug, then the approach that Samuel
    Pitoiset and I used for Nouveau, which involves pushing a handle
    representing a pre-computed configuration to the command buffer so that
    a software method can ask the kernel to reprogram the counters with as
    little idle time as possible, would be useless, as waiting for the GPU
    to be idle would usually not take more than a few ms... which is nothing
    compared to waiting 100ms.

Yeah, I think this is a really quite different programming model to what
the OA unit is geared for, even if we can somehow knock out this 100ms
MUX config delay.

Too bad :)

    So, now, the elephant in the room, how can it take that long to apply
    the change? Are the OA registers double buffered (NVIDIA's are, so that
    we can reconfigure and start monitoring multiple counters at the same
    time)?

Based on my understanding of how the HW works internally I can see how
some delay would be expected, but can't currently fathom why it would
need to have this order of magnitude, and so the delay is currently
simply based on experimentation where I was getting unit test failures
at 10ms, for invalid looking reports, but the tests ran reliably at 100ms.
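
One way to narrow the gap between the failing 10ms and the passing 100ms
could be a simple bisection over the delay. The sketch below is only an
assumption about how such an experiment might be structured: the names are
hypothetical, the predicate stands in for "run the unit tests with this
delay", and it assumes that any delay longer than a passing one also passes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical callback: returns true if the tests pass with this delay. */
typedef bool (*delay_ok_fn)(unsigned int delay_ms, void *ctx);

/* Bisect between a known-failing and a known-passing delay and return the
 * smallest passing delay in milliseconds. Assumes monotonic behaviour. */
unsigned int min_passing_delay_ms(unsigned int fail_ms, unsigned int pass_ms,
                                  delay_ok_fn ok, void *ctx)
{
    assert(fail_ms < pass_ms);
    while (pass_ms - fail_ms > 1) {
        unsigned int mid = fail_ms + (pass_ms - fail_ms) / 2;

        if (ok(mid, ctx))
            pass_ms = mid;
        else
            fail_ms = mid;
    }
    return pass_ms;
}

/* Stand-in predicate for illustration: passes once the delay reaches the
 * threshold pointed to by ctx. Real hardware results may be noisier. */
bool threshold_ok(unsigned int delay_ms, void *ctx)
{
    return delay_ms >= *(const unsigned int *)ctx;
}
```

Since real test failures are likely intermittent, each probe would probably
need several runs before being treated as a pass.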

OA configuration state isn't double buffered to allow configuration
while in use.

    Maybe this 100ms is the polling period and the HW does not allow changing
    the configuration in the middle of a polling session. In this case, this
    delay should be dependent on the polling frequency. But even then, I
    would really hope that the HW would allow us to tear down everything,
    reconfigure and start polling again without waiting for the next tick.
    If not possible, maybe we can change the frequency for the polling clock
    to make the polling event happen sooner.

The tests currently test periods from 160ns to 168 milliseconds while
the delay required falls somewhere between 10 and 100 milliseconds. I
think I'd expect the delay to be > all periods tested if this was the link.
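
For context, the two ends of that range line up with the Haswell OA timer
exponent, assuming the period is 80ns * 2^(exponent + 1) as described for
the OA exponent property in the uapi header. The helper below is a sketch
under that assumption; the name hsw_oa_period_ns is made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: periodic sampling interval in nanoseconds for a
 * given OA timer exponent, assuming the Haswell relationship
 * period = 80ns * 2^(exponent + 1). */
uint64_t hsw_oa_period_ns(unsigned int exponent)
{
    assert(exponent < 32);
    return 80ULL * (2ULL << exponent);
}
```

Exponent 0 gives 160 ns and exponent 20 gives 167,772,160 ns, roughly 168
milliseconds, so the required delay sits strictly inside the tested range
rather than above it.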

Thanks, definitely the kind of information that is valuable for 
understanding this issue!

Generally this seems unlikely to me, in part considering how the MUX
isn't really part of the OA unit that handles periodic sampling. I
wouldn't rule out some interaction though so some experimenting along
these lines could be interesting.

That indeed makes it less likely. Interactions increase the BOM!

    HW delays are usually a few microseconds, not milliseconds, that really
    suggests that something funny is happening and the HW design is not
    understood properly.

Yup.

Although I understand more about the HW than I can write up here, I
can't currently see why the HW should ever really take this long to
apply a MUX config - although I can see why some delay would be required.

It's on my list of things to try and get feedback/ideas on from the OA
architect/HW engineers. I brought this up briefly some time ago but we
didn't have time to go into details.

Sounds like a good idea!

    If the documentation has nothing on this and the HW teams cannot help,
    then I suggest a little REing session

There's no precisely documented delay requirement. Insofar as REing is
the process of inferring how black box HW works through poking it with a
stick and seeing how it reacts, then yep more of that may be necessary.
At least in this case the HW isn't really a black box (maybe stained
glass), where I hopefully have a fairly good sense of how the HW is
designed and can prod folks closer to the HW for feedback/ideas.

So far I haven't spent too long investigating this besides recently
homing in on needing a delay here when my unit tests were failing.

ACK! Thanks for the info!

    I really want to see this work land, but the way I see it right now is
    that we cannot rely on it because of this bug. Maybe fixing this bug
    would require changing the architecture, so better address it before
    landing the patches.

I think it's unlikely to change the architecture; rather we might find
some other things to frob that make the MUX config apply faster (e.g. a
clock gating issue), find a way to get explicit feedback of completion
so we can minimize the delay, or gain a better understanding that lets
us choose a shorter delay in most cases.

Yes, clock gating may be one issue here, even though it would be a funny 
hw design to clock gate the bus to a register...

The driver is already usable with gputop with this delay and considering
how config changes are typically associated with user interaction I
wouldn't see this as a show stopper - even though it's not ideal. I
think the assertions about it being unusable with GL, were a little
overstated based on making frequent OA config changes which is not
really how the interface is intended to be used.

Yeah, but with a performance warning in mesa, I would be OK with this change.
Thanks for taking the time to explain!

Thanks for starting to take a look through the code.

Kind Regards,
- Robert

Martin
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx