Re: [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Robert Bragg <robert@xxxxxxxxxxxxx> · Wed, 4 May 2016 10:49:53 +0100

On Wed, May 4, 2016 at 10:09 AM, Martin Peres <martin.peres@xxxxxxxxxxxxxxx> wrote:
On 03/05/16 23:03, Robert Bragg wrote:

On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert@xxxxxxxxxxxxx

<mailto:robert@xxxxxxxxxxxxx>> wrote:

    Sorry for the delay replying to this, I missed it.

    On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres@xxxxxxx

    <mailto:martin.peres@xxxxxxx>> wrote:

        On 20/04/16 17:23, Robert Bragg wrote:

            Gen graphics hardware can be set up to periodically write

            snapshots of

            performance counters into a circular buffer via its Observation

            Architecture and this patch exposes that capability to

            userspace via the

            i915 perf interface.

            Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx

            <mailto:chris@xxxxxxxxxxxxxxxxxx>>

            Signed-off-by: Robert Bragg <robert@xxxxxxxxxxxxx

            <mailto:robert@xxxxxxxxxxxxx>>

            Signed-off-by: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx

            <mailto:zhenyuw@xxxxxxxxxxxxxxx>>

            ---

              drivers/gpu/drm/i915/i915_drv.h         |  56 +-

              drivers/gpu/drm/i915/i915_gem_context.c |  24 +-

              drivers/gpu/drm/i915/i915_perf.c        | 940

            +++++++++++++++++++++++++++++++-

              drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++

              include/uapi/drm/i915_drm.h             |  70 ++-

              5 files changed, 1408 insertions(+), 20 deletions(-)

            +

            +

            +       /* It takes a fairly long time for a new MUX

            configuration to

            +        * be be applied after these register writes. This delay

            +        * duration was derived empirically based on the

            render_basic

            +        * config but hopefully it covers the maximum

            configuration

            +        * latency...

            +        */

            +       mdelay(100);

        With such a HW and SW design, how can we ever expose hope to get any

        kind of performance when we are trying to monitor different

        metrics on each

        draw call? This may be acceptable for system monitoring, but it

        is problematic

        for the GL extensions :s

        Since it seems like we are going for a perf API, it means that

        for every change

        of metrics, we need to flush the commands, wait for the GPU to

        be done, then

        program the new set of metrics via an IOCTL, wait 100 ms, and

        then we may

        resume rendering ... until the next change. We are talking about

        a latency of

        6-7 frames at 60 Hz here... this is non-negligeable...

        I understand that we have a ton of counters and we may hide

        latency by not

        allowing using more than half of the counters for every draw

        call or frame, but

        even then, this 100ms delay is killing this approach altogether.

So revisiting this to double check how things fail with my latest

driver/tests without the delay, I apparently can't reproduce test

failures without the delay any more...

I think the explanation is that since first adding the delay to the

driver I also made the the driver a bit more careful to not forward

spurious reports that look invalid due to a zeroed report id field, and

that mechanism keeps the unit tests happy, even though there are still

some number of invalid reports generated if we don't wait.

One problem with simply having no delay is that the driver prints an

error if it sees an invalid reports so I get a lot of 'Skipping

spurious, invalid OA report' dmesg spam. Also this was intended more as

a last resort mechanism, and I wouldn't feel too happy about squashing

the error message and potentially sweeping other error cases under the

carpet.

Experimenting to see if the delay can at least be reduced, I brought the

delay up in millisecond increments and found that although I still see a

lot of spurious reports only waiting 1 or 5 milliseconds, at 10

milliseconds its reduced quite a bit and at 15 milliseconds I don't seem

to have any errors.

15 milliseconds is still a long time, but at least not as long as 100.

OK, so the issue does not come from the HW after all, great!

Erm, I'm not sure that's a conclusion we can make here...

The upshot here was really just reducing the delay from 100ms to 15ms. Previously I arrived at a workable delay by jumping the delay in orders of magnitude with 10ms failing, 100ms passing and I didn't try and refine it further. Here I've looked at delays between 10 and 100ms.

The other thing is observing that because the kernel is checking for invalid reports (generated by the hardware) before forwarding to userspace the lack of a delay no longer triggers i-g-t failures because the invalid data won't reach i-g-t any more - though the invalid reports are still a thing to avoid.

Now, my main question is, why are spurious events generated when changing the MUX's value? I can understand that we would need to ignore the reading that came right after the change, but other than this,  I am a bit at a loss.

The MUX selects 16 signals that the OA unit can turn into 16 separate counters by basically counting the signal changes. (there's some fancy fixed function logic that can affect this but that's the general idea).

If the MUX is in the middle of being re-programmed then some subset of those 16 signals are for who knows what.

After programming the MUX we will go on to configure the OA unit and the tests will enable periodic sampling which (if we have no delay) will sample the OA counters that are currently being fed by undefined signals.

So as far as that goes it makes sense to me to expect bad data if we don't wait for the MUX config to land properly. Something I don't really know is how come we're seemingly lucky to have the reports be cleanly invalid with a zero report-id, instead of just having junk data that would be harder to recognise.

I am a bit swamped with other tasks right now, but I would love to spend more time reviewing your code as I really want to see this upstream!

No worries.

I can hopefully send out my i-g-t tests this afternoon too which should hopefully give us all the pieces to be able seriously consider hopefully landing things soon.

Regards,
- Robert

Martin

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx