Re: [PATCH] drm/i915/pmu: Do not assume fixed hrtimer period

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 5 Jun 2018 12:35:36 +0100

On 05/06/2018 10:06, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-06-05 09:51:01)

On 04/06/2018 16:11, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2018-06-04 16:01:26)

On 04/06/2018 14:03, Chris Wilson wrote:
Quoting Chris Wilson (2018-06-04 13:59:12)
Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>

As Chris has discovered on his Ivybridge, and later automated test runs
have confirmed, on most of our platforms hrtimer faced with heavy GPU load
ca occasionally become sufficiently imprecise to affect PMU sampling

s/ca/can/

calculations.

This means we cannot assume sampling frequency is what we asked for, but
we need to measure the interval ourselves.

This patch is similar to Chris' original proposal for per-engine counters,
but instead of introducing a new set to work around the problem with
frequency sampling, it swaps around the way internal frequency accounting
is done. Instead of accumulating current frequency and dividing by
sampling frequency on readout, it accumulates frequency scaled by each
period.

My ulterior motive failed to win favour ;)
    
Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Suggested-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

I should point out that even with this fix (or rather my patch), we can

I did not mean to steal it,

Don't mind, I view such patches as "this here is a problem, and what I
think can be done". I'm quite happy to be told "nope, the problem
is..."; the end result is we fix and move on.

just that you seemed very uncompromising on
the new counters approach. If you prefer you can respin your patch with
this approach for frequency counters, it is fine by me.

I'm just not convinced yet by the frequency sampler, and I still feel
that we would be better off with cycles. I just haven't found a
convincing argument ;)

Just imagine .unit file has Mcycles instead of MHz in it? :)

Since we went with this when adding it initially I think we should
exercise restrain with deprecating and adding so quickly.

still see errors of 25-30us, enough to fail the test. However, without
the fix our errors can be orders of magnitude worse (e.g. reporting 80us
busy instead of 500us).

Yeah I guess if the timer is delayed it is delayed and all samples just
won't be there. I don't see a way to fix that elegantly. Even practically.

Right, but it's not the case that we aren't sampling. If the timer was
delayed entirely, then we would see 0 (start_val == end_val). I'm not
sure exactly what goes wrong, but it probably is related to the timer
being unreliable. (It's just that in this case we are sampling a square
wave, so really hard to mess up!)

Hm maybe I am not completely understanding what you are seeing. I
thought that multiple timer overruns (aka delay by multiple periods)
explains everything.

Then even with this patch, if they happen to happen (!) at the end of a
sampling period (as observed from userspace), the large-ish chunk of
samples (depending on the total sampling duration) may be missing
(pending), and so still observed as fail.

Ah, but with the variable period timer, we can force the timer to run
before we report the same (perf_event_read). That should still be inside
the busy batch, and it still has the same inaccuracy (~28us, for a small
sample population, measuring the tail is on my wishlist):

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 34cc6f6..8578a43 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
                         val = engine->pmu.sample[sample].cur;
                 }
         } else {
+               if (hrtimer_cancel(&i915->pmu.timer)) {
+                       i915_sample(&i915->pmu.timer);
+                       hrtimer_start_expires(&i915->pmu.timer,
+                                             HRTIMER_MODE_ABS);
+               }
+
                 switch (event->attr.config) {
                 case I915_PMU_ACTUAL_FREQUENCY:
                         val =

I haven't thought enough about why it still has the tail, nor measured
the distribution of errors. I just wanted to warn that we aren't quite
out of the mess yet, but with the variable period we do at least stop
drowning!
  
Difference after this patch is that when the timer eventually fires the
counter will account for the delay, so subsequent queries will be fine.

If readout waited for at least one real sampling tick, that would be
solved, but by several other undesirable consequences. At least it would
slow down readout to 200Hz, but I don't know from the top of my head if
it wouldn't cause unresolvable deadlock scenarios, or if it is even
technically possible within the perf framework. (Since I don't think we
should do it, I don't even want to start thinking about it.)

The above should do what you have in mind, I think.

I think you got the direct sample criteria reversed - it should be if 
the timer wasn't running. If it was, it means it just exited so the 
sampling is already pretty correct.

Also I was thinking to limit this to only when the timer was delayed so 
to avoid many readers messing with a single timer alot. In other words 
we keep accepting to sampling error (0.5% I think for steady input 
signal) and only improve the timer delay scenario.

I'll send a patch to discuss on in more detail.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx