Re: [PATCH 4/7] drm/i915/perf: lock powergating configuration to default when active

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Wed, 12 Sep 2018 09:03:27 +0100

On 11/09/2018 21:11, Lionel Landwerlin wrote:
On 10/09/2018 14:44, Tvrtko Ursulin wrote:

On 07/09/2018 10:55, Lionel Landwerlin wrote:
On 07/09/2018 10:39, Tvrtko Ursulin wrote:

On 07/09/2018 10:23, Lionel Landwerlin wrote:
On 07/09/2018 09:26, Tvrtko Ursulin wrote:

On 06/09/2018 11:36, Lionel Landwerlin wrote:
On 06/09/2018 11:22, Chris Wilson wrote:
Quoting Lionel Landwerlin (2018-09-06 11:18:01)
On 06/09/2018 11:10, Chris Wilson wrote:
Quoting Lionel Landwerlin (2018-09-06 10:57:47)
On 05/09/2018 15:22, Tvrtko Ursulin wrote:
From: Lionel Landwerlin <lionel.g.landwerlin@xxxxxxxxx>

If some of the contexts submitting workloads to the GPU have 
been
configured to shutdown slices/subslices, we might loose the NOA
configurations written in the NOA muxes.

One possible solution to this problem is to reprogram the 
NOA muxes
when we switch to a new context. We initially tried this in the
workaround batchbuffer but some concerns where raised about 
the cost
of reprogramming at every context switch. This solution is 
also not
without consequences from the userspace point of view. 
Reprogramming
of the muxes can only happen once the powergating 
configuration has
changed (which happens after context switch). This means for 
a window
of time during the recording, counters recorded by the OA 
unit might
be invalid. This requires userspace dealing with OA reports 
to discard
the invalid values.

Minimizing the reprogramming could be implemented by 
tracking of the
last programmed configuration somewhere in GGTT and use 
MI_PREDICATE
to discard some of the programming commands, but the command 
streamer
would still have to parse all the MI_LRI instructions in the
workaround batchbuffer.

Another solution, which this change implements, is to simply 
disregard
the user requested configuration for the period of time when 
i915/perf
is active. There is no known issue with this apart from a 
performance
penality for some media workloads that benefit from running 
on a
partially powergated GPU. We already prevent RC6 from 
affecting the
programming so it doesn't sound completely unreasonable to 
hold on
powergating for the same reason.

v2: Leave RPCS programming in intel_lrc.c (Lionel)

v3: Update for s/union intel_sseu/struct intel_sseu/ (Lionel)
       More to_intel_context() (Tvrtko)
       s/dev_priv/i915/ (Tvrtko)

Tvrtko Ursulin:

v4:
    * Rebase for make_rpcs changes.

v5:
    * Apply OA restriction from make_rpcs directly.

v6:
    * Rebase for context image setup changes.

Signed-off-by: Lionel Landwerlin 
<lionel.g.landwerlin@xxxxxxxxx>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
---
    drivers/gpu/drm/i915/i915_perf.c |  5 +++++
    drivers/gpu/drm/i915/intel_lrc.c | 30 
++++++++++++++++++++----------
    drivers/gpu/drm/i915/intel_lrc.h |  3 +++
    3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c
index ccb20230df2c..dd65b72bddd4 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1677,6 +1677,11 @@ static void 
gen8_update_reg_state_unlocked(struct i915_gem_context *ctx,
                CTX_REG(reg_state, state_offset, 
flex_regs[i], value);
        }
+
+     CTX_REG(reg_state, CTX_R_PWR_CLK_STATE, 
GEN8_R_PWR_CLK_STATE,
+             gen8_make_rpcs(dev_priv,
+ &to_intel_context(ctx,
+ dev_priv->engine[RCS])->sseu));
I think there is one issue I missed on the previous 
iterations of this
patch.

This gen8_update_reg_state_unlocked() is called when the GPU 
is parked
on the kernel context.

It's supposed to update all contexts, but I think we might 
not be able
to update the kernel context image while the GPU is using it.
The kernel context is only ever taken in extremis (you are either
parking or stalling userspace) so I don't care.

The patch exposing the RPCS configuration to userspace will 
make use of
the kernel context while OA/perf is enabled. Even if it 
reprograms the
locked value that will break the power configuration stability 
on Gen11
(because the locked configuration will be different from the 
kernel
context configuration).
Sure, but as you point out that's only on changing configuration.

What's missing in the patch is that we only bail early if the 
new sseu
matches the ce->sseu, but that doesn't necessarily match whats 
in the
context due to OA. (Or maybe I missed the conversion to rpcs 
value and
checking.)
-Chris


Yep, because the gen8_make_rpcs() post processes the values store 
at the gem context level, we risk rerunning the kernel context to 
write the exiting value.
Sorry this is all so messy :(

Lets see if I managed to follow here.

The current code indeed bails out at the set ctx param level if 
the requested state matches the ce->state. My thinking was that 
ce->state is the master state and whatever happens in "post 
processing" via gen8_make_rpcs should be hidden from it since the 
design is that the i915_perf.c will re-configure all contexts when 
the OA active status changes (to either direction).

So I don't see a problem in those two interactions.


Let's say you have contextA with sseu(slice,subslice)=(0x1/0xff) 
for ICL.

You then enable OA which locks the configuration at (0x1,0xf).

The kernel context has retained its (0x1/0xff) configuration.


And after you change the config of contextA to (0x1,0x7).


This would lead to the kernel context scheduled with (0x1,0xff) 
while OA is active.

Okay that's a problem discussed in the paragraph below - that the 
kernel context is not updated at all. But is it a problem for OA? 
Will it mess up some counters even if kernel context isn't executing 
anything interacting with them? Or is it?


What the HW is going to do to the NOA logic in power configuration 
changes is not really documented.
Experimentally on SKL GT4, it seems a change in power configuration 
will trigger a power off of everything before applying the power at 
the new configuration.
So that would imply loosing the NOA programming when we switch to the 
kernel context which means invalid values in the counters.





Apart from one, get_param_sseu will lie a bit - we can discuss 
about this one more. At one point I suggested we have two sets of 
masks in the uAPI, requested and active in a way. So userspace 
could query what it set and what is actually active.

Now second issue is if i915_perf.c is able to reprogram the kernel 
config.

Here its true, it will write to the context image and that will 
get overwritten by context save.

If that is a problem for OA, I was initially if a throw-away 
second "kernel" context could be use to re-program the real one, 
but perhaps even simpler - what about a mmio write to program the 
RPCS while kernel context is active?


Documentation says : "This register must not be programmed directly 
through CPU MMIO cycle."


Sorry :(

Ugh.. okay, help me understand if kernel context absolutely needs to 
follow the "lock" for OA to work and then we'll see what to do.

I think so.

Your idea of a throw away context to reprogramming every seems sound.

I was in the middle of refactoring the series to implement this when I 
started suspecting the need for it.

Premise of the problem statement was the kernel context doesn't get 
updated by the OA code, but, there is a loop in 
gen8_configure_all_contexts which goes through exactly all of them 
after it has idled the GPU.

AFAICS that means it is able to map the kernel context state and edit 
it so everything seems fine from this angle. (Kernel context is 
perma-pinned in software, but after idling the GPU we know it is not 
actually on the GPU so it safe to edit it's image.)

Am I missing some hole here, and if not, why does the code needs to 
have a primary update via a request in 
gen8_switch_to_updated_kernel_context? Context image after idling 
seems again sufficient.


My understanding is that the current code puts the GPU idle on the 
kernel context.
When idling stops, the GPU will save the last values from the HW 
register into the kernel context image.
We currently don't see any issue because we also run a set of commands 
on the kernel context prior to idle that mirror the contexts image edition.

We actually retire everything, and the kernel context, wait for any 
pending interrupts (context save) and check that CS reports idle. So I 
think it is actually simpler than the OA code currently does it. I just 
wanted a way to express it as a test.

That won't be the case for the RPCS register because we can't load it 
from the command streamer.


Would it be possible to exercise the hypothetical loss of NOA 
configuration from an IGT, if kernel context wasn't correctly updated?


The test configs saved in the kernel use NOA and their counters are 
progressing at different ratios from the "core clock" (RING_TIMESTAMP 
register).
In IGT tests/perf.c, the gen8_sanity_check_test_oa_reports() verify that 
the counters progress as expected.

But I can't tell from what part of the GT the test configs are sourcing 
signals from. If they do from unslice, then that won't work :(

So not testable? Shall we still apply the doctrine of "if in doubt rip 
it out"? Or too risky? Shouldn't be.. looks pretty obvious we check 
properly for idle. If we simplify the worst that can happen is that OA 
becomes glitchy paired with Gen11 media workloads which is not so 
critical, but I don't think it will break to start with.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx