Re: [RFC 4/4] drm/i915: Expose RPCS (SSEU) configuration to userspace

On 05/02/2017 07:55 PM, Chris Wilson wrote:
> On Tue, May 02, 2017 at 10:33:19AM +0000, Oscar Mateo wrote:

>> On 05/02/2017 11:49 AM, Chris Wilson wrote:
>>> We want to allow userspace to reconfigure the subslice configuration for
>>> its own use case. To do so, we expose a context parameter to allow
>>> adjustment of the RPCS register stored within the context image (and
>>> currently not accessible via LRI).
>> Userspace could also do this by themselves via LRI if we simply
>> whitelist GEN8_R_PWR_CLK_STATE.

>> Hardware people suggested this programming model:
>>
>> - PIPE_CONTROL - stalling flush, flush all caches (color, depth, DC$)
>> - LOAD_REGISTER_IMMEDIATE - R_PWR_CLK_STATE
>> - Reprogram complete state
> Hmm, treating it as a complete state wipe is a nuisance, but fairly
> trivial. The simplest way will be for the user to execute the LRI batch
> as part of creating the context. But there will be some use cases where
> dynamic reconfiguration within an active context will be desired, I'm
> sure.

Exactly, in this way the UMD gets the best of both worlds: they can do the LRI once and forget about it, or they can reconfigure on-demand.

>>> If the context is adjusted before
>>> first use, the adjustment is for "free"; otherwise if the context is
>>> active we flush the context off the GPU (stalling all users), forcing
>>> the GPU to save the context to memory where we can modify it and so
>>> ensure that the register is reloaded on next execution.
>> There is another cost associated with the adjustment: slice poweron
>> and shutdown do take some time to happen (in the order of tens of
>> usecs). I have been playing with an i-g-t benchmark to measure this
>> delay, I'll send it to the mailing list.
> Hmm, I thought the argument for why selecting smaller subslices gave
> better performance was that it was restoring the whole set between
> contexts, even when the configuration between contexts was the same.

Hmmm... it's the first time I hear that particular argument. I can definitely
see the delay when changing the configuration (also, powering slices on takes
a little longer than shutting them down), but I see no difference when simply
switching between contexts with the same configuration. Until now, the most
convincing argument I've heard is that thread scheduling is much more
efficient with just one slice when you don't really need more, but maybe that
doesn't explain the whole picture.

> As always numbers demonstrating the advantage, perhaps explaining why
> it helps, and also for spotting when we break it are most welcome :)
> -Chris

I can provide numbers for the slice reconfiguration delay (numbers the UMD
has to take into account when deciding which configuration to use), but I
think Dimitry is in a better position to provide numbers for the advantage.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



