Re: [PATCH] drm/i915: Increase the busyspin durations for i915_wait_request

Sagar Arun Kamble <sagar.a.kamble@xxxxxxxxx> · Mon, 23 Oct 2017 14:35:59 +0530

On 9/15/2017 3:48 PM, Chris Wilson wrote:
Quoting Tvrtko Ursulin (2017-09-15 11:01:02)
On 14/09/2017 10:58, Chris Wilson wrote:
An interesting discussion regarding "hybrid interrupt polling" for NVMe
came to the conclusion that the ideal busyspin before sleeping was half
of the expected request latency (and better if it was already halfway
through that request). This suggested that we too should look again at
our tradeoff between spinning and waiting. Currently, our spin simply
tries to hide the cost of enabling the interrupt, which is good to avoid
penalising nop requests (i.e. test throughput) and not much else.
Studying real world workloads suggests that a spin of upto 500us can
What workloads and and power/perf testing?
Classic glxgears ;) People on the cc I trust to give a solid yay/nay.
Perf runs on SKL show improvements on some of the CL and media 
benchmarks without impacting power.
On APL, perf is not impacted but core power increase is observed. This 
observation is present w/ and w/o GuC.
More benefit of this approach with GuC can be observed by disabling 
RC6/CPG. I had tested xcompbench offscreen benchmark
with GuC on APL with this change and RC6 disabled and had observed 
substantial improvements.
dramatically boost performance, but the suggestion is that this is not
from avoiding interrupt latency per-se, but from secondary effects of
sleeping such as allowing the CPU reduce cstate and context switch away.
Maybe the second part of the sentence would be clearer if not in a way
in inverted form. Like longer spin = more performance = less sleeping =
less cstate switching? Or just add "but from _avoiding_ secondary
effects of sleeping"?

To offset those costs from penalising the active client, bump the initial
spin somewhat to 250us and the secondary spin to 20us to balance the cost
of another context switch following the interrupt.

Suggested-by: Sagar Kamble <sagar.a.kamble@xxxxxxxxx>
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Sagar Kamble <sagar.a.kamble@xxxxxxxxx>
Cc: Eero Tamminen <eero.t.tamminen@xxxxxxxxx>
Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
Cc: Ben Widawsky <ben@xxxxxxxxxxxx>
Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
The change looks good to me (w.r.t old drm-tip)
Reviewed-by: Sagar Arun Kamble <sagar.a.kamble@xxxxxxxxx>
---
   drivers/gpu/drm/i915/i915_gem_request.c | 25 +++++++++++++++++++++----
   1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 813a3b546d6e..ccbdaf6a0e4d 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1155,8 +1155,20 @@ long i915_wait_request(struct drm_i915_gem_request *req,
       GEM_BUG_ON(!intel_wait_has_seqno(&wait));
       GEM_BUG_ON(!i915_sw_fence_signaled(&req->submit));
   
-     /* Optimistic short spin before touching IRQs */
-     if (i915_spin_request(req, state, 5))
+     /* Optimistic short spin before touching IRQs.
So it's not short any more. "Optimistic busy spin" ?

+      *
+      * We use a rather large value here to offset the penalty of switching
+      * away from the active task. Frequently, the client will wait upon
+      * an old swapbuffer to throttle itself to remain within a frame of
+      * the gpu. If the client is running in lockstep with the gpu, then
+      * it should not be waiting long at all, and a sleep now will incur
+      * extra scheduler latency in producing the next frame. So we sleep
+      * for longer to try and keep the client running.
+      *
250us sounds quite long and worrying to me.
Yes. It's the sort of level where we are applying an artificial load to
the CPU to encourage turbo...

20us, I'm happier with; that does tally well with the latencies we can
measure from interrupt to process. 250us and we're well into secondary
effects that we should not be messing with (at least not like this).

Otoh, at 250us it might be interesting to play with monitor/mwait.
  
In the waiting on swapbuffer case, what are the clients waiting for? GPU
rendering to finish or previous vblank or something?
It's throttling the swap queue, so previous frame.
  
I am thinking if it would be possible to add a special API just for this
sort of waits and internally know how long it is likely to take. So then
decide based on that whether to spin or sleep. Like next vblank is
coming in 5ms, no point in busy spinning or something like that.
We have one; THROTTLE. For the user, it is too imprecise (we could
argument it with a timeout parameter, but...) as it doesn't wait for an
explicit event. GL ideally tries to ensure the render pipeline is full
but never more than a completed frame in advance. (That pipeline depth
could be specified by the client.) You've seen the same effect in wsim,
where if you just let a client fill up 20ms of batches, that can block
the GPU for seconds, but again a limit upon client batches enforces rr.
We've left it to the user to cooperate and balance their own pipeline;
the old scheduler enforced it at the ABI level, but it ideally just uses
a CFS-esque method, leaving the judgement of pipeline depth to the user.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx