On Thu, Mar 17, 2016 at 12:59:46PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
>
> By reading the CSB (slow MMIO accesses) into a temporary local
> buffer we can decrease the duration of holding the execlist
> lock.
>
> The main advantage is that during heavy batch buffer submission we
> reduce the execlist lock contention, which should decrease the
> latency and CPU usage between the submitting userspace process
> and interrupt handling.
>
> The downside is that we need to grab and release the forcewake twice,
> but as the numbers below show this is completely hidden
> by the primary gains.
>
> Testing with "gem_latency -n 100" (submit batch buffers with a
> hundred nops each) shows more than doubling of the throughput
> and more than halving of the dispatch latency, overall latency
> and CPU time spent in the submitting process.
>
> Submitting empty batches ("gem_latency -n 0") does not seem
> significantly affected by this change, with throughput and CPU
> time improving by half a percent, and overall latency worsening
> by the same amount.
>
> The above tests were done across a hundred runs on a big core Broadwell.
>
> v2:
>  * Overflow protection for the local CSB buffer.
>  * Use the closer dev_priv in execlists_submit_requests. (Chris Wilson)
>
> v3: Rebase.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 77 ++++++++++++++++++++--------------------
>  1 file changed, 38 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index f72782200226..c6b3a9d19af2 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -416,15 +416,23 @@ static void execlists_update_context(struct drm_i915_gem_request *rq)
>  static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
>  				      struct drm_i915_gem_request *rq1)
>  {
> +	struct drm_i915_private *dev_priv = rq0->i915;
> +
>  	execlists_update_context(rq0);
>
>  	if (rq1)
>  		execlists_update_context(rq1);
>
> +	spin_lock(&dev_priv->uncore.lock);
> +	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);

My only problem with this is that it puts the onus on the caller to
remember to disable irqs first. (That would be a silent conflict with
the patch to move the submission to a kthread, for example.) What's the
cost of being upfront with spin_lock_irqsave() here?

/* BUG_ON(!irqs_disabled()); */ ?

I'm quite happy to go with the comment here, and since it doesn't make
the lockups any worse,
Reviewed-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
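
For context, the approach the commit message describes — draining the CSB
over slow MMIO into a stack-local copy before taking the execlist lock —
has roughly the following shape. This is an untested sketch rather than the
actual hunk from the patch: csb_entry_pending(), read_csb_entry() and
process_csb_entry() are hypothetical stand-ins for the real register reads
and request bookkeeping, and GEN8_CSB_ENTRIES is assumed to be the number
of CSB slots.

/*
 * Untested sketch, not the actual patch: copy the context status buffer
 * (slow MMIO reads) into a local buffer with forcewake held but without
 * the execlist lock, then take the lock only for the cheap processing.
 */
static void lrc_irq_handler_sketch(struct intel_engine_cs *engine,
				   struct drm_i915_private *dev_priv)
{
	u32 csb[GEN8_CSB_ENTRIES][2];
	unsigned int i, count = 0;

	/* Slow MMIO reads happen outside the execlist lock. */
	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
	while (count < GEN8_CSB_ENTRIES && csb_entry_pending(engine))
		read_csb_entry(engine, csb[count++]);	/* hypothetical helper */
	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);

	/* Only the processing of the local copy runs under the lock. */
	spin_lock(&engine->execlist_lock);
	for (i = 0; i < count; i++)
		process_csb_entry(engine, csb[i]);	/* hypothetical helper */
	spin_unlock(&engine->execlist_lock);
}

The extra forcewake get/put pair is the cost the commit message refers to;
the MMIO reads themselves move out of the locked section.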
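
And the spin_lock_irqsave() question made concrete — again only a sketch:
the ELSP writes between the forcewake get/put are elided as they are in the
quoted hunk, and intel_uncore_forcewake_put__locked() is assumed to be the
counterpart of the __locked get shown there.

static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
				      struct drm_i915_gem_request *rq1)
{
	struct drm_i915_private *dev_priv = rq0->i915;
	unsigned long flags;

	execlists_update_context(rq0);

	if (rq1)
		execlists_update_context(rq1);

	/*
	 * Safe from any context; the alternative is to keep plain
	 * spin_lock() and assert the precondition with something like
	 * WARN_ON(!irqs_disabled()).
	 */
	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
	intel_uncore_forcewake_get__locked(dev_priv, FORCEWAKE_ALL);

	/* ... ELSP port writes ... */

	intel_uncore_forcewake_put__locked(dev_priv, FORCEWAKE_ALL);
	spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);
}

The trade-off is a small amount of extra work saving and restoring the IRQ
flags on every submission, versus keeping plain spin_lock() and merely
documenting (or asserting) that callers must already have interrupts
disabled.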