Quoting Chris Wilson (2020-07-15 15:47:15)
> Quoting Tvrtko Ursulin (2020-07-15 13:26:23)
> >
> > On 15/07/2020 13:06, Tvrtko Ursulin wrote:
> > >
> > > On 15/07/2020 11:50, Chris Wilson wrote:
> > >> Currently, we use i915_request_completed() directly in
> > >> i915_request_wait() and follow up with a manual invocation of
> > >> dma_fence_signal(). This appears to cause a large number of contentions
> > >> on i915_request.lock as when the process is woken up after the fence is
> > >> signaled by an interrupt, we will then try and call dma_fence_signal()
> > >> ourselves while the signaler is still holding the lock.
> > >> dma_fence_is_signaled() has the benefit of checking the
> > >> DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
> > >> avoids most of that contention.
> > >>
> > >> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > >> Cc: Matthew Auld <matthew.auld@xxxxxxxxx>
> > >> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx>
> > >> ---
> > >>  drivers/gpu/drm/i915/i915_request.c | 12 ++++--------
> > >>  1 file changed, 4 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> > >> index 0b2fe55e6194..bb4eb1a8780e 100644
> > >> --- a/drivers/gpu/drm/i915/i915_request.c
> > >> +++ b/drivers/gpu/drm/i915/i915_request.c
> > >> @@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, unsigned int cpu)
> > >>  	return this_cpu != cpu;
> > >>  }
> > >>
> > >> -static bool __i915_spin_request(const struct i915_request * const rq, int state)
> > >> +static bool __i915_spin_request(struct i915_request * const rq, int state)
> > >>  {
> > >>  	unsigned long timeout_ns;
> > >>  	unsigned int cpu;
> > >> @@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct i915_request * const rq, int state)
> > >>  	timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
> > >>  	timeout_ns += local_clock_ns(&cpu);
> > >>  	do {
> > >> -		if (i915_request_completed(rq))
> > >> +		if (dma_fence_is_signaled(&rq->fence))
> > >>  			return true;
> > >>
> > >>  		if (signal_pending_state(state, current))
> > >> @@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
> > >>  	 * duration, which we currently lack.
> > >>  	 */
> > >>  	if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
> > >> -	    __i915_spin_request(rq, state)) {
> > >> -		dma_fence_signal(&rq->fence);
> > >> +	    __i915_spin_request(rq, state))
> > >>  		goto out;
> > >> -	}
> > >>
> > >>  	/*
> > >>  	 * This client is about to stall waiting for the GPU. In many cases
> > >> @@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
> > >>  	for (;;) {
> > >>  		set_current_state(state);
> > >>
> > >> -		if (i915_request_completed(rq)) {
> > >> -			dma_fence_signal(&rq->fence);
> > >> +		if (dma_fence_is_signaled(&rq->fence))
> > >>  			break;
> > >> -		}
> > >>
> > >>  		intel_engine_flush_submission(rq->engine);
> > >>
> > >
> > > In other words putting some latency back into the waiters, which is
> > > probably okay, since sync waits is not our primary model.
> > >
> > > I have a slight concern about the remaining value of busy spinning if
> > > i915_request_completed check is removed from there as well. Of course it
> > > doesn't make sense to have different completion criteria between the
> > > two.. We could wait a bit longer if real check in busyspin said request
> > > is actually completed, just not signal it but wait for the breadcrumbs
> > > to do it.
> >
> > What a load of nonsense.. :)
> >
> > Okay, I think the only real question is i915_request_completed vs
> > dma_fence_signaled in __i915_spin_request. Do we want to burn CPU cycles
> > waiting on GPU and breadcrumb irq work, or just the GPU.
>
> 	dma_fence_is_signaled() {
> 		if (test_bit(SIGNALED_BIT))
> 			return true;
>
> 		if (i915_request_completed()) {
> 			dma_fence_signal();
> 			return true;
> 		}
>
> 		return false;
> 	}
>
> with the indirection. So the question is whether the indirection is
> worth the extra test bit.
Just purely looking at the i915_request.lock > contention suggests that it probably is. For the spinner, burning a few > extra CPU cycles for *vfunc is not an issue, it's the wakeup latency, > and since we are calling dma_fence_signal() upon wakeup we do take the > spinlock without checking for an early return from the SIGNALED_BIT. > So I think it's a net positive. The alternative was to write > > if (i915_request_completed()) { > if (!i915_request_is_signaled()) > dma_fence_signal(); > break; > } > > but > > if (dma_fence_is_signaled()) > break; > > does appear simpler, if only by virtue of hiding the details in an > inline. Or a patch to add test_bit() to dma_fence_signal, but this looked better. -Chris _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx