Re: [PATCH 3/4] drm/msm/a5xx: fix races in preemption evaluation stage

Connor Abbott <cwabbott0@xxxxxxxxx> · Thu, 1 Aug 2024 16:57:22 +0100

On Thu, Aug 1, 2024 at 3:26 PM Vladimir Lypak <vladimir.lypak@xxxxxxxxx> wrote:
>
> On Thu, Aug 01, 2024 at 01:52:32PM +0100, Connor Abbott wrote:
> > On Thu, Aug 1, 2024 at 1:25 PM Vladimir Lypak <vladimir.lypak@xxxxxxxxx> wrote:
> > >
> > > On Mon, Jul 29, 2024 at 06:26:45PM +0100, Connor Abbott wrote:
> > > > On Thu, Jul 11, 2024 at 11:10 AM Vladimir Lypak
> > > > <vladimir.lypak@xxxxxxxxx> wrote:
> > > > >
> > > > > On A5XX GPUs when preemption is used it's invietable to enter a soft
> > > > > lock-up state in which GPU is stuck at empty ring-buffer doing nothing.
> > > > > This appears as full UI lockup and not detected as GPU hang (because
> > > > > it's not). This happens due to not triggering preemption when it was
> > > > > needed. Sometimes this state can be recovered by some new submit but
> > > > > generally it won't happen because applications are waiting for old
> > > > > submits to retire.
> > > > >
> > > > > One of the reasons why this happens is a race between a5xx_submit and
> > > > > a5xx_preempt_trigger called from IRQ during submit retire. Former thread
> > > > > updates ring->cur of previously empty and not current ring right after
> > > > > latter checks it for emptiness. Then both threads can just exit because
> > > > > for first one preempt_state wasn't NONE yet and for second one all rings
> > > > > appeared to be empty.
> > > > >
> > > > > To prevent such situations from happening we need to establish guarantee
> > > > > for preempt_trigger to be called after each submit. To implement it this
> > > > > patch adds trigger call at the end of a5xx_preempt_irq to re-check if we
> > > > > should switch to non-empty or higher priority ring. Also we find next
> > > > > ring in new preemption state "EVALUATE". If the thread that updated some
> > > > > ring with new submit sees this state it should wait until it passes.
> > > > >
> > > > > Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
> > > > > Signed-off-by: Vladimir Lypak <vladimir.lypak@xxxxxxxxx>
> > > > > ---
> > > > >  drivers/gpu/drm/msm/adreno/a5xx_gpu.c     |  6 +++---
> > > > >  drivers/gpu/drm/msm/adreno/a5xx_gpu.h     | 11 +++++++----
> > > > >  drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 24 +++++++++++++++++++----
> > > > >  3 files changed, 30 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > > > > index 6c80d3003966..266744ee1d5f 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > > > > @@ -110,7 +110,7 @@ static void a5xx_submit_in_rb(struct msm_gpu *gpu, struct msm_gem_submit *submit
> > > > >         }
> > > > >
> > > > >         a5xx_flush(gpu, ring, true);
> > > > > -       a5xx_preempt_trigger(gpu);
> > > > > +       a5xx_preempt_trigger(gpu, true);
> > > > >
> > > > >         /* we might not necessarily have a cmd from userspace to
> > > > >          * trigger an event to know that submit has completed, so
> > > > > @@ -240,7 +240,7 @@ static void a5xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
> > > > >         a5xx_flush(gpu, ring, false);
> > > > >
> > > > >         /* Check to see if we need to start preemption */
> > > > > -       a5xx_preempt_trigger(gpu);
> > > > > +       a5xx_preempt_trigger(gpu, true);
> > > > >  }
> > > > >
> > > > >  static const struct adreno_five_hwcg_regs {
> > > > > @@ -1296,7 +1296,7 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
> > > > >                 a5xx_gpmu_err_irq(gpu);
> > > > >
> > > > >         if (status & A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS) {
> > > > > -               a5xx_preempt_trigger(gpu);
> > > > > +               a5xx_preempt_trigger(gpu, false);
> > > > >                 msm_gpu_retire(gpu);
> > > > >         }
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
> > > > > index c7187bcc5e90..1120824853d4 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
> > > > > +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
> > > > > @@ -57,10 +57,12 @@ void a5xx_debugfs_init(struct msm_gpu *gpu, struct drm_minor *minor);
> > > > >   * through the process.
> > > > >   *
> > > > >   * PREEMPT_NONE - no preemption in progress.  Next state START.
> > > > > - * PREEMPT_START - The trigger is evaulating if preemption is possible. Next
> > > > > - * states: TRIGGERED, NONE
> > > > > + * PREEMPT_EVALUATE - The trigger is evaulating if preemption is possible. Next
> > > > > + * states: START, ABORT
> > > > >   * PREEMPT_ABORT - An intermediate state before moving back to NONE. Next
> > > > >   * state: NONE.
> > > > > + * PREEMPT_START - The trigger is preparing for preemption. Next state:
> > > > > + * TRIGGERED
> > > > >   * PREEMPT_TRIGGERED: A preemption has been executed on the hardware. Next
> > > > >   * states: FAULTED, PENDING
> > > > >   * PREEMPT_FAULTED: A preemption timed out (never completed). This will trigger
> > > > > @@ -71,8 +73,9 @@ void a5xx_debugfs_init(struct msm_gpu *gpu, struct drm_minor *minor);
> > > > >
> > > > >  enum preempt_state {
> > > > >         PREEMPT_NONE = 0,
> > > > > -       PREEMPT_START,
> > > > > +       PREEMPT_EVALUATE,
> > > > >         PREEMPT_ABORT,
> > > > > +       PREEMPT_START,
> > > > >         PREEMPT_TRIGGERED,
> > > > >         PREEMPT_FAULTED,
> > > > >         PREEMPT_PENDING,
> > > > > @@ -156,7 +159,7 @@ void a5xx_set_hwcg(struct msm_gpu *gpu, bool state);
> > > > >
> > > > >  void a5xx_preempt_init(struct msm_gpu *gpu);
> > > > >  void a5xx_preempt_hw_init(struct msm_gpu *gpu);
> > > > > -void a5xx_preempt_trigger(struct msm_gpu *gpu);
> > > > > +void a5xx_preempt_trigger(struct msm_gpu *gpu, bool new_submit);
> > > > >  void a5xx_preempt_irq(struct msm_gpu *gpu);
> > > > >  void a5xx_preempt_fini(struct msm_gpu *gpu);
> > > > >
> > > > > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > > > > index 67a8ef4adf6b..f8d09a83c5ae 100644
> > > > > --- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > > > > +++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > > > > @@ -87,21 +87,33 @@ static void a5xx_preempt_timer(struct timer_list *t)
> > > > >  }
> > > > >
> > > > >  /* Try to trigger a preemption switch */
> > > > > -void a5xx_preempt_trigger(struct msm_gpu *gpu)
> > > > > +void a5xx_preempt_trigger(struct msm_gpu *gpu, bool new_submit)
> > > > >  {
> > > > >         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> > > > >         struct a5xx_gpu *a5xx_gpu = to_a5xx_gpu(adreno_gpu);
> > > > >         unsigned long flags;
> > > > >         struct msm_ringbuffer *ring;
> > > > > +       enum preempt_state state;
> > > > >
> > > > >         if (gpu->nr_rings == 1)
> > > > >                 return;
> > > > >
> > > > >         /*
> > > > > -        * Try to start preemption by moving from NONE to START. If
> > > > > -        * unsuccessful, a preemption is already in flight
> > > > > +        * Try to start preemption by moving from NONE to EVALUATE. If current
> > > > > +        * state is EVALUATE/ABORT we can't just quit because then we can't
> > > > > +        * guarantee that preempt_trigger will be called after ring is updated
> > > > > +        * by new submit.
> > > > >          */
> > > > > -       if (!try_preempt_state(a5xx_gpu, PREEMPT_NONE, PREEMPT_START))
> > > > > +       state = atomic_cmpxchg(&a5xx_gpu->preempt_state, PREEMPT_NONE,
> > > > > +                              PREEMPT_EVALUATE);
> > > > > +       while (new_submit && (state == PREEMPT_EVALUATE ||
> > > > > +                             state == PREEMPT_ABORT)) {
> > > >
> > > > This isn't enough because even if new_submit is false then we may
> > > > still need to guarantee that evaluation happens. We've seen a hang in
> > > > a scenario like:
> > > >
> > > > 1. A job is submitted and executed on ring 0.
> > > > 2. A job is submitted on ring 2 while ring 0 is still active but
> > > > almost finished.
> > > > 3. The submission thread starts evaluating and sees that ring 0 is still busy.
> > > > 4. The job on ring 0 finishes and a CACHE_FLUSH IRQ is raised.
> > > > 5. The IRQ tries to trigger a preemption but the state is still
> > > > PREEMPT_EVALUATE or PREEMPT_ABORT and exits.
> > > > 6. The submission thread finishes update_wptr() and finally sets the
> > > > state to PREEMPT_NONE too late.
> > > >
> > > > Then we never preempt to ring 2 and there's a soft lockup.
> > >
> > > Thanks, i've missed that. It would need to always wait to prevent such
> > > scenario. The next patch prevented this from happening for me so i have
> > > overlooked it.
> > >
> > > Alternatively there is another approach which should perform better: to
> > > let evaluation stage run in parallel.
> > >
> > > Also i've tried serializing preemption handling on ordered workqueue and
> > > GPU kthread worker. It's a lot simpler but latency from IRQ doesn't look
> > > good:
> > >
> > >            flush-trigger    SW_IRQ-pending   flush_IRQ-trigger
> > >     uSecs    1    2    3       1    2    3       1    2    3
> > >      0-10 1515   43   65    4423   39   24     647    0    2
> > >     10-20 1484  453  103     446  414  309     399    1    1
> > >     20-40  827 1802  358      19  819  587       2   21    6
> > >     40-60    7 1264  397       1  368  329       0   30   14
> > >     60-80    4  311  115       0  181  178       0   24   12
> > >    80-120    2   36  251       0  250  188       0    9   13
> > >   120-160    0    4  244       0  176  248       0  226  150
> > >   160-200    0    1  278       0  221  235       0   86   78
> > >   200-400    0    2 1266       0 1318 1186       0  476  688
> > >   400-700    0    0  553       0  745 1028       0  150  106
> > >  700-1000    0    0  121       0  264  366       0   65   28
> > > 1000-1500    0    0   61       0  160  205       0   21    8
> > >     >2000    0    0   12       0   71   48       0    0    0
> > >
> > > 1 - current implementation but with evaluation in parallel.
> > > 2 - serialized on ordered workqueue.
> > > 3 - serialized on GPU kthread_worker.
> >
> > The problem with evaluating in parallel is that evaluating can race
> > with the rest of the process - it's possible for the thread evaluating
> > to go to be interrupted just before moving the state to PREEMPT_START
> > and in the meantime an entire preemption has happened and it's out of
> > date.
>
> Right. This gets complicated to fix for sure.
>
> >
> > What we did was to put a spinlock around the entire evaluation stage,
> > effectively replacing the busy loop and the EVALUATE stage. It unlocks
> > only after moving the state to NONE or knowing for certain that we're
> > starting a preemption. That should be lower latency than a workqueue
> > while the critical section shouldn't be that large (it's one atomic
> > operation and checking each ring), and in any case that's what the
> > spinning loop was doing anyway.
>
> Actually this is what i've done initially but i didn't want to introduce
> another spinlock.
>
> Another thing to consider is: to disable IRQs for entire trigger routine
> and use regular spin_lock instead so once it's decided to do a switch
> it won't get interrupted (when called from submit).

Isn't disabling IRQs and a regular spinlock the same as
spin_lock_irqsave()? We just used the latter, and it seems to work.
The pseudocode is something like:

spin_lock_irqsave();
if (!try_preempt_state(NONE, START)) { unlock(); return; }
if (!evaluate()) {
    set_preempt_state(ABORT);
    update_wptr();
    set_preempt_state(NONE);
    unlock();
    return;
}
unlock();
... // trigger the preemption

If the point is to have interrupts disabled for the entire time,
that's not necessary - once you unlock the state will be START and any
IRQ calling preempt_trigger() will immediately exit.

As for introducing another spinlock, this is no different than what
you're doing already with the spin loop. It's already basically a
(simple) spinlock. We're just replacing this with a real spinlock
that's simpler to reason about. However, if the spinlock is somehow
measurably slower than manually spinning, you can always keep it how
it is now with the EVALUATE state and manually add the irq
save/restore to make it safe to spin in the submit thread.

Connor

>
> Vladimir
>
> >
> > Connor
> >
> > >
> > > Vladimir
> > >
> > > >
> > > > Connor
> > > >
> > > > > +               cpu_relax();
> > > > > +               state = atomic_cmpxchg(&a5xx_gpu->preempt_state, PREEMPT_NONE,
> > > > > +                                      PREEMPT_EVALUATE);
> > > > > +       }
> > > > > +
> > > > > +       if (state != PREEMPT_NONE)
> > > > >                 return;
> > > > >
> > > > >         /* Get the next ring to preempt to */
> > > > > @@ -130,6 +142,8 @@ void a5xx_preempt_trigger(struct msm_gpu *gpu)
> > > > >                 return;
> > > > >         }
> > > > >
> > > > > +       set_preempt_state(a5xx_gpu, PREEMPT_START);
> > > > > +
> > > > >         /* Make sure the wptr doesn't update while we're in motion */
> > > > >         spin_lock_irqsave(&ring->preempt_lock, flags);
> > > > >         a5xx_gpu->preempt[ring->id]->wptr = get_wptr(ring);
> > > > > @@ -188,6 +202,8 @@ void a5xx_preempt_irq(struct msm_gpu *gpu)
> > > > >         update_wptr(gpu, a5xx_gpu->cur_ring);
> > > > >
> > > > >         set_preempt_state(a5xx_gpu, PREEMPT_NONE);
> > > > > +
> > > > > +       a5xx_preempt_trigger(gpu, false);
> > > > >  }
> > > > >
> > > > >  void a5xx_preempt_hw_init(struct msm_gpu *gpu)
> > > > > --
> > > > > 2.45.2
> > > > >