Patch "drm/msm/a5xx: fix races in preemption evaluation stage" has been added to the 4.19-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Mon, 30 Sep 2024 20:44:57 -0400

This is a note to let you know that I've just added the patch titled

    drm/msm/a5xx: fix races in preemption evaluation stage

to the 4.19-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     drm-msm-a5xx-fix-races-in-preemption-evaluation-stag.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 832db2841955830c1ed6108d670b8dad1b5f7f0f
Author: Vladimir Lypak <vladimir.lypak@xxxxxxxxx>
Date:   Sun Sep 1 13:54:02 2024 +0000

    drm/msm/a5xx: fix races in preemption evaluation stage
    
    [ Upstream commit ce050f307ad93bcc5958d0dd35fc276fd394d274 ]
    
    On A5XX GPUs when preemption is used it's invietable to enter a soft
    lock-up state in which GPU is stuck at empty ring-buffer doing nothing.
    This appears as full UI lockup and not detected as GPU hang (because
    it's not). This happens due to not triggering preemption when it was
    needed. Sometimes this state can be recovered by some new submit but
    generally it won't happen because applications are waiting for old
    submits to retire.
    
    One of the reasons why this happens is a race between a5xx_submit and
    a5xx_preempt_trigger called from IRQ during submit retire. Former thread
    updates ring->cur of previously empty and not current ring right after
    latter checks it for emptiness. Then both threads can just exit because
    for first one preempt_state wasn't NONE yet and for second one all rings
    appeared to be empty.
    
    To prevent such situations from happening we need to establish guarantee
    for preempt_trigger to make decision after each submit or retire. To
    implement this we serialize preemption initiation using spinlock. If
    switch is already in progress we need to re-trigger preemption when it
    finishes.
    
    Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
    Signed-off-by: Vladimir Lypak <vladimir.lypak@xxxxxxxxx>
    Patchwork: https://patchwork.freedesktop.org/patch/612045/
    Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
index 7d71860c4bee6..c9b1da1517dc2 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.h
@@ -44,6 +44,7 @@ struct a5xx_gpu {
 	uint64_t preempt_iova[MSM_GPU_MAX_RINGS];
 
 	atomic_t preempt_state;
+	spinlock_t preempt_start_lock;
 	struct timer_list preempt_timer;
 };
 
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
index d6dc4168558e0..63445e88f8adc 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
@@ -107,12 +107,19 @@ void a5xx_preempt_trigger(struct msm_gpu *gpu)
 	if (gpu->nr_rings == 1)
 		return;
 
+	/*
+	 * Serialize preemption start to ensure that we always make
+	 * decision on latest state. Otherwise we can get stuck in
+	 * lower priority or empty ring.
+	 */
+	spin_lock_irqsave(&a5xx_gpu->preempt_start_lock, flags);
+
 	/*
 	 * Try to start preemption by moving from NONE to START. If
 	 * unsuccessful, a preemption is already in flight
 	 */
 	if (!try_preempt_state(a5xx_gpu, PREEMPT_NONE, PREEMPT_START))
-		return;
+		goto out;
 
 	/* Get the next ring to preempt to */
 	ring = get_next_ring(gpu);
@@ -137,9 +144,11 @@ void a5xx_preempt_trigger(struct msm_gpu *gpu)
 		set_preempt_state(a5xx_gpu, PREEMPT_ABORT);
 		update_wptr(gpu, a5xx_gpu->cur_ring);
 		set_preempt_state(a5xx_gpu, PREEMPT_NONE);
-		return;
+		goto out;
 	}
 
+	spin_unlock_irqrestore(&a5xx_gpu->preempt_start_lock, flags);
+
 	/* Make sure the wptr doesn't update while we're in motion */
 	spin_lock_irqsave(&ring->lock, flags);
 	a5xx_gpu->preempt[ring->id]->wptr = get_wptr(ring);
@@ -163,6 +172,10 @@ void a5xx_preempt_trigger(struct msm_gpu *gpu)
 
 	/* And actually start the preemption */
 	gpu_write(gpu, REG_A5XX_CP_CONTEXT_SWITCH_CNTL, 1);
+	return;
+
+out:
+	spin_unlock_irqrestore(&a5xx_gpu->preempt_start_lock, flags);
 }
 
 void a5xx_preempt_irq(struct msm_gpu *gpu)
@@ -200,6 +213,12 @@ void a5xx_preempt_irq(struct msm_gpu *gpu)
 	update_wptr(gpu, a5xx_gpu->cur_ring);
 
 	set_preempt_state(a5xx_gpu, PREEMPT_NONE);
+
+	/*
+	 * Try to trigger preemption again in case there was a submit or
+	 * retire during ring switch
+	 */
+	a5xx_preempt_trigger(gpu);
 }
 
 void a5xx_preempt_hw_init(struct msm_gpu *gpu)
@@ -302,5 +321,6 @@ void a5xx_preempt_init(struct msm_gpu *gpu)
 		}
 	}
 
+	spin_lock_init(&a5xx_gpu->preempt_start_lock);
 	timer_setup(&a5xx_gpu->preempt_timer, a5xx_preempt_timer, 0);
 }