Re: [PATCH 5/5] drm/msm/dpu: rate limit snapshot capture for mmu faults

Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> · Tue, 16 Jul 2024 14:25:54 -0700

On 7/1/2024 12:43 PM, Dmitry Baryshkov wrote:
On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
There is no recovery mechanism in place yet to recover from mmu
faults for DPU. We can only prevent the faults by making sure there
is no misconfiguration.

Rate-limit the snapshot capture for mmu faults to once per
msm_kms_init_aspace(), as that should be sufficient to capture
the snapshot for debugging. Otherwise, there will be a lot of
DPU snapshots captured for the same fault, which is redundant
and might also affect capturing even one snapshot accurately.

Please squash this into the first patch. There is no need to add code
with a known deficiency.


Sure, will squash it.

Also, is there a reason why you haven't used <linux/ratelimit.h> ?


There is really no interval that I can conclude is safe here. In
fact, rate-limit is probably not the right term for this.
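
To illustrate the interval problem, here is a simplified userspace model
of a window-based limiter with an interval and a burst count. It is only
a sketch: struct window_limiter, limiter_allow() and count_snapshots()
are hypothetical names, time is passed in explicitly, and this is not
the <linux/ratelimit.h> implementation. It assumes one fault per
millisecond.

```c
#include <stdbool.h>

/* Hypothetical model of an interval/burst limiter (not the kernel's code). */
struct window_limiter {
	unsigned long interval_ms;	/* length of one window */
	unsigned int burst;		/* events allowed per window */
	unsigned long window_start_ms;	/* start of the current window */
	unsigned int used;		/* events already allowed in this window */
};

/* Return true if an event at time now_ms may pass the limiter. */
static bool limiter_allow(struct window_limiter *rl, unsigned long now_ms)
{
	/* Open a new window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->used = 0;
	}

	if (rl->used < rl->burst) {
		rl->used++;
		return true;
	}

	return false;
}

/* Count allowed events for one fault per millisecond over duration_ms. */
static unsigned int count_snapshots(struct window_limiter *rl,
				    unsigned long duration_ms)
{
	unsigned int allowed = 0;
	unsigned long t;

	for (t = 0; t < duration_ms; t++)
		if (limiter_allow(rl, t))
			allowed++;

	return allowed;
}
```

With interval_ms = 5000 and burst = 1, a 10 second fault storm still lets
two captures through in this model, and a 1000 ms interval lets ten
through. Any interval only changes how many redundant snapshots of the
same unrecovered fault get taken.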

I should probably just rename this to once per init_aspace(), which is
essentially once per bootup.

I couldn't come up with a better limiter because, ideally, if we had a
recovery mechanism, we would reset the counter there.

Similar to other DPU errors like underrun and ping-pong timeouts (which
capture the snapshot once per suspend/resume), I just kept it to once
per init_aspace().

SMMU faults happen at a pretty rapid rate, and capturing the full DPU
snapshot each time is redundant. So I thought capturing it at least once
should be enough.
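
The once-per-init_aspace() behaviour can be sketched in userspace as
below. This is only a model under stated assumptions: struct kms_model,
snapshot_state(), init_aspace() and fault_handler() are hypothetical
stand-ins for struct msm_kms, msm_disp_snapshot_state(),
msm_kms_init_aspace() and msm_kms_fault_handler(), and the flag is
already written as a bool.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct msm_kms with only the fields needed here. */
struct kms_model {
	bool fault_snapshot_captured;	/* set after the first snapshot */
	unsigned int snapshots;		/* number of snapshots taken so far */
};

/* Stand-in for msm_disp_snapshot_state(); it only counts invocations. */
static void snapshot_state(struct kms_model *kms)
{
	kms->snapshots++;
}

/* Models msm_kms_init_aspace() re-arming the capture flag. */
static void init_aspace(struct kms_model *kms)
{
	kms->fault_snapshot_captured = false;
}

/* Models the fault handler: snapshot only on the first fault since init. */
static int fault_handler(struct kms_model *kms)
{
	if (!kms->fault_snapshot_captured) {
		snapshot_state(kms);
		kms->fault_snapshot_captured = true;
	}

	return -1;	/* the real handler returns -ENOSYS */
}
```

Calling fault_handler() any number of times after a single init_aspace()
leaves the snapshot count at one in this model. Only another
init_aspace() call re-arms the capture, which is where a future recovery
path could also reset the flag.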


Signed-off-by: Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx>
---
  drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
  drivers/gpu/drm/msm/msm_kms.h | 3 +++
  2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
index d5d3117259cf..90a333920c01 100644
--- a/drivers/gpu/drm/msm/msm_kms.c
+++ b/drivers/gpu/drm/msm/msm_kms.c
@@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
  {
  	struct msm_kms *kms = arg;
  
-	msm_disp_snapshot_state(kms->dev);
+	if (!kms->fault_snapshot_capture) {
+		msm_disp_snapshot_state(kms->dev);
+		kms->fault_snapshot_capture++;

When is it decremented?


It is not, because it will only increment once per bootup. I can switch
this to a bool since it will happen only once, unless we conclude on a
better way.

+	}
  
  	return -ENOSYS;
  }
@@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
  		mmu->funcs->destroy(mmu);
  	}
  
+	kms->fault_snapshot_capture = 0;
  	msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
  
  	return aspace;
diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
index 1e0c54de3716..240b39e60828 100644
--- a/drivers/gpu/drm/msm/msm_kms.h
+++ b/drivers/gpu/drm/msm/msm_kms.h
@@ -134,6 +134,9 @@ struct msm_kms {
  	int irq;
  	bool irq_requested;
  
+	/* rate limit the snapshot capture to once per attach */
+	int fault_snapshot_capture;
+
  	/* mapper-id used to request GEM buffer mapped for scanout: */
  	struct msm_gem_address_space *aspace;
  
--
2.44.0