Re: [PATCH 5/5] drm/msm/dpu: rate limit snapshot capture for mmu faults

Dmitry Baryshkov <dmitry.baryshkov@xxxxxxxxxx> · Wed, 17 Jul 2024 02:10:44 +0300

On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
>
>
>
> On 7/16/2024 2:50 PM, Rob Clark wrote:
> > On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 7/15/2024 12:51 PM, Rob Clark wrote:
> >>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> >>> <dmitry.baryshkov@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> >>>>> There is no recovery mechanism in place yet to recover from mmu
> >>>>> faults for DPU. We can only prevent the faults by making sure there
> >>>>> is no misconfiguration.
> >>>>>
> >>>>> Rate-limit the snapshot capture for mmu faults to once per
> >>>>> msm_kms_init_aspace() as that should be sufficient to capture
> >>>>> the snapshot for debugging otherwise there will be a lot of
> >>>>> dpu snapshots getting captured for the same fault which is
> >>>>> redundant and also might affect capturing even one snapshot
> >>>>> accurately.
> >>>>
> >>>> Please squash this into the first patch. There is no need to add code
> >>>> with a known defficiency.
> >>>>
> >>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
> >>>
> >>> So, in some ways devcoredump is ratelimited by userspace needing to
> >>> clear an existing devcore..
> >>>
> >>
> >> Yes, a new devcoredump device will not be created until the previous one
> >> is consumed or times out but here I am trying to limit even the DPU
> >> snapshot capture because DPU register space is really huge and the rate
> >> at which smmu faults occur is quite fast that its causing instability
> >> while snapshots are being captured.
> >>
> >>> What I'd suggest would be more useful is to limit the devcores to once
> >>> per atomic update, ie. if display state hasn't changed, maybe an
> >>> additional devcore isn't useful
> >>>
> >>> BR,
> >>> -R
> >>>
> >>
> >> By display state change, do you mean like the checks we have in
> >> drm_atomic_crtc_needs_modeset()?
> >>
> >> OR do you mean we need to cache the previous (currently picked up by hw)
> >> state and trigger a new devcores if the new state is different by
> >> comparing more things?
> >>
> >> This will help to reduce the snapshots to unique frame updates but I do
> >> not think it will reduce the rate enough for the case where DPU did not
> >> recover from the previous fault.
> >
> > I was thinking the easy thing, of just resetting the counter in
> > msm_atomic_commit_tail().. I suppose we could be clever filter out
> > updates that only change scanout address.  Or hash the atomic state
> > and only generate devcoredumps for unique states.  But I'm not sure
> > how over-complicated we should make this.
> >
> > BR,
> > -R
>
> Its a good idea actually and I would also like to keep it simple :)
>
> One question, is it okay to assume that all compositors will only issue
> unique commits? Because we are assuming thats the case with resetting
> the counter in msm_atomic_commit_tail().

The compositors use drm_mode_atomic_ioctl() which allocates a new
state each time. It is impossible to re-submit the same
drm_atomic_state from userspace.

-- 
With best wishes
Dmitry