On Wed, 17 Jul 2024 at 02:15, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
>
> On 7/16/2024 4:10 PM, Dmitry Baryshkov wrote:
> > On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>
> >> On 7/16/2024 2:50 PM, Rob Clark wrote:
> >>> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>>>
> >>>> On 7/15/2024 12:51 PM, Rob Clark wrote:
> >>>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> >>>>> <dmitry.baryshkov@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> >>>>>>> There is no recovery mechanism in place yet to recover from mmu
> >>>>>>> faults for DPU. We can only prevent the faults by making sure there
> >>>>>>> is no misconfiguration.
> >>>>>>>
> >>>>>>> Rate-limit the snapshot capture for mmu faults to once per
> >>>>>>> msm_kms_init_aspace() as that should be sufficient to capture
> >>>>>>> the snapshot for debugging; otherwise there will be a lot of
> >>>>>>> dpu snapshots getting captured for the same fault, which is
> >>>>>>> redundant and also might affect capturing even one snapshot
> >>>>>>> accurately.
> >>>>>>
> >>>>>> Please squash this into the first patch. There is no need to add
> >>>>>> code with a known deficiency.
> >>>>>>
> >>>>>> Also, is there a reason why you haven't used <linux/ratelimit.h>?
> >>>>>
> >>>>> So, in some ways devcoredump is ratelimited by userspace needing to
> >>>>> clear an existing devcore..
> >>>>
> >>>> Yes, a new devcoredump device will not be created until the previous
> >>>> one is consumed or times out, but here I am trying to limit even the
> >>>> DPU snapshot capture, because the DPU register space is really huge
> >>>> and smmu faults occur at such a fast rate that it's causing
> >>>> instability while snapshots are being captured.
> >>>>
> >>>>> What I'd suggest would be more useful is to limit the devcores to
> >>>>> once per atomic update, i.e. if the display state hasn't changed,
> >>>>> maybe an additional devcore isn't useful.
> >>>>>
> >>>>> BR,
> >>>>> -R
> >>>>
> >>>> By display state change, do you mean something like the checks we
> >>>> have in drm_atomic_crtc_needs_modeset()?
> >>>>
> >>>> Or do you mean we need to cache the previous state (the one currently
> >>>> picked up by hw) and trigger a new devcore if the new state differs
> >>>> when comparing more things?
> >>>>
> >>>> This will help to reduce the snapshots to unique frame updates, but I
> >>>> do not think it will reduce the rate enough for the case where the
> >>>> DPU did not recover from the previous fault.
> >>>
> >>> I was thinking the easy thing, of just resetting the counter in
> >>> msm_atomic_commit_tail().. I suppose we could be clever and filter out
> >>> updates that only change the scanout address, or hash the atomic state
> >>> and only generate devcoredumps for unique states. But I'm not sure how
> >>> over-complicated we should make this.
> >>>
> >>> BR,
> >>> -R
> >>
> >> It's a good idea actually, and I would also like to keep it simple :)
> >>
> >> One question: is it okay to assume that all compositors will only
> >> issue unique commits? Because we are assuming that's the case when
> >> resetting the counter in msm_atomic_commit_tail().
> >
> > The compositors use drm_mode_atomic_ioctl(), which allocates a new
> > state each time. It is impossible to re-submit the same
> > drm_atomic_state from userspace.
>
> No, what I meant was: is it okay to assume that a commit is issued only
> when the display configuration has changed?

No.

> Like if we get multiple commits for the same frame for some reason,
> that's also spam, and this approach will not avoid that.

I'd say a new commit means that userspace thinks that something changed.
So if the driver got another hang / issue / etc., it should try
capturing a new state.
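To sketch what that could look like (untested, just to illustrate the
idea; the fault_snapshot_capture field and the handler body below are
made up for illustration, not taken from the actual patch):

/* hypothetical per-kms flag, e.g. in struct msm_kms */
	atomic_t fault_snapshot_capture;

static int msm_kms_fault_handler(void *arg, unsigned long iova,
				 int flags, void *data)
{
	struct msm_kms *kms = arg;

	/*
	 * Capture at most one snapshot until the next atomic commit
	 * re-arms the flag below; cmpxchg keeps concurrent faults from
	 * racing to capture twice.
	 */
	if (atomic_cmpxchg(&kms->fault_snapshot_capture, 0, 1) == 0)
		msm_disp_snapshot_state(kms->dev);

	/* let the core iommu code report the fault as usual */
	return -ENOSYS;
}

and then in msm_atomic_commit_tail():

	/*
	 * A new commit means userspace thinks something changed, so a
	 * later fault deserves a fresh snapshot: re-arm the flag.
	 */
	atomic_set(&kms->fault_snapshot_capture, 0);

--
With best wishes
Dmitry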