On Wed, 17 Jul 2024 at 02:15, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
>
> On 7/16/2024 4:10 PM, Dmitry Baryshkov wrote:
> > On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>
> >> On 7/16/2024 2:50 PM, Rob Clark wrote:
> >>> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>>>
> >>>> On 7/15/2024 12:51 PM, Rob Clark wrote:
> >>>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> >>>>> <dmitry.baryshkov@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> >>>>>>> There is no recovery mechanism in place yet to recover from mmu
> >>>>>>> faults for DPU. We can only prevent the faults by making sure there
> >>>>>>> is no misconfiguration.
> >>>>>>>
> >>>>>>> Rate-limit the snapshot capture for mmu faults to once per
> >>>>>>> msm_kms_init_aspace() as that should be sufficient to capture
> >>>>>>> the snapshot for debugging; otherwise there will be a lot of
> >>>>>>> dpu snapshots getting captured for the same fault, which is
> >>>>>>> redundant and also might affect capturing even one snapshot
> >>>>>>> accurately.
> >>>>>>
> >>>>>> Please squash this into the first patch. There is no need to add
> >>>>>> code with a known deficiency.
> >>>>>>
> >>>>>> Also, is there a reason why you haven't used <linux/ratelimit.h>?
> >>>>>
> >>>>> So, in some ways devcoredump is ratelimited by userspace needing to
> >>>>> clear an existing devcore..
> >>>>
> >>>> Yes, a new devcoredump device will not be created until the previous
> >>>> one is consumed or times out, but here I am trying to limit even the
> >>>> DPU snapshot capture, because the DPU register space is really huge
> >>>> and smmu faults occur at such a fast rate that it's causing
> >>>> instability while snapshots are being captured.
> >>>>
> >>>>> What I'd suggest would be more useful is to limit the devcores to
> >>>>> once per atomic update, i.e. if the display state hasn't changed,
> >>>>> maybe an additional devcore isn't useful.
> >>>>>
> >>>>> BR,
> >>>>> -R
> >>>>
> >>>> By display state change, do you mean something like the checks we
> >>>> have in drm_atomic_crtc_needs_modeset()?
> >>>>
> >>>> Or do you mean we need to cache the previous state (the one currently
> >>>> picked up by hw) and trigger a new devcore if the new state differs
> >>>> when comparing more things?
> >>>>
> >>>> This will help to reduce the snapshots to unique frame updates, but I
> >>>> do not think it will reduce the rate enough for the case where the
> >>>> DPU did not recover from the previous fault.
> >>>
> >>> I was thinking the easy thing, of just resetting the counter in
> >>> msm_atomic_commit_tail().. I suppose we could be clever and filter out
> >>> updates that only change the scanout address, or hash the atomic state
> >>> and only generate devcoredumps for unique states. But I'm not sure how
> >>> over-complicated we should make this.
> >>>
> >>> BR,
> >>> -R
> >>
> >> It's a good idea actually, and I would also like to keep it simple :)
> >>
> >> One question: is it okay to assume that all compositors will only
> >> issue unique commits? Because we are assuming that's the case when
> >> resetting the counter in msm_atomic_commit_tail().
> >
> > The compositors use drm_mode_atomic_ioctl(), which allocates a new
> > state each time. It is impossible to re-submit the same
> > drm_atomic_state from userspace.
>
> No, what I meant was: is it okay to assume that a commit is issued only
> when the display configuration has changed?

No.

> Like if we get multiple commits for the same frame for some reason,
> that's also spam, and this approach will not avoid that.

I'd say a new commit means that userspace thinks that something changed.
So if the driver got another hang / issue / etc., it should try
capturing a new state.
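To sketch what that could look like (untested, just to illustrate the
idea; the fault_snapshot_capture field and the handler body below are
made up for illustration, not taken from the actual patch):

/* hypothetical per-kms flag, e.g. in struct msm_kms */
	atomic_t fault_snapshot_capture;

static int msm_kms_fault_handler(void *arg, unsigned long iova,
				 int flags, void *data)
{
	struct msm_kms *kms = arg;

	/*
	 * Capture at most one snapshot until the next atomic commit
	 * re-arms the flag below; cmpxchg keeps concurrent faults from
	 * racing to capture twice.
	 */
	if (atomic_cmpxchg(&kms->fault_snapshot_capture, 0, 1) == 0)
		msm_disp_snapshot_state(kms->dev);

	/* let the core iommu code report the fault as usual */
	return -ENOSYS;
}

and then in msm_atomic_commit_tail():

	/*
	 * A new commit means userspace thinks something changed, so a
	 * later fault deserves a fresh snapshot: re-arm the flag.
	 */
	atomic_set(&kms->fault_snapshot_capture, 0);

--
With best wishes
Dmitry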