On Wed, 21 Feb 2024 at 01:04, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
>
>
>
> On 2/20/2024 2:42 PM, Dmitry Baryshkov wrote:
> > On Wed, 21 Feb 2024 at 00:40, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2/19/2024 3:52 AM, Dmitry Baryshkov wrote:
> >>> On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote:
> >>>>> On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk@xxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote:
> >>>>>>> We have several reports of vblank timeout messages. However after some
> >>>>>>> debugging it was found that there might be different causes to that.
> >>>>>>> To allow us to identify the DPU block that gets stuck, include the
> >>>>>>> actual CTL_FLUSH value into the timeout message and trigger the devcore
> >>>>>>> snapshot capture.
> >>>>>>>
> >>>>>>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@xxxxxxxxxx>
> >>>>>>> ---
> >>>>>>> Changes in v2:
> >>>>>>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump
> >>>>>>>   (Abhinav)
> >>>>>>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@xxxxxxxxxx
> >>>>>>> ---
> >>>>>>>  drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++-
> >>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
> >>>>>>> index d0f56c5c4cce..a8d6165b3c0a 100644
> >>>>>>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
> >>>>>>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
> >>>>>>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done(
> >>>>>>>                        (hw_ctl->ops.get_flush_register(hw_ctl) == 0),
> >>>>>>>                        msecs_to_jiffies(50));
> >>>>>>>        if (ret <= 0) {
> >>>>>>> -              DPU_ERROR("vblank timeout\n");
> >>>>>>> +              DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl));
> >>>>>>> +              msm_disp_snapshot_state(phys_enc->parent->dev);
> >>>>>>
> >>>>>>
> >>>>>> There is no rate limiting in this piece of code unfortunately. So this
> >>>>>> will flood the number of snapshots.
> >>>>>
> >>>>> Well... Yes and no. The devcoredump will destroy other snapshots if
> >>>>> there is a pending one. So only the console will be flooded, and only
> >>>>> in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled.
> >>>>>
> >>>>
> >>>> Yes, true, but at the same time this makes it hard to capture a good dump,
> >>>> as potentially every vblank you could time out, so this destroy/create
> >>>> cycle won't end.
> >>>
> >>> Excuse me, maybe I miss something. On the first timeout the snapshot
> >>> is created. It is held by the kernel until it is fully read out from
> >>> the userspace. Other snapshots will not interfere with this snapshot.
> >>>
> >>
> >> For every new snapshot a new devcoredump device will be created, which
> >> should remain till it has been read. But now this will be created every
> >> vblank. IMO, this is really too much data for no reason.
> >
> > No-no-no. If there is a devcoredump for a device, the next one will
> > not be created. See dev_coredumpm().
> > So all the snapshots will be created and then destroyed immediately.
> >
>
> hmm ... I have certainly seen devcd_count go higher than one (but not
> more than 2). I am wondering whether this was because of some race
> condition of the previous destroy / new create.
>
> But anyway, this part is clear now. thanks.
>
> >>
> >> Subsequent vblank timeouts are not going to give any new information
> >> compared to the existing snapshot of the first vblank timeout; that's why
> >> we should just create the snapshot when the first error happens and stop.
> >>
> >> For other frame done timeouts, in fact subsequent timeouts without any
> >> sort of recovery in between are quite misleading, because the hardware was
> >> already not able to fetch the previous frame, so it will most likely not
> >> fetch the next one either till it has recovered. Typically that's why
> >> these vblank timeouts happen in a flurry, as the hardware never really
> >> recovered from the first timeout.
> >>
> >>> Or are you worried that snapshotting takes time, so taking a snapshot
> >>> will also interfere with the vblank timings for the next vblank?
> >>>
> >>
> >> Yes this is another point.
> >
>
> Snapshots will still be captured every vblank timeout, and reading
> through the entire DPU reg space every vblank timeout is certainly
> something we can avoid.

Ack.

--
With best wishes
Dmitry
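
P.S. One way to honour the "snapshot only on the first error" suggestion
would be a one-shot guard around the msm_disp_snapshot_state() call.
The snippet below is only an untested sketch to illustrate the idea
discussed above: the static atomic is a placeholder, and the real flag
would probably want to live in dpu_kms (or the encoder) and be cleared
once the hardware has recovered.

	/* hypothetical one-shot guard; placement and reset policy TBD */
	static atomic_t vblank_snapshot_taken = ATOMIC_INIT(0);

	if (ret <= 0) {
		DPU_ERROR("vblank timeout: %x\n",
			  hw_ctl->ops.get_flush_register(hw_ctl));
		/* skip the expensive DPU register dump on repeat timeouts */
		if (atomic_cmpxchg(&vblank_snapshot_taken, 0, 1) == 0)
			msm_disp_snapshot_state(phys_enc->parent->dev);
	}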