Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency

Qiang Yu <yuq825@xxxxxxxxx> · Tue, 23 Jan 2024 09:18:02 +0800



On Sun, Jan 21, 2024 at 11:11 PM Erico Nunes <nunes.erico@xxxxxxxxx> wrote:
>
> On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825@xxxxxxxxx> wrote:
> >
> > On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
> > >
> > > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@xxxxxxxxx>
> > > >
> > > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> > > >       struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> > > >       struct lima_sched_task *task = to_lima_task(job);
> > > >       struct lima_device *ldev = pipe->ldev;
> > > > +     struct lima_ip *ip = pipe->processor[0];
> > > > +
> > > > +     /*
> > > > +      * If the GPU managed to complete this jobs fence, the timeout is
> > > > +      * spurious. Bail out.
> > > > +      */
> > > > +     if (dma_fence_is_signaled(task->done_fence)) {
> > > > +             DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > > > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > > > +     }
> > >
> > > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define,
> > > and stop selling bandaid like this because you have options like locating
> > > the reasons behind timeout.
> >
> > This chang do look like to aim for 2FPS apps. Maybe 500ms is too short
> > for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses
> > 10s timeout. So increasing the timeout seems to be an equivalent and better
> > way?
>
> Indeed 500ms might be too optimistic for the sort of applications that
> users expect to run on this hardware currently. For a more similar
> reference though, other embedded drivers like v3d and panfrost do
> still set it to 500ms. Note that this patch is just exactly the same
> as exists in Panfrost today and was already discussed with some common
> arguments in the patches of this series:
> https://patchwork.freedesktop.org/series/120820/
>
> But I would agree to bump the timeout to a higher value for lima. Some
> distributions are already doing this with module parameters anyway to
> even be able to run some more demanding application stacks on a Mali
> 400.
>
> Another thing we might consider (probably in a followup patchset to
> not delay these fixes forever for the people hitting this issue) is to
> configure the Mali hardware watchdog to the value that we would like
> as a timeout. That way we would get timeout jobs going through the
> same error irq path as other hardware error jobs and might be able to
> delete(?)/simplify this software timeout code.
>
This way should be much simpler and stable.

> In the meantime for v2 of this series I'll make the change to account
> for the multiple pp irqs. So do you agree it is ok to leave
> drm_sched_stop() as it is at least for this series?
>
I'm OK with this.

> Thanks all for the reviews
>
> Erico