On Sun, Jan 21, 2024 at 11:11 PM Erico Nunes <nunes.erico@xxxxxxxxx> wrote: > > On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825@xxxxxxxxx> wrote: > > > > On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@xxxxxxxx> wrote: > > > > > > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@xxxxxxxxx> > > > > > > > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job > > > > struct lima_sched_pipe *pipe = to_lima_pipe(job->sched); > > > > struct lima_sched_task *task = to_lima_task(job); > > > > struct lima_device *ldev = pipe->ldev; > > > > + struct lima_ip *ip = pipe->processor[0]; > > > > + > > > > + /* > > > > + * If the GPU managed to complete this jobs fence, the timeout is > > > > + * spurious. Bail out. > > > > + */ > > > > + if (dma_fence_is_signaled(task->done_fence)) { > > > > + DRM_WARN("%s spurious timeout\n", lima_ip_name(ip)); > > > > + return DRM_GPU_SCHED_STAT_NOMINAL; > > > > + } > > > > > > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define, > > > and stop selling bandaid like this because you have options like locating > > > the reasons behind timeout. > > > > This chang do look like to aim for 2FPS apps. Maybe 500ms is too short > > for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses > > 10s timeout. So increasing the timeout seems to be an equivalent and better > > way? > > Indeed 500ms might be too optimistic for the sort of applications that > users expect to run on this hardware currently. For a more similar > reference though, other embedded drivers like v3d and panfrost do > still set it to 500ms. Note that this patch is just exactly the same > as exists in Panfrost today and was already discussed with some common > arguments in the patches of this series: > https://patchwork.freedesktop.org/series/120820/ > > But I would agree to bump the timeout to a higher value for lima. Some > distributions are already doing this with module parameters anyway to > even be able to run some more demanding application stacks on a Mali > 400. > > Another thing we might consider (probably in a followup patchset to > not delay these fixes forever for the people hitting this issue) is to > configure the Mali hardware watchdog to the value that we would like > as a timeout. That way we would get timeout jobs going through the > same error irq path as other hardware error jobs and might be able to > delete(?)/simplify this software timeout code. > This way should be much simpler and stable. > In the meantime for v2 of this series I'll make the change to account > for the multiple pp irqs. So do you agree it is ok to leave > drm_sched_stop() as it is at least for this series? > I'm OK with this. > Thanks all for the reviews > > Erico