On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825@xxxxxxxxx> wrote: > > On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@xxxxxxxx> wrote: > > > > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@xxxxxxxxx> > > > > > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job > > > struct lima_sched_pipe *pipe = to_lima_pipe(job->sched); > > > struct lima_sched_task *task = to_lima_task(job); > > > struct lima_device *ldev = pipe->ldev; > > > + struct lima_ip *ip = pipe->processor[0]; > > > + > > > + /* > > > + * If the GPU managed to complete this jobs fence, the timeout is > > > + * spurious. Bail out. > > > + */ > > > + if (dma_fence_is_signaled(task->done_fence)) { > > > + DRM_WARN("%s spurious timeout\n", lima_ip_name(ip)); > > > + return DRM_GPU_SCHED_STAT_NOMINAL; > > > + } > > > > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define, > > and stop selling bandaid like this because you have options like locating > > the reasons behind timeout. > > This chang do look like to aim for 2FPS apps. Maybe 500ms is too short > for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses > 10s timeout. So increasing the timeout seems to be an equivalent and better > way? Indeed 500ms might be too optimistic for the sort of applications that users expect to run on this hardware currently. For a more similar reference though, other embedded drivers like v3d and panfrost do still set it to 500ms. Note that this patch is just exactly the same as exists in Panfrost today and was already discussed with some common arguments in the patches of this series: https://patchwork.freedesktop.org/series/120820/ But I would agree to bump the timeout to a higher value for lima. Some distributions are already doing this with module parameters anyway to even be able to run some more demanding application stacks on a Mali 400. Another thing we might consider (probably in a followup patchset to not delay these fixes forever for the people hitting this issue) is to configure the Mali hardware watchdog to the value that we would like as a timeout. That way we would get timeout jobs going through the same error irq path as other hardware error jobs and might be able to delete(?)/simplify this software timeout code. In the meantime for v2 of this series I'll make the change to account for the multiple pp irqs. So do you agree it is ok to leave drm_sched_stop() as it is at least for this series? Thanks all for the reviews Erico