Reviewed-by: Qiang Yu <yuq825@xxxxxxxxx>

On Tue, Apr 2, 2024 at 5:20 AM Erico Nunes <nunes.erico@xxxxxxxxx> wrote:
>
> There is a race condition in which a rendering job might take just long
> enough to trigger the drm sched job timeout handler but also still
> complete before the hard reset is done by the timeout handler.
> This runs into race conditions not expected by the timeout handler.
> In some very specific cases it currently may result in a refcount
> imbalance on lima_pm_idle, with a stack dump such as:
>
> [10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669628] Call trace:
> [10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
> [10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
> [10136.669656]  lima_gp_irq_handler+0xa8/0x120
> [10136.669666]  __handle_irq_event_percpu+0x48/0x160
> [10136.669679]  handle_irq_event+0x4c/0xc0
>
> We can prevent that race condition entirely by masking the irqs at the
> beginning of the timeout handler, at which point we give up on waiting
> for that job entirely.
> The irqs will be enabled again at the next hard reset which is already
> done as a recovery by the timeout handler.
>
> Signed-off-by: Erico Nunes <nunes.erico@xxxxxxxxx>
> ---
>  drivers/gpu/drm/lima/lima_sched.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 66841503a618..bbf3f8feab94 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -430,6 +430,13 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>  	}
>
> +	/*
> +	 * The task might still finish while this timeout handler runs.
> +	 * To prevent a race condition on its completion, mask all irqs
> +	 * on the running core until the next hard reset completes.
> +	 */
> +	pipe->task_mask_irq(pipe);
> +
>  	if (!pipe->error)
>  		DRM_ERROR("%s job timeout\n", lima_ip_name(ip));
>
> --
> 2.44.0
>
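
For anyone following the thread, the race described in the commit message can be illustrated with a small userspace toy model. This is only a sketch and not lima code: every name below (toy_pipe, toy_run and so on) is hypothetical, and it assumes that both the completion irq path and the timeout recovery path record the pipe as idle once, which is what the stack dump above suggests.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in the lima driver. */
struct toy_pipe {
	int busy_count;   /* stands in for the busy/idle accounting */
	int imbalance;    /* idle records that had no matching busy record */
	bool irq_masked;  /* set once the timeout handler masks irqs */
};

static void toy_record_busy(struct toy_pipe *p)
{
	p->busy_count++;
}

static void toy_record_idle(struct toy_pipe *p)
{
	/* Models the WARNING in the trace: idle recorded while not busy. */
	if (p->busy_count == 0) {
		p->imbalance++;
		return;
	}
	p->busy_count--;
}

/* Completion irq: it never reaches the handler while irqs are masked. */
static void toy_irq_task_done(struct toy_pipe *p)
{
	if (p->irq_masked)
		return;
	toy_record_idle(p);
}

/* Recovery path (assumption): records idle once and unmasks irqs. */
static void toy_hard_reset(struct toy_pipe *p)
{
	toy_record_idle(p);
	p->irq_masked = false;
}

/*
 * Replays the problematic ordering: the task starts, the timeout handler
 * begins, the task completes late, then the hard reset runs.
 * Returns the number of unbalanced idle records.
 */
static int toy_run(bool mask_at_timeout_start)
{
	struct toy_pipe p = {0};

	toy_record_busy(&p);            /* task submitted to the core */
	if (mask_at_timeout_start)
		p.irq_masked = true;    /* what the patch adds */
	toy_irq_task_done(&p);          /* late completion irq */
	toy_hard_reset(&p);             /* timeout handler recovery */
	return p.imbalance;
}
```

In this model toy_run(false) reports one unbalanced idle record, because the late irq and the recovery path both record idle, while toy_run(true) reports none, because the late irq is dropped and only the recovery path accounts for the abandoned job.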