On 04/10/2019 16:53, Grodzovsky, Andrey wrote:
>
> On 10/3/19 4:34 AM, Neil Armstrong wrote:
>> Hi Andrey,
>>
>> On 02/10/2019 at 16:40, Grodzovsky, Andrey wrote:
>>> On 9/30/19 10:52 AM, Hillf Danton wrote:
>>>> On Mon, 30 Sep 2019 11:17:45 +0200 Neil Armstrong wrote:
>>>>> Did a new run from 5.3:
>>>>>
>>>>> [   35.971972] Call trace:
>>>>> [   35.974391]  drm_sched_increase_karma+0x5c/0xf0
>>>>>                  ffff000010667f38 FFFF000010667F94
>>>>>                  drivers/gpu/drm/scheduler/sched_main.c:335
>>>>>
>>>>> The crashing line is:
>>>>>                 if (bad->s_fence->scheduled.context ==
>>>>>                     entity->fence_context) {
>>>>>
>>>>> Doesn't seem related to the guilty job.
>>>> Bail out if s_fence is no longer fresh.
>>>>
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -333,6 +333,10 @@ void drm_sched_increase_karma(struct drm
>>>>
>>>>  		spin_lock(&rq->lock);
>>>>  		list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>> +			if (!smp_load_acquire(&bad->s_fence)) {
>>>> +				spin_unlock(&rq->lock);
>>>> +				return;
>>>> +			}
>>>>  			if (bad->s_fence->scheduled.context ==
>>>>  			    entity->fence_context) {
>>>>  				if (atomic_read(&bad->karma) >
>>>> @@ -543,7 +547,7 @@ EXPORT_SYMBOL(drm_sched_job_init);
>>>>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>>  {
>>>>  	dma_fence_put(&job->s_fence->finished);
>>>> -	job->s_fence = NULL;
>>>> +	smp_store_release(&job->s_fence, 0);
>>>>  }
>>>>  EXPORT_SYMBOL(drm_sched_job_cleanup);
>> This fixed the problem on the 10 CI runs.
>>
>> Neil
>
>
> This is good news, but I still fail to see how this fixes the problem.
> Hillf, do you mind explaining how you came up with this particular fix?
> What was the bug you saw?

As Steven explained, it seems the same job was submitted on both HW slots,
and when the timeout occurred each thread called panfrost_job_timedout(),
which led to drm_sched_stop() on the first call; by the second call the job
had already been freed.

Steven proposed a working fix, and this one does the same but on the
drm_sched side. This one looks cleaner, but panfrost should not call
drm_sched_stop() twice for the same job.

Neil

>
> Andrey
>
>>
>>> Does this change help the problem? Note that drm_sched_job_cleanup is
>>> called from the scheduler thread, which is stopped at all times while the
>>> work_tdr thread is running, and anyway the 'bad' job is still in the
>>> ring_mirror_list while it's being accessed from drm_sched_increase_karma,
>>> so I don't think drm_sched_job_cleanup can be called for it BEFORE or
>>> while drm_sched_increase_karma is executed.
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> --
>>>>
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
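
For illustration, a guard along the following lines would ensure only one of
the per-slot timeout handlers actually performs the stop/reset/start sequence.
This is only a sketch, not Steven's actual patch; the my_device, reset_pending
and my_job_timedout names are made up for the example, and a real driver also
has to deal with the second scheduler still referencing the bad job:

/*
 * Illustrative sketch only (hypothetical struct and field names).  When the
 * same job is resident on both HW slots, both slots' timeout handlers fire;
 * only the first one to get here should run the reset sequence, otherwise
 * drm_sched_stop() is called twice and the second caller can touch a job
 * that has already been freed.
 */

#include <linux/atomic.h>
#include <drm/gpu_scheduler.h>

#define MY_NUM_JOB_SLOTS 2

struct my_device {
	struct drm_gpu_scheduler sched[MY_NUM_JOB_SLOTS];
	atomic_t reset_pending;		/* 0 = idle, 1 = a reset is in flight */
};

static void my_job_timedout(struct my_device *dev, struct drm_sched_job *bad)
{
	int i;

	/* Only the first timeout handler to get here performs the reset. */
	if (atomic_cmpxchg(&dev->reset_pending, 0, 1) != 0)
		return;

	/* Stop every slot's scheduler once, for the one bad job. */
	for (i = 0; i < MY_NUM_JOB_SLOTS; i++)
		drm_sched_stop(&dev->sched[i], bad);

	drm_sched_increase_karma(bad);

	/* ... hardware reset and job re-submission would go here ... */

	for (i = 0; i < MY_NUM_JOB_SLOTS; i++)
		drm_sched_start(&dev->sched[i], true);

	atomic_set(&dev->reset_pending, 0);
}

An atomic_cmpxchg() is enough for the sketch because the losing handler can
simply return; a real driver might instead serialise on a mutex so that both
handlers wait for the reset to finish before touching their slot again.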