On Mon, Oct 23, 2023 at 02:39:37PM +0200, Boris Brezillon wrote: > On Mon, 23 Oct 2023 14:16:06 +0200 > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote: > > > Hi, > > > > On Tue, 17 Oct 2023 08:09:56 -0700 > > Matthew Brost <matthew.brost@xxxxxxxxx> wrote: > > > > > +static void drm_sched_run_job_work(struct work_struct *w) > > > +{ > > > + struct drm_gpu_scheduler *sched = > > > + container_of(w, struct drm_gpu_scheduler, work_run_job); > > > + struct drm_sched_entity *entity; > > > + struct dma_fence *fence; > > > + struct drm_sched_fence *s_fence; > > > + struct drm_sched_job *sched_job; > > > + int r; > > > > > > - atomic_inc(&sched->hw_rq_count); > > > - drm_sched_job_begin(sched_job); > > > + if (READ_ONCE(sched->pause_submit)) > > > + return; > > > + > > > + entity = drm_sched_select_entity(sched, true); > > > + if (!entity) > > > + return; > > > > > > - trace_drm_run_job(sched_job, entity); > > > - fence = sched->ops->run_job(sched_job); > > > + sched_job = drm_sched_entity_pop_job(entity); > > > + if (!sched_job) { > > > complete_all(&entity->entity_idle); > > > - drm_sched_fence_scheduled(s_fence, fence); > > > + return; /* No more work */ > > > + } > > > > > > - if (!IS_ERR_OR_NULL(fence)) { > > > - /* Drop for original kref_init of the fence */ > > > - dma_fence_put(fence); > > > + s_fence = sched_job->s_fence; > > > > > > - r = dma_fence_add_callback(fence, &sched_job->cb, > > > - drm_sched_job_done_cb); > > > - if (r == -ENOENT) > > > - drm_sched_job_done(sched_job, fence->error); > > > - else if (r) > > > - DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", > > > - r); > > > - } else { > > > - drm_sched_job_done(sched_job, IS_ERR(fence) ? > > > - PTR_ERR(fence) : 0); > > > - } > > > + atomic_inc(&sched->hw_rq_count); > > > + drm_sched_job_begin(sched_job); > > > > > > - wake_up(&sched->job_scheduled); > > > + trace_drm_run_job(sched_job, entity); > > > + fence = sched->ops->run_job(sched_job); > > > + complete_all(&entity->entity_idle); > > > + drm_sched_fence_scheduled(s_fence, fence); > > > + > > > + if (!IS_ERR_OR_NULL(fence)) { > > > + /* Drop for original kref_init of the fence */ > > > + dma_fence_put(fence); > > > + > > > + r = dma_fence_add_callback(fence, &sched_job->cb, > > > + drm_sched_job_done_cb); > > > + if (r == -ENOENT) > > > + drm_sched_job_done(sched_job, fence->error); > > > + else if (r) > > > + DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n", r); > > > + } else { > > > + drm_sched_job_done(sched_job, IS_ERR(fence) ? > > > + PTR_ERR(fence) : 0); > > > } > > > > Just ran into a race condition when using a non-ordered workqueue > > for drm_sched: > > > > thread A thread B > > > > drm_sched_run_job_work() > > drm_sched_job_begin() > > // inserts jobA in pending_list > > > > drm_sched_free_job_work() > > drm_sched_get_cleanup_job() > > // check first job in pending list > > // if s_fence->parent == NULL, consider > > // for cleanup > > ->free_job(jobA) > > drm_sched_job_cleanup() > > // set sched_job->s_fence = NULL > > > > ->run_job() > > drm_sched_fence_scheduled() > > Correction: the NULL pointer deref happens in drm_sched_job_done() > (when the driver returns an error directly) not in > drm_sched_fence_scheduled(), but the problem remains the same. > > Trying to understand this. I don't see how drm_sched_get_cleanup_job can return a job until dma_fence_is_signaled(&job->s_fence->finished) is true. That fence is no signaled until drm_sched_fence_finished(s_fence, result); called in drm_sched_job_done(). What am I missing here? Matt > > // sched_job->s_fence->parent = parent_fence > > // BOOM => NULL pointer deref > > > > For now, I'll just use a dedicated ordered wq, but if we claim > > multi-threaded workqueues are supported, this is probably worth fixing. > > I know there's been some discussions about when the timeout should be > > started, and the job insertion in the pending_list is kinda related. > > If we want this insertion to happen before ->run_job() is called, we > > need a way to flag when a job is inserted, but not fully submitted yet. > > > > Regards, > > > > Boris >