On Thu, 2024-11-21 at 10:05 -0800, Matthew Brost wrote: > On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote: > > Christian, Sima? > > > > Matthew? (+CC) > > > > Sorry catching up here. Internally in Xe we ref count the scheduler > as > the scheduler is embedded into our user queue structure > (xe_exec_queue). > Jobs ref the queue once they are armed. Upon queue killing we set TDR > to > zero which will flush out all jobs / signal all the job's fences. > Once > the ref count of queue is zero we do hardware / firmware teardown of > the > queue and then finally call drm_sched_fini before freeing the queues > memory (and thus the scheduler). > > This flow seems to work pretty well but largely bypasses the > scheduler > functions to implement this. Not sure if this flow could be > normalized > at all but I would expect usage models of a 1 to 1 relationship > between > queue and scheduler to something similar to Xe's flow. One of the solutions I proposed basically does the same: jobs added to gpu_scheduler.pending_list refcount the scheduler. If we went for that solution, I assume the separate refcounting could then be removed from Xe, for example (see the sketch after the quoted patch description below). > Maybe we could > write this done as design guideline so other drivers don't have to > figure this out - this took me a bit to land on this. I already improved the documentation a while ago [1], but didn't detail how drivers are supposed to solve this. The reason is that I don't want to encourage more separate solutions while we're working on solving it centrally. > With that, in general I agree with Christian's patch here complaining > about pending jobs if drm_sched_fini is called. Yes, firing a WARN_ON there is also fine from my POV. P. [1] https://lore.kernel.org/dri-devel/20241105143137.71893-2-pstanner@xxxxxxxxxx/ > > > Opinions on the below? > > > > tl;dr: > > I still think it's a good thing to detectably block in > > drm_sched_fini(), or at the very least drm_sched_flush(), because > > then > > See above. I don't think drm_sched_fini should block rather just > complain this patch does which says 'go fix your driver'. > > Matt > > > you'll find out that the driver is broken and can repair it. > > > > P. > > > > > > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote: > > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote: > > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König > > > > wrote: > > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner: > > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote: > > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König > > > > > > > wrote: > > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner: > > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König > > > > > > > > > wrote: > > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner: > > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian > > > > > > > > > > > König > > > > > > > > > > > wrote: > > > > > > > > > > > > Tearing down the scheduler with jobs still on > > > > > > > > > > > > the > > > > > > > > > > > > pending > > > > > > > > > > > > list > > > > > > > > > > > > can > > > > > > > > > > > > lead to use after free issues. Add a warning if > > > > > > > > > > > > drivers try > > > > > > > > > > > > to > > > > > > > > > > > > destroy a scheduler which still has work pushed > > > > > > > > > > > > to > > > > > > > > > > > > the HW.
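To make the pending-list refcount idea above a bit more concrete, here's a minimal driver-side sketch. All names (example_queue, example_job_armed(), example_job_freed()) are made up for illustration; this is neither the actual Xe code nor the patch from [1]:

#include <linux/kref.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical driver object embedding the scheduler, like xe_exec_queue. */
struct example_queue {
        struct kref refcount;
        struct drm_gpu_scheduler sched;
};

static void example_queue_release(struct kref *kref)
{
        struct example_queue *q = container_of(kref, struct example_queue,
                                               refcount);

        /*
         * The last job reference is gone, so the pending_list is empty by
         * construction and drm_sched_fini() cannot leave jobs behind.
         */
        drm_sched_fini(&q->sched);
        kfree(q);
}

/* Call when a job is armed, i.e. when it ends up on the pending list. */
static void example_job_armed(struct example_queue *q)
{
        kref_get(&q->refcount);
}

/* Call from the driver's ->free_job() callback. */
static void example_job_freed(struct example_queue *q)
{
        kref_put(&q->refcount, example_queue_release);
}

The queue's own initial reference (taken at creation) would be dropped when userspace destroys the handle, so the scheduler simply outlives the handle until the last pending job has been freed.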
> > > > > > > > > > > Did you have time yet to look into my proposed > > > > > > > > > > > waitque- > > > > > > > > > > > solution? > > > > > > > > > > I don't remember seeing anything. What have I > > > > > > > > > > missed? > > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@xxxxxxxxxx/ > > > > > > > > Mhm, I didn't got that in my inbox for some reason. > > > > > > > > > > > > > > > > Interesting approach, I'm just not sure if we can or > > > > > > > > should > > > > > > > > wait in > > > > > > > > drm_sched_fini(). > > > > > > We do agree that jobs still pending when drm_sched_fini() > > > > > > starts > > > > > > is > > > > > > always a bug, right? > > > > > > > > > > Correct, the question is how to avoid that. > > > > > > > > > > > If so, what are the disadvantages of waiting in > > > > > > drm_sched_fini()? > > > > > > We > > > > > > could block buggy drivers as I see it. Which wouldn't be > > > > > > good, > > > > > > but > > > > > > could then be fixed on drivers' site. > > > > > > > > > > Sima explained that pretty well: Don't block in fops->close, > > > > > do > > > > > that in > > > > > fops->flush instead. > > > > > > > > I agree that we shouldn't block in close(), but this > > > > effectively > > > > means that we > > > > need to reference count the scheduler, right? > > > > > > > > Otherwise, if we allow to just skip / interrupt the teardown, > > > > we > > > > can > > > > still leak > > > > memory. > > > > > > Having thought about it, I agree with Danilo. Having something > > > that > > > shall wait on green light, but can be interrupted, is no > > > guarantee > > > and > > > therefore not a feasible solution. > > > > > > To break down the solution space, these seem to be our options: > > > 1. We have something (either drm_sched_fini() or a helper, > > > e.g., > > > drm_sched_flush()) that definitely blocks until the pending > > > list > > > has become empty. > > > 2. We have jobs reference-count the scheduler, so the latter > > > can > > > outlive the driver and will be freed some time later. > > > > > > Can anyone think of a third solution? > > > > > > > > > Solution #1 has the problem of obviously blocking unconditionally > > > if > > > the driver didn't make sure that all fences will be signaled > > > within > > > reasonable time. In my opinion, this would actually be an > > > advantage, > > > because it will be *very* noticable and force users to repair > > > their > > > driver. The driver *has* to guarantee that all fences will be > > > signaled. > > > If the driver has to do fishy things, having the blocking > > > outsourced > > > to > > > the helper drm_sched_flush() would allow them to circumvent that. > > > > > > Solution #2 has the problem of backend_ops.free_job() potentially > > > using > > > driver-data after the driver is gone, causing UAF. So with this > > > solutions all drivers would have to be aware of the issue and > > > handle > > > it > > > through one of DRMs primitives dedicated to such problems. > > > > > > > > > Currently, all drivers either work around the problem internally > > > or > > > simply ignore it, it seems. > > > > > > So I'd argue that both solutions are an improvement over the > > > existing > > > situation. My preference would be #1. > > > > > > > > > Opinions? > > > > > > P. > > > > > > > > > > > > > > > > > One issue this solves is that when you send a SIGTERM the > > > > > tear > > > > > down > > > > > handling > > > > > first flushes all the FDs and then closes them. 
> > > > > > > > > > So if flushing the FDs blocks because the process initiated > > > > > sending > > > > > a > > > > > terabyte of data over a 300bps line (for example) you can > > > > > still > > > > > throw a > > > > > SIGKILL and abort that as well. > > > > > > > > > > If you would block in fops-close() that SIGKILL won't have > > > > > any > > > > > effect any > > > > > more because by the time close() is called the process is > > > > > gone > > > > > and > > > > > signals > > > > > are already blocked. > > > > > > > > > > And yes when I learned about that issue I was also buffed > > > > > that > > > > > handling like > > > > > this in the UNIX design is nearly 50 years old and still > > > > > applies > > > > > to > > > > > today. > > > > > > > > Probably better to make that a separate function, > > > > > > > > something > > > > > > > > like > > > > > > > > drm_sched_flush() or similar. > > > > > > We could do that. Such a function could then be called by > > > > > > drivers > > > > > > which > > > > > > are not sure whether all jobs are done before they start > > > > > > tearing > > > > > > down. > > > > > > > > > > Yes exactly that's the idea. And give that flush function a > > > > > return > > > > > code so > > > > > that it can return -EINTR. > > > > > > > > > > > > Yeah I don't think we should smash this into > > > > > > > drm_sched_fini > > > > > > > unconditionally. I think conceptually there's about three > > > > > > > cases: > > > > > > > > > > > > > > - Ringbuffer schedules. Probably want everything as-is, > > > > > > > because > > > > > > > drm_sched_fini is called long after all the entities > > > > > > > are > > > > > > > gone in > > > > > > > drm_device cleanup. > > > > > > > > > > > > > > - fw scheduler hardware with preemption support. There we > > > > > > > probably > > > > > > > want to > > > > > > > nuke the context by setting the tdr timeout to zero > > > > > > > (or > > > > > > > maybe just > > > > > > > as > > > > > > > long as context preemption takes to be efficient), and > > > > > > > relying on > > > > > > > the > > > > > > > normal gpu reset flow to handle things. > > > > > > > drm_sched_entity_flush > > > > > > > kinda > > > > > > > does this, except not really and it's a lot more > > > > > > > focused > > > > > > > on > > > > > > > the > > > > > > > ringbuffer context. So maybe we want a new > > > > > > > drm_sched_entity_kill. > > > > > > > > > > > > > > For this case calling drm_sched_fini() after the 1:1 > > > > > > > entity > > > > > > > is gone > > > > > > > should not find any linger jobs, it would actually be > > > > > > > a > > > > > > > bug > > > > > > > somewhere if > > > > > > > there's a job lingering. Maybe a sanity check that > > > > > > > there's > > > > > > > not just > > > > > > > no > > > > > > > jobs lingering, but also no entity left would be good > > > > > > > here? > > > > > > The check for lingering ones is in Christian's patch here > > > > > > IISC. > > > > > > At which position would you imagine the check for the > > > > > > entity > > > > > > being > > > > > > performed? > > > > > > > > > > > > > - fw scheduler without preemption support. There we kinda > > > > > > > need > > > > > > > the > > > > > > > drm_sched_flush, except blocking in fops->close is not > > > > > > > great. So > > > > > > > instead > > > > > > > I think the following is better: > > > > > > > 1. 
drm_sched_entity_stopped, which only stops new > > > > > > > submissions (for > > > > > > > paranoia) but doesn't tear down the entity > > > > > > Who would call that function? > > > > > > Drivers using it voluntarily could just as well stop > > > > > > accepting > > > > > > new jobs > > > > > > from userspace to their entities, couldn't they? > > > > > > > > > > > > > 2. drm_dev_get > > > > > > > 3. launch a worker which does a) drm_sched_flush (or > > > > > > > drm_sched_entity_flush or whatever we call it) b) > > > > > > > drm_sched_entity_fini > > > > > > > + drm_sched_fini c) drm_dev_put > > > > > > > > > > > > > > Note that semantically this implements the refcount in > > > > > > > the > > > > > > > other > > > > > > > path > > > > > > > from Phillip: > > > > > > > > > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@xxxxxxxxxx/ > > > > > > > Except it doesn't impose refcount on everyone else who > > > > > > > doesn't need > > > > > > > it, > > > > > > > and it doesn't even impose refcounting on drivers that > > > > > > > do > > > > > > > need it > > > > > > > because we use drm_sched_flush and a worker to achieve > > > > > > > the > > > > > > > same. > > > > > > I indeed wasn't happy with the refcount approach for that > > > > > > reason, > > > > > > agreed. > > > > > > > > > > > > > Essentially helper functions for the common use-cases > > > > > > > instead > > > > > > > of > > > > > > > trying to > > > > > > > solve them all by putting drm_sched_flush as a > > > > > > > potentially > > > > > > > very > > > > > > > blocking > > > > > > > function into drm_sched_fini. > > > > > > I'm still not able to see why it blocking would be > > > > > > undesired – > > > > > > as > > > > > > far > > > > > > as I can see, it is only invoked on driver teardown, so not > > > > > > during > > > > > > active operation. Teardown doesn't happen that often, and > > > > > > it > > > > > > can > > > > > > (if > > > > > > implemented correctly) only block until the driver's code > > > > > > has > > > > > > signaled > > > > > > the last fences. If that doesn't happen, the block would > > > > > > reveal > > > > > > a > > > > > > bug. > > > > > > > > > > > > But don't get me wrong: I don't want to *push* this > > > > > > solution. I > > > > > > just > > > > > > want to understand when it could become a problem. > > > > > > > > > > > > > > > > > > Wouldn't an explicitly blocking, separate function like > > > > > > drm_sched_flush() or drm_sched_fini_flush() be a small, > > > > > > doable > > > > > > step > > > > > > towards the right direction? > > > > > > > > > > I think that this is the right thing to do, yes. > > > > > > > > > > > > > > > > > When there are still entities with jobs the > > > > > > > > > > > > situation > > > > > > > > > > > > is > > > > > > > > > > > > even > > > > > > > > > > > > worse > > > > > > > > > > > > since the dma_fences for those jobs can never > > > > > > > > > > > > signal > > > > > > > > > > > > we can > > > > > > > > > > > > just > > > > > > > > > > > > choose between potentially locking up core > > > > > > > > > > > > memory > > > > > > > > > > > > management and > > > > > > > > > > > > random memory corruption. When drivers really > > > > > > > > > > > > mess > > > > > > > > > > > > it > > > > > > > > > > > > up > > > > > > > > > > > > that > > > > > > > > > > > > well > > > > > > > > > > > > let them run into a BUG_ON(). 
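Just to make sure we mean the same thing, here is how I read the worker-based flow sketched in steps 1-3 above, together with the flush helper. The example_* names, example_entity_stop() and example_sched_flush() are made up placeholders; neither a blocking drm_sched_flush() nor a drm_sched_entity_stopped() exist today:

#include <linux/slab.h>
#include <linux/workqueue.h>
#include <drm/drm_drv.h>
#include <drm/gpu_scheduler.h>

struct example_ctx {
        struct drm_device *drm;
        struct drm_sched_entity entity;
        struct drm_gpu_scheduler sched;
        struct work_struct teardown_work;
};

/*
 * Hypothetical placeholders for the primitives discussed above: a helper
 * that stops new submissions on the entity, and a flush that blocks until
 * the pending list has drained (and, per the -EINTR suggestion, could be
 * made killable and return an error when interrupted).
 */
static void example_entity_stop(struct drm_sched_entity *entity);
static int example_sched_flush(struct drm_gpu_scheduler *sched);

static void example_teardown_work(struct work_struct *work)
{
        struct example_ctx *ctx = container_of(work, struct example_ctx,
                                               teardown_work);
        struct drm_device *drm = ctx->drm;

        /* a) potentially long wait for all pending jobs to signal */
        example_sched_flush(&ctx->sched);
        /* b) now teardown cannot find lingering jobs */
        drm_sched_entity_fini(&ctx->entity);
        drm_sched_fini(&ctx->sched);
        kfree(ctx);
        /* c) release the reference that kept the drm_device alive */
        drm_dev_put(drm);
}

static void example_ctx_close(struct example_ctx *ctx)
{
        /* 1. stop accepting new submissions */
        example_entity_stop(&ctx->entity);
        /* 2. keep the device alive beyond fops->close() */
        drm_dev_get(ctx->drm);
        /* 3. defer the blocking part to a worker */
        INIT_WORK(&ctx->teardown_work, example_teardown_work);
        schedule_work(&ctx->teardown_work);
}

The important property is that fops->close() itself only stops the entity and schedules the worker, so a SIGKILLed process never blocks there; the worker is what eats the potentially unbounded wait.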
> > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Christian > > > > > > > > > > > > König<christian.koenig@xxxxxxx> > > > > > > > > > > > > --- > > > > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 19 > > > > > > > > > > > > ++++++++++++++++++- > > > > > > > > > > > > 1 file changed, 18 insertions(+), 1 > > > > > > > > > > > > deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git > > > > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644 > > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > @@ -1333,17 +1333,34 @@ void > > > > > > > > > > > > drm_sched_fini(struct > > > > > > > > > > > > drm_gpu_scheduler > > > > > > > > > > > > *sched) > > > > > > > > > > > I agree with Sima that it should first be > > > > > > > > > > > documented > > > > > > > > > > > in > > > > > > > > > > > the > > > > > > > > > > > function's > > > > > > > > > > > docstring what the user is expected to have done > > > > > > > > > > > before > > > > > > > > > > > calling the > > > > > > > > > > > function. > > > > > > > > > > Good point, going to update the documentation as > > > > > > > > > > well. > > > > > > > > > Cool thing, thx. > > > > > > > > > Would be great if everything (not totally trivial) > > > > > > > > > necessary to > > > > > > > > > be done > > > > > > > > > before _fini() is mentioned. > > > > > > > > > > > > > > > > > > One could also think about providing a hint at how > > > > > > > > > the > > > > > > > > > driver can > > > > > > > > > do > > > > > > > > > that. AFAICS the only way for the driver to ensure > > > > > > > > > that > > > > > > > > > is > > > > > > > > > to > > > > > > > > > maintain > > > > > > > > > its own, separate list of submitted jobs. > > > > > > > > Even with a duplicated pending list it's actually > > > > > > > > currently > > > > > > > > impossible to do > > > > > > > > this fully cleanly. > > > > > > > > > > > > > > > > The problem is that the dma_fence object gives no > > > > > > > > guarantee > > > > > > > > when > > > > > > > > callbacks > > > > > > > > are processed, e.g. they can be both processed from > > > > > > > > interrupt > > > > > > > > context as > > > > > > > > well as from a CPU which called > > > > > > > > dma_fence_is_signaled(). > > > > > > > > > > > > > > > > So when a driver (or drm_sched_fini) waits for the last > > > > > > > > submitted > > > > > > > > fence it > > > > > > > > actually can be that the drm_sched object still needs > > > > > > > > to do > > > > > > > > some > > > > > > > > processing. > > > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background > > > > > > > > on > > > > > > > > the > > > > > > > > problem. > > > > > > Oh dear ^^' > > > > > > We better work towards fixing that centrally > > > > > > > > > > > > Thanks, > > > > > > P. > > > > > > > > > > > > > > > > > > > So I thought this should be fairly easy because of the > > > > > > > sched > > > > > > > hw/public > > > > > > > fence split: If we wait for both all jobs to finish and > > > > > > > for > > > > > > > all > > > > > > > the > > > > > > > sched > > > > > > > work/tdr work to finish, and we make sure there's no > > > > > > > entity > > > > > > > existing > > > > > > > that's not yet stopped we should catch them all? > > > > > > > > > > Unfortunately not. 
> > > > > > > > > > Even when you do a dma_fence_wait() on the last submission it > > > > > can > > > > > still be > > > > > that another CPU is executing the callbacks to wake up the > > > > > scheduler work > > > > > item and cleanup the job. > > > > > > > > > > That's one of the reasons why I think the design of keeping > > > > > the > > > > > job > > > > > alive is > > > > > so extremely awkward. The dma_fence as representation of the > > > > > hw > > > > > submission > > > > > has a much better defined state machine and lifetime. > > > > > > > > > > Regards, > > > > > Christian. > > > > > > > > > > > > Or at least I think > > > > > > > it's > > > > > > > a bug if any other code even tries to touch the hw fence. > > > > > > > > > > > > > > If you have any other driver code which relies on the rcu > > > > > > > freeing > > > > > > > then I > > > > > > > think that's just a separate concern and we can ignore > > > > > > > that > > > > > > > here > > > > > > > since the > > > > > > > fences themselves will till get rcu-delay freed even if > > > > > > > drm_sched_fini has > > > > > > > finished. > > > > > > > -Sima > > > > > > > > > > > > > > > Regards, > > > > > > > > Christian. > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > drm_sched_wqueue_stop(sched); > > > > > > > > > > > > + /* > > > > > > > > > > > > + * Tearing down the scheduler wile there are > > > > > > > > > > > > still > > > > > > > > > > > > unprocessed jobs can > > > > > > > > > > > > + * lead to use after free issues in the > > > > > > > > > > > > scheduler > > > > > > > > > > > > fence. > > > > > > > > > > > > + */ > > > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list)); > > > > > > > > > > > > + > > > > > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < > > > > > > > > > > > > sched- > > > > > > > > > > > > > num_rqs; > > > > > > > > > > > > i++) > > > > > > > > > > > > { > > > > > > > > > > > > struct drm_sched_rq *rq = sched- > > > > > > > > > > > > > sched_rq[i]; > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > > > > > > - list_for_each_entry(s_entity, &rq- > > > > > > > > > > > > > entities, > > > > > > > > > > > > list) > > > > > > > > > > > > + list_for_each_entry(s_entity, &rq- > > > > > > > > > > > > > entities, > > > > > > > > > > > > list) { > > > > > > > > > > > > + /* > > > > > > > > > > > > + * The justification for this > > > > > > > > > > > > BUG_ON() > > > > > > > > > > > > is > > > > > > > > > > > > that tearing > > > > > > > > > > > > + * down the scheduler while jobs > > > > > > > > > > > > are > > > > > > > > > > > > pending > > > > > > > > > > > > leaves > > > > > > > > > > > > + * dma_fences unsignaled. Since we > > > > > > > > > > > > have > > > > > > > > > > > > dependencies > > > > > > > > > > > > + * from the core memory management > > > > > > > > > > > > to > > > > > > > > > > > > eventually signal > > > > > > > > > > > > + * dma_fences this can trivially > > > > > > > > > > > > lead to > > > > > > > > > > > > a > > > > > > > > > > > > system wide > > > > > > > > > > > > + * stop because of a locked up > > > > > > > > > > > > memory > > > > > > > > > > > > management. 
> > > > > > > > > > > > + */ > > > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity- > > > > > > > > > > > > > job_queue)); > > > > > > > > > > > > + > > > > > > > > > > > > /* > > > > > > > > > > > > * Prevents reinsertion and marks > > > > > > > > > > > > job_queue > > > > > > > > > > > > as idle, > > > > > > > > > > > > * it will removed from rq in > > > > > > > > > > > > drm_sched_entity_fini > > > > > > > > > > > > * eventually > > > > > > > > > > > > */ > > > > > > > > > > > > s_entity->stopped = true; > > > > > > > > > > > > + } > > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > > > > > > > kfree(sched->sched_rq[i]); > > > > > > > > > > > > } > > > > > > > > > >
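For the documentation update, this is roughly what I'd write down as the minimum a driver has to guarantee before calling drm_sched_fini() today. It's only a sketch with invented example_* names, and, per Christian's point above, even this is not fully race-free, since free_job() / fence callbacks may still be in flight after the last hw fence has signaled:

#include <drm/gpu_scheduler.h>

struct example_engine {
        struct drm_sched_entity entity;
        struct drm_gpu_scheduler sched;
};

/*
 * Hypothetical, driver-specific wait until every submitted hw fence has
 * signaled, e.g. by waiting on the last fence of a driver-side list of
 * in-flight jobs.
 */
static void example_wait_for_idle(struct example_engine *e);

static void example_engine_teardown(struct example_engine *e)
{
        /* 1. Flush and tear down the entity so no new jobs can be pushed. */
        drm_sched_entity_destroy(&e->entity);

        /* 2. Wait until the hardware has signaled everything it was given. */
        example_wait_for_idle(e);

        /*
         * 3. Only now should the pending list be empty, so the WARN_ON() /
         *    BUG_ON() added by the patch above stay quiet.
         */
        drm_sched_fini(&e->sched);
}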