On Mon, 2025-03-10 at 11:54 +0000, Tvrtko Ursulin wrote:
> 
> On 10/03/2025 11:11, Philipp Stanner wrote:
> > On Mon, 2025-03-10 at 09:55 +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 07/03/2025 18:06, Philipp Stanner wrote:
> > > > On Fri, 2025-03-07 at 16:59 +0000, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 07/03/2025 13:41, Philipp Stanner wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > You forgot to put folks in CC as recipients for the cover letter :(
> > > > > > 
> > > > > > On Thu, 2025-03-06 at 17:05 +0000, Tvrtko Ursulin wrote:
> > > > > > > Move some options out into a new debug specific kconfig file in
> > > > > > > order to make things a bit cleaner.
> > > > > > > 
> > > > > > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx>
> > > > > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > > > > Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
> > > > > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > > Cc: Philipp Stanner <phasta@xxxxxxxxxx>
> > > > > > 
> > > > > > We all have our individual workflows, so don't take this as
> > > > > > lecturing or anything – I just suspect that I was forgotten in the
> > > > > > cover letter because you Cc people by hand in the individual
> > > > > > patches.
> > > > > > 
> > > > > > What I do is that I run get_maintainer and then put the individuals
> > > > > > listed there into the --to= field. That sends the entire series to
> > > > > > all of them.
> > > > > > 
> > > > > > Only sometimes, when there's a huge list of recipients or when the
> > > > > > patches of a series are very independent, I deviate from that rule.
> > > > > > 
> > > > > > JFYI
> > > > > 
> > > > > Notice it was there in v3, I just omitted to paste it this time.
> > > > > 
> > > > > > Anyways, we have a bigger problem about the entire series. I now
> > > > > > tested again with the same setup as yesterday and the faults are
> > > > > > indeed gone, so that's good.
> > > > > > 
> > > > > > But to be sure I then ran kmemleak and got a list of leaks that is
> > > > > > more than 2000 lines long.
> > > > > 
> > > > > There is this comment for drm_sched_fini which ends with:
> > > > > 
> > > > > """
> > > > > ...
> > > > >  * This stops submission of new jobs to the hardware through
> > > > >  * drm_sched_backend_ops.run_job(). Consequently,
> > > > >  * drm_sched_backend_ops.free_job() will not be called for all jobs
> > > > >  * still in drm_gpu_scheduler.pending_list. There is no solution for
> > > > >  * this currently. Thus, it is up to the driver to make sure that:
> > > > >  *
> > > > >  * a) drm_sched_fini() is only called after for all submitted jobs
> > > > >  *    drm_sched_backend_ops.free_job() has been called or that
> > > > >  * b) the jobs for which drm_sched_backend_ops.free_job() has not
> > > > >  *    been called after drm_sched_fini() ran are freed manually.
> > > > >  *
> > > > >  * FIXME: Take care of the above problem and prevent this function
> > > > >  * from leaking the jobs in drm_gpu_scheduler.pending_list under any
> > > > >  * circumstances.
> > > > > """ > > > > > > > > > > I got bitten by that. Keep forgetting how fragile the thing > > > > > is.. > > > > > :( > > > > > > > > argh damn, those are *all* from the pending_list?! > > > > > > Right, all leaks I saw were from the > > > drm_sched_basic_entity_cleanup > > > test. All other tests actually wait for jobs to finish so can't > > > hit > > > that. > > > > > > Fix was simply to add a drm_sched_job_cleanup call when unwinding > > > unfinished mock scheduler jobs from drm_mock_sched_fini, which > > > happens > > > before calling drm_sched_fini. > > > > > > That's pretty much how things are expected to be handled AFAIU. > > > > > > > OK. Well. > > > > > > > > Now we've got a philosophical problem: > > > > > > > > We still have to fix those leaks (I'm still working on it, but > > > > my > > > > current attempt has failed and I probably fall back to a > > > > refcount > > > > solution). > > > > > > You propose to move the responsibility of cleaning up in-flight > > > jobs > > > to > > > the scheduler core? > > > > The scheduler core is already and has always been responsible for > > cleaning up "in-flight jobs". It does so through > > backend_ops.free_job(). And we prevent it from cleaning up all jobs > > by > > cancelling the work items in drm_sched_fini(). > > > > Semantically, the scheduler is the one in charge of the job life > > times. > > > > As of right now, every single driver is effectively forced to > > implement > > the same logic, but they have implemented it in different ways (Xe > > refcounts the scheduler and only calls drm_sched_fini() once refcnt > > == > > 0, Nouveau maintains a copy of the pending_list, blocking for it to > > become empty before calling drm_sched_fini()) > > Right. And to change it means making ->free_job() for all drivers > handle > different potential job states, while today it only needs to handle > finished jobs. Or adding a new vfunc. Or something. It sounds doable > but > a lot of work, not least because there is a lot of drivers. I know the pain, I'm working on that since ~November. I plan to propose a new solution "soon"(tm) > > > > > And to see whether the fix actually fixes the leaks, directly > > > > using > > > > the > > > > kunit tests would be handy. > > > > > > > > After all, this is what the kunit tests are there for: show > > > > what is > > > > broken within the scheduler. And those leaks definitely > > > > qualify. Or > > > > should kunit tests follow the same rules we demand from > > > > drivers? > > > > > > > > I'd like to hear more opinions about that. > > > > > > > > @Danilo, @Dave, @Sima > > > > would it be OK if we add kunit tests for the scheduler to DRM > > > > that > > > > cause leaks until we can fix them? > > > > > > It is indeed a bit philosophical. I'd say only if there is a 100% > > > agreement that drm_sched_fini should be able to clean up, or > > > drive > > > cleaning up, all driver state. And if we are prepared to handle a > > > permanently failing test from now to some future date when this > > > would > > > be > > > implemented. > > > > > > I have a similar conundrum with set priority, where I was > > > contemplating > > > to add a permanently failing test showing how that does not fully > > > work, > > > and then get improved with my deadline scheduling series. > > > > > > On the other side of the argument is the past experience of CI > > > systems > > > generally not coping well with permanently failing test. > > > Eventually > > > they > > > succumb to the pressure to remove them due noisy results. 
> > > Therefore the other option is to have the mock scheduler adhere to the
> > > current implementation and only change it once the DRM scheduler rules
> > > change.
> > 
> > Can you think of a way, like flags or kconfig options, with which
> > developers such as you and I could "switch the bugs on" for working on
> > those issues?
> 
> We could do that easily I think. Something like:
> 
> config DRM_SCHED_KUNIT_TEST_ASPIRATIONAL
>         bool "Turn on the aspirational mode for DRM scheduler unit tests" if !KUNIT_ALL_TESTS
>         select DRM_SCHED
>         depends on DRM && KUNIT && DRM_SCHED_KUNIT_TEST
>         default KUNIT_ALL_TESTS
>         help
>           Choose this option to make the DRM scheduler unit tests test for
>           behaviour which was agreed as a design goal, even if the current
>           implementation can make specific tests fail.
> 
>           Recommended for driver developers only.
> 
>           If in doubt, say "N".

If you can work out something like this, that would be fantastic and
solve all the aforementioned problems.

> I can skip the job cleanup based on it and also add some validation that
> the pending list is empty after drm_sched_fini if on.

Sounds handy. That way the developer doesn't even have to use kmemleak.

P.

> 
> Regards,
> 
> Tvrtko
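
For illustration, a rough sketch of the kind of unwinding Tvrtko describes
above, i.e. running drm_sched_job_cleanup() on jobs the mock "hardware" never
completed before calling drm_sched_fini(). The struct and field names
(mock_scheduler, mock_sched_job, job_list, link) are made-up placeholders and
do not match the actual mock scheduler in the series; only
drm_sched_job_cleanup(), drm_sched_fini() and the list helpers are real
kernel APIs:

/* Hypothetical sketch only -- not the actual mock scheduler code. */
#include <linux/list.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

struct mock_sched_job {
        struct drm_sched_job base;      /* embedded real scheduler job */
        struct list_head link;          /* tracked by the mock "hardware" */
};

struct mock_scheduler {
        struct drm_gpu_scheduler base;
        struct list_head job_list;      /* jobs handed to run_job() */
};

static void mock_sched_fini(struct mock_scheduler *sched)
{
        struct mock_sched_job *job, *next;

        /*
         * Unwind jobs the mock hardware never completed: release the
         * scheduler-side resources with drm_sched_job_cleanup() and free
         * each job, so drm_sched_fini() does not leak them from
         * pending_list (see the FIXME quoted above).
         */
        list_for_each_entry_safe(job, next, &sched->job_list, link) {
                list_del(&job->link);
                drm_sched_job_cleanup(&job->base);
                kfree(job);
        }

        drm_sched_fini(&sched->base);
}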
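
And a sketch of how the proposed aspirational Kconfig switch could be wired
into that same teardown path, building on the placeholder types above. Again
the names are hypothetical; IS_ENABLED(), drm_sched_fini() and
drm_gpu_scheduler.pending_list are the only real pieces, and
mock_sched_unwind_jobs() stands in for the cleanup loop from the previous
sketch:

static void mock_sched_fini(struct mock_scheduler *sched)
{
        /*
         * In aspirational mode, do not paper over the problem: leave the
         * unfinished jobs to drm_sched_fini() and check afterwards that it
         * really drained pending_list.
         */
        if (!IS_ENABLED(CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL))
                mock_sched_unwind_jobs(sched);  /* hypothetical helper */

        drm_sched_fini(&sched->base);

        if (IS_ENABLED(CONFIG_DRM_SCHED_KUNIT_TEST_ASPIRATIONAL))
                WARN_ON(!list_empty(&sched->base.pending_list));
}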