Re: [PATCH v7 0/7] DRM scheduler kunit tests

Philipp Stanner <phasta@xxxxxxxxxxx> · Thu, 13 Mar 2025 09:34:56 +0100

On Wed, 2025-03-12 at 14:33 +0000, Tvrtko Ursulin wrote:
> 
> On 12/03/2025 13:49, Philipp Stanner wrote:
> 
> [snip]
> 
> > > > > There I see a UAF. Do you have an idea what that might be? I
> > > > > would only
> > > > > expect memory leaks and the test assertions failing.
> > > > 
> > > > It is expected all hell to break loose in aspirational mode.
> > > > There
> > > > the
> > > > mock scheduler shutdown relies solely on drm_sched_fini. So all
> > > > tests
> > > > which do not fully idle the mock scheduler themselves, whether
> > > > implicity
> > > > or explicitly, will explode in various ways.
> > > 
> > > To clarify I made it like that because that is what I understood
> > > was
> > > your vision for drm_sched_fini. That it would be able to clean
> > > everything up etc.
> > > 
> > > If you meant something different we need to adjust it. Although
> > > it
> > > could
> > > be done later as well.
> > 
> > My primary intention was to use those tests to see whether my
> > memory
> > leak fix works or not.
> 
> Okay here I don't know exactly what are you changing and how you are 
> testing to comment.
> 
> > Thus I'd need something like in v4, where the tests run as intended
> > but
> > leak the jobs in the pending_list. Would indeed be neat if they
> > also
> > check for list_len(pending_list) so that you don't have to run
> > kmemleak
> > manually.
> > 
> > And since we shouldn't have such leaks in production tests the idea
> > of
> > optionally enabling them arose.
> > 
> > More generally speaking, I think that the tests should behave as we
> > expect drivers to behave *currently*, while we have the option for
> > the
> > tests to purposefully misbehave. In my case this means: just leak
> > the
> > jobs and optionally tell about the leaks.
> > 
> > When used as intended, the scheduler currently AFAIK doesn't have
> > UAF
> > or nullptrderefs, so they shouldn't occur in tests that use it as
> > intended. Now the leaks are special because we never intended them.
> > 
> > So what surprised me is that, when compared to v4, this v7 here
> > actually now also has UAF. Besides not cleaning up leaks, what are
> > you
> > doing in aspiritional mode?
> > 
> > Let's talk it through before you invest your time into v8
> 
> 1)
> 
> "Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is 
> exercising the scheduler how drivers are supposed to use it today.

Yes, it is fine.

> 
> 2)
> 
> Aspirational mode I added on your request and you can define how you 
> want it to work.

My personal intention is simple: have the jobs leaked like in v4,
without UAF or faults and the like.

> 
> Lets have a look in drm_mock_sched_fini() and see that it is pretty 
> straightforward. First I'll pre-process and annotate the normal
> version:
> 
> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> {
> 	struct drm_mock_sched_job *job, *next;
> 	struct kunit  *test = sched->test;
> 	unsigned long flags;
> 	LIST_HEAD(signal);
> 
> 	// Stop the scheduler workqueues so nothing further gets
> scheduled
> 	drm_sched_wqueue_stop(&sched->base);
> 
> 	// Idle the mock scheduler in-flight jobs:
> 
> 	// First move them to a local list so any in parallel
> "hardware" 
> activity does not see them any more
> 
> 	spin_lock_irqsave(&sched->lock, flags);
> 	list_for_each_entry_safe(job, next, &sched->job_list, link)
> 		list_move_tail(&job->link, &signal);
> 	spin_unlock_irqrestore(&sched->lock, flags);
> 
> 	// Now for jobs which "hardware" hasn't processed yet we do:
> 	//
> 	// 1. Cancel the completion timer
> 	// 2. Mark the job as complete (signal the hw fence, wake up
> waiters)
> 	// 3. Release the hardware fence (list had a reference)
> 	// 4. Free the job itself
> 
> 	list_for_each_entry(job, &signal, link) {
> 		hrtimer_cancel(&job->timer);
> 		drm_mock_sched_job_complete(job);
> 		dma_fence_put(&job->hw_fence);
> 		drm_sched_job_cleanup(&job->base);
> 	}
> 
> 	// Finally drm_sched_fini
> 	drm_sched_fini(&sched->base);
> }
> 
> In aspirational mode it becomes this:
> 
> void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> {
> 	struct kunit  *test = sched->test;
> 
> 	drm_sched_fini(&sched->base);
> 
> 	KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
> 	KUNIT_ASSERT_TRUE(test, list_empty(&sched-
> >base.pending_list));
> }
> 
> So relies on drm_sched_fini to cleanup _everything_. Including the 
> driver (mock scheduler) state. Which is what I understood you wanted.

Thx for the detailed explanation. _Everything_ seems a lot, though.
Even in this version, you still stop submission to your entities, don't
you?

IOW: is drm_sched_fini(), in aspirational mode, responsible for more
than preventing the leaks (and for what it is already doing in mainline
linux)?

> 
> For example drm_sched_basic_entity_cleanup test case will exit (kill
> the 
> entities and tear down the scheduler) with half of the submitted jobs
> still in flight. That's one example of what will be caught here by
> the 
> asserts and also UAF since kunit test will just exit and free up
> memory 
> regardless.

Hmm. OK, so kunit does the free. But who does the access?
drm_sched_fini() still stops all the work items.

> 
> If this isn't what you had in mind you can either a) tell me what you
> want maybe I can tweak it*, or b) we can drop the aspirational patch
> for 
> now.

So, here's the bad news. I had a chat with Danilo yesterday who
reminded me of the fact that there are testers out there (like the
Intel bots) which do random kconfigs - so avoiding that with another
kconfig option does not solve the problem.

Thus, I think the way to go is to indeed drop aspirational mode and
whoever works on Schrödinger-Problems (like myself) has to tweak the
tests locally.

That's probably even more productive process-wise, otherwise we'd have
to remove broken aspects of (or the entire) aspirational mode every
time that we solved a known problem.

Unless you have a better proposal, let's drop that mode.

Sorry for the zig-zag, but that wasn't on my / our radar

P.

> 
> *)
> For example it would be possible to avoid UAFs by moving away from
> kunit 
> managed allocations and just leak memory. Plus, it would also be 
> required to cancel the timers or so.
> 
> For running in the VM or UML cases, which I thought were typical, it 
> wouldn't be a benefit, but if you are worried about running on the
> host 
> kernel and expect no crashes, then that aspect would need to change.
> 
> Regards,
> 
> Tvrtko
>