Re: [PATCH v7 0/7] DRM scheduler kunit tests

Philipp Stanner <phasta@xxxxxxxxxxx> · Thu, 13 Mar 2025 11:29:57 +0100

On Thu, 2025-03-13 at 10:15 +0000, Tvrtko Ursulin wrote:
> 
> On 13/03/2025 08:34, Philipp Stanner wrote:
> > On Wed, 2025-03-12 at 14:33 +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 12/03/2025 13:49, Philipp Stanner wrote:
> > > 
> > > [snip]
> > > 
> > > > > > > There I see a UAF. Do you have an idea what that might
> > > > > > > be? I
> > > > > > > would only
> > > > > > > expect memory leaks and the test assertions failing.
> > > > > > 
> > > > > > It is expected all hell to break loose in aspirational
> > > > > > mode.
> > > > > > There
> > > > > > the
> > > > > > mock scheduler shutdown relies solely on drm_sched_fini. So
> > > > > > all
> > > > > > tests
> > > > > > which do not fully idle the mock scheduler themselves,
> > > > > > whether
> > > > > > implicity
> > > > > > or explicitly, will explode in various ways.
> > > > > 
> > > > > To clarify I made it like that because that is what I
> > > > > understood
> > > > > was
> > > > > your vision for drm_sched_fini. That it would be able to
> > > > > clean
> > > > > everything up etc.
> > > > > 
> > > > > If you meant something different we need to adjust it.
> > > > > Although
> > > > > it
> > > > > could
> > > > > be done later as well.
> > > > 
> > > > My primary intention was to use those tests to see whether my
> > > > memory
> > > > leak fix works or not.
> > > 
> > > Okay here I don't know exactly what are you changing and how you
> > > are
> > > testing to comment.
> > > 
> > > > Thus I'd need something like in v4, where the tests run as
> > > > intended
> > > > but
> > > > leak the jobs in the pending_list. Would indeed be neat if they
> > > > also
> > > > check for list_len(pending_list) so that you don't have to run
> > > > kmemleak
> > > > manually.
> > > > 
> > > > And since we shouldn't have such leaks in production tests the
> > > > idea
> > > > of
> > > > optionally enabling them arose.
> > > > 
> > > > More generally speaking, I think that the tests should behave
> > > > as we
> > > > expect drivers to behave *currently*, while we have the option
> > > > for
> > > > the
> > > > tests to purposefully misbehave. In my case this means: just
> > > > leak
> > > > the
> > > > jobs and optionally tell about the leaks.
> > > > 
> > > > When used as intended, the scheduler currently AFAIK doesn't
> > > > have
> > > > UAF
> > > > or nullptrderefs, so they shouldn't occur in tests that use it
> > > > as
> > > > intended. Now the leaks are special because we never intended
> > > > them.
> > > > 
> > > > So what surprised me is that, when compared to v4, this v7 here
> > > > actually now also has UAF. Besides not cleaning up leaks, what
> > > > are
> > > > you
> > > > doing in aspiritional mode?
> > > > 
> > > > Let's talk it through before you invest your time into v8
> > > 
> > > 1)
> > > 
> > > "Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is
> > > exercising the scheduler how drivers are supposed to use it
> > > today.
> > 
> > Yes, it is fine.
> > 
> > > 
> > > 2)
> > > 
> > > Aspirational mode I added on your request and you can define how
> > > you
> > > want it to work.
> > 
> > My personal intention is simple: have the jobs leaked like in v4,
> > without UAF or faults and the like.
> > 
> > > 
> > > Lets have a look in drm_mock_sched_fini() and see that it is
> > > pretty
> > > straightforward. First I'll pre-process and annotate the normal
> > > version:
> > > 
> > > void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> > > {
> > > 	struct drm_mock_sched_job *job, *next;
> > > 	struct kunit  *test = sched->test;
> > > 	unsigned long flags;
> > > 	LIST_HEAD(signal);
> > > 
> > > 	// Stop the scheduler workqueues so nothing further gets
> > > scheduled
> > > 	drm_sched_wqueue_stop(&sched->base);
> > > 
> > > 	// Idle the mock scheduler in-flight jobs:
> > > 
> > > 	// First move them to a local list so any in parallel
> > > "hardware"
> > > activity does not see them any more
> > > 
> > > 	spin_lock_irqsave(&sched->lock, flags);
> > > 	list_for_each_entry_safe(job, next, &sched->job_list,
> > > link)
> > > 		list_move_tail(&job->link, &signal);
> > > 	spin_unlock_irqrestore(&sched->lock, flags);
> > > 
> > > 	// Now for jobs which "hardware" hasn't processed yet we
> > > do:
> > > 	//
> > > 	// 1. Cancel the completion timer
> > > 	// 2. Mark the job as complete (signal the hw fence,
> > > wake up
> > > waiters)
> > > 	// 3. Release the hardware fence (list had a reference)
> > > 	// 4. Free the job itself
> > > 
> > > 	list_for_each_entry(job, &signal, link) {
> > > 		hrtimer_cancel(&job->timer);
> > > 		drm_mock_sched_job_complete(job);
> > > 		dma_fence_put(&job->hw_fence);
> > > 		drm_sched_job_cleanup(&job->base);
> > > 	}
> > > 
> > > 	// Finally drm_sched_fini
> > > 	drm_sched_fini(&sched->base);
> > > }
> > > 
> > > In aspirational mode it becomes this:
> > > 
> > > void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
> > > {
> > > 	struct kunit  *test = sched->test;
> > > 
> > > 	drm_sched_fini(&sched->base);
> > > 
> > > 	KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
> > > 	KUNIT_ASSERT_TRUE(test, list_empty(&sched-
> > > > base.pending_list));
> > > }
> > > 
> > > So relies on drm_sched_fini to cleanup _everything_. Including
> > > the
> > > driver (mock scheduler) state. Which is what I understood you
> > > wanted.
> > 
> > Thx for the detailed explanation. _Everything_ seems a lot, though.
> > Even in this version, you still stop submission to your entities,
> > don't
> > you?
> > 
> > IOW: is drm_sched_fini(), in aspirational mode, responsible for
> > more
> > than preventing the leaks (and for what it is already doing in
> > mainline
> > linux)?
> 
> Yes it is, it would need to idle and free all driver state. Again,
> this 
> is I understood you wanted to work towards, but as this thread will 
> conclude as dropping it for now it doesn't matter.
> 
> > > For example drm_sched_basic_entity_cleanup test case will exit
> > > (kill
> > > the
> > > entities and tear down the scheduler) with half of the submitted
> > > jobs
> > > still in flight. That's one example of what will be caught here
> > > by
> > > the
> > > asserts and also UAF since kunit test will just exit and free up
> > > memory
> > > regardless.
> > 
> > Hmm. OK, so kunit does the free. But who does the access?
> > drm_sched_fini() still stops all the work items.
> 
> In-flight jobs. Completion timers will fire, simulating hardware 
> interrupts firing with a real driver.

Ah well, no, that I hadn't in mind ;)

I'll simply use your v4 to test my leak fixes. That'll be sufficient
for now.

> 
> > > If this isn't what you had in mind you can either a) tell me what
> > > you
> > > want maybe I can tweak it*, or b) we can drop the aspirational
> > > patch
> > > for
> > > now.
> > 
> > 
> > So, here's the bad news. I had a chat with Danilo yesterday who
> > reminded me of the fact that there are testers out there (like the
> > Intel bots) which do random kconfigs - so avoiding that with
> > another
> > kconfig option does not solve the problem.
> > 
> > Thus, I think the way to go is to indeed drop aspirational mode and
> > whoever works on Schrödinger-Problems (like myself) has to tweak
> > the
> > tests locally.
> > 
> > That's probably even more productive process-wise, otherwise we'd
> > have
> > to remove broken aspects of (or the entire) aspirational mode every
> > time that we solved a known problem.
> > 
> > Unless you have a better proposal, let's drop that mode.
> > 
> > 
> > Sorry for the zig-zag, but that wasn't on my / our radar
> 
> No worries. Can you review the v7 as is or you want me to send v8? I
> am 
> pretty sure aspirational mode patch is standalone and could be simply
> dropped when merging.

Send a v8 please, that's easier. Don't forget Christian's ACK for the
appropriate ones

Thank you
P.

> 
> Regards,
> 
> Tvrtko
> 
> > > *)
> > > For example it would be possible to avoid UAFs by moving away
> > > from
> > > kunit
> > > managed allocations and just leak memory. Plus, it would also be
> > > required to cancel the timers or so.
> > > 
> > > For running in the VM or UML cases, which I thought were typical,
> > > it
> > > wouldn't be a benefit, but if you are worried about running on
> > > the
> > > host
> > > kernel and expect no crashes, then that aspect would need to
> > > change.
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > 
>