Re: [PATCH v7 0/7] DRM scheduler kunit tests

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxx> · Wed, 12 Mar 2025 14:33:41 +0000

On 12/03/2025 13:49, Philipp Stanner wrote:

[snip]

There I see a UAF. Do you have an idea what that might be? I
would only
expect memory leaks and the test assertions failing.

It is expected all hell to break loose in aspirational mode. There
the
mock scheduler shutdown relies solely on drm_sched_fini. So all
tests
which do not fully idle the mock scheduler themselves, whether
implicity
or explicitly, will explode in various ways.

To clarify I made it like that because that is what I understood was
your vision for drm_sched_fini. That it would be able to clean
everything up etc.

If you meant something different we need to adjust it. Although it
could
be done later as well.

My primary intention was to use those tests to see whether my memory
leak fix works or not.

Okay here I don't know exactly what are you changing and how you are 
testing to comment.

Thus I'd need something like in v4, where the tests run as intended but
leak the jobs in the pending_list. Would indeed be neat if they also
check for list_len(pending_list) so that you don't have to run kmemleak
manually.

And since we shouldn't have such leaks in production tests the idea of
optionally enabling them arose.

More generally speaking, I think that the tests should behave as we
expect drivers to behave *currently*, while we have the option for the
tests to purposefully misbehave. In my case this means: just leak the
jobs and optionally tell about the leaks.

When used as intended, the scheduler currently AFAIK doesn't have UAF
or nullptrderefs, so they shouldn't occur in tests that use it as
intended. Now the leaks are special because we never intended them.

So what surprised me is that, when compared to v4, this v7 here
actually now also has UAF. Besides not cleaning up leaks, what are you
doing in aspiritional mode?

Let's talk it through before you invest your time into v8

1)

"Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is 
exercising the scheduler how drivers are supposed to use it today.

2)

Aspirational mode I added on your request and you can define how you 
want it to work.

Lets have a look in drm_mock_sched_fini() and see that it is pretty 
straightforward. First I'll pre-process and annotate the normal version:

void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
{
	struct drm_mock_sched_job *job, *next;
	struct kunit  *test = sched->test;
	unsigned long flags;
	LIST_HEAD(signal);

	// Stop the scheduler workqueues so nothing further gets scheduled
	drm_sched_wqueue_stop(&sched->base);

	// Idle the mock scheduler in-flight jobs:

	// First move them to a local list so any in parallel "hardware" 
activity does not see them any more

	spin_lock_irqsave(&sched->lock, flags);
	list_for_each_entry_safe(job, next, &sched->job_list, link)
		list_move_tail(&job->link, &signal);
	spin_unlock_irqrestore(&sched->lock, flags);

	// Now for jobs which "hardware" hasn't processed yet we do:
	//
	// 1. Cancel the completion timer
	// 2. Mark the job as complete (signal the hw fence, wake up waiters)
	// 3. Release the hardware fence (list had a reference)
	// 4. Free the job itself

	list_for_each_entry(job, &signal, link) {
		hrtimer_cancel(&job->timer);
		drm_mock_sched_job_complete(job);
		dma_fence_put(&job->hw_fence);
		drm_sched_job_cleanup(&job->base);
	}

	// Finally drm_sched_fini
	drm_sched_fini(&sched->base);
}

In aspirational mode it becomes this:

void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
{
	struct kunit  *test = sched->test;

	drm_sched_fini(&sched->base);

	KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list));
	KUNIT_ASSERT_TRUE(test, list_empty(&sched->base.pending_list));
}

So relies on drm_sched_fini to cleanup _everything_. Including the 
driver (mock scheduler) state. Which is what I understood you wanted.

For example drm_sched_basic_entity_cleanup test case will exit (kill the 
entities and tear down the scheduler) with half of the submitted jobs 
still in flight. That's one example of what will be caught here by the 
asserts and also UAF since kunit test will just exit and free up memory 
regardless.

If this isn't what you had in mind you can either a) tell me what you 
want maybe I can tweak it*, or b) we can drop the aspirational patch for 
now.

*)
For example it would be possible to avoid UAFs by moving away from kunit 
managed allocations and just leak memory. Plus, it would also be 
required to cancel the timers or so.

For running in the VM or UML cases, which I thought were typical, it 
wouldn't be a benefit, but if you are worried about running on the host 
kernel and expect no crashes, then that aspect would need to change.

Regards,

Tvrtko