On Wed, 2025-03-12 at 14:33 +0000, Tvrtko Ursulin wrote: > > On 12/03/2025 13:49, Philipp Stanner wrote: > > [snip] > > > > > > There I see a UAF. Do you have an idea what that might be? I > > > > > would only > > > > > expect memory leaks and the test assertions failing. > > > > > > > > It is expected all hell to break loose in aspirational mode. > > > > There > > > > the > > > > mock scheduler shutdown relies solely on drm_sched_fini. So all > > > > tests > > > > which do not fully idle the mock scheduler themselves, whether > > > > implicity > > > > or explicitly, will explode in various ways. > > > > > > To clarify I made it like that because that is what I understood > > > was > > > your vision for drm_sched_fini. That it would be able to clean > > > everything up etc. > > > > > > If you meant something different we need to adjust it. Although > > > it > > > could > > > be done later as well. > > > > My primary intention was to use those tests to see whether my > > memory > > leak fix works or not. > > Okay here I don't know exactly what are you changing and how you are > testing to comment. > > > Thus I'd need something like in v4, where the tests run as intended > > but > > leak the jobs in the pending_list. Would indeed be neat if they > > also > > check for list_len(pending_list) so that you don't have to run > > kmemleak > > manually. > > > > And since we shouldn't have such leaks in production tests the idea > > of > > optionally enabling them arose. > > > > More generally speaking, I think that the tests should behave as we > > expect drivers to behave *currently*, while we have the option for > > the > > tests to purposefully misbehave. In my case this means: just leak > > the > > jobs and optionally tell about the leaks. > > > > When used as intended, the scheduler currently AFAIK doesn't have > > UAF > > or nullptrderefs, so they shouldn't occur in tests that use it as > > intended. Now the leaks are special because we never intended them. > > > > So what surprised me is that, when compared to v4, this v7 here > > actually now also has UAF. Besides not cleaning up leaks, what are > > you > > doing in aspiritional mode? > > > > Let's talk it through before you invest your time into v8 > > 1) > > "Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is > exercising the scheduler how drivers are supposed to use it today. Yes, it is fine. > > 2) > > Aspirational mode I added on your request and you can define how you > want it to work. My personal intention is simple: have the jobs leaked like in v4, without UAF or faults and the like. > > Lets have a look in drm_mock_sched_fini() and see that it is pretty > straightforward. First I'll pre-process and annotate the normal > version: > > void drm_mock_sched_fini(struct drm_mock_scheduler *sched) > { > struct drm_mock_sched_job *job, *next; > struct kunit *test = sched->test; > unsigned long flags; > LIST_HEAD(signal); > > // Stop the scheduler workqueues so nothing further gets > scheduled > drm_sched_wqueue_stop(&sched->base); > > // Idle the mock scheduler in-flight jobs: > > // First move them to a local list so any in parallel > "hardware" > activity does not see them any more > > spin_lock_irqsave(&sched->lock, flags); > list_for_each_entry_safe(job, next, &sched->job_list, link) > list_move_tail(&job->link, &signal); > spin_unlock_irqrestore(&sched->lock, flags); > > // Now for jobs which "hardware" hasn't processed yet we do: > // > // 1. Cancel the completion timer > // 2. Mark the job as complete (signal the hw fence, wake up > waiters) > // 3. Release the hardware fence (list had a reference) > // 4. Free the job itself > > list_for_each_entry(job, &signal, link) { > hrtimer_cancel(&job->timer); > drm_mock_sched_job_complete(job); > dma_fence_put(&job->hw_fence); > drm_sched_job_cleanup(&job->base); > } > > // Finally drm_sched_fini > drm_sched_fini(&sched->base); > } > > In aspirational mode it becomes this: > > void drm_mock_sched_fini(struct drm_mock_scheduler *sched) > { > struct kunit *test = sched->test; > > drm_sched_fini(&sched->base); > > KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list)); > KUNIT_ASSERT_TRUE(test, list_empty(&sched- > >base.pending_list)); > } > > So relies on drm_sched_fini to cleanup _everything_. Including the > driver (mock scheduler) state. Which is what I understood you wanted. Thx for the detailed explanation. _Everything_ seems a lot, though. Even in this version, you still stop submission to your entities, don't you? IOW: is drm_sched_fini(), in aspirational mode, responsible for more than preventing the leaks (and for what it is already doing in mainline linux)? > > For example drm_sched_basic_entity_cleanup test case will exit (kill > the > entities and tear down the scheduler) with half of the submitted jobs > still in flight. That's one example of what will be caught here by > the > asserts and also UAF since kunit test will just exit and free up > memory > regardless. Hmm. OK, so kunit does the free. But who does the access? drm_sched_fini() still stops all the work items. > > If this isn't what you had in mind you can either a) tell me what you > want maybe I can tweak it*, or b) we can drop the aspirational patch > for > now. So, here's the bad news. I had a chat with Danilo yesterday who reminded me of the fact that there are testers out there (like the Intel bots) which do random kconfigs - so avoiding that with another kconfig option does not solve the problem. Thus, I think the way to go is to indeed drop aspirational mode and whoever works on Schrödinger-Problems (like myself) has to tweak the tests locally. That's probably even more productive process-wise, otherwise we'd have to remove broken aspects of (or the entire) aspirational mode every time that we solved a known problem. Unless you have a better proposal, let's drop that mode. Sorry for the zig-zag, but that wasn't on my / our radar P. > > *) > For example it would be possible to avoid UAFs by moving away from > kunit > managed allocations and just leak memory. Plus, it would also be > required to cancel the timers or so. > > For running in the VM or UML cases, which I thought were typical, it > wouldn't be a benefit, but if you are worried about running on the > host > kernel and expect no crashes, then that aspect would need to change. > > Regards, > > Tvrtko >