On Thu, 2025-03-13 at 10:15 +0000, Tvrtko Ursulin wrote: > > On 13/03/2025 08:34, Philipp Stanner wrote: > > On Wed, 2025-03-12 at 14:33 +0000, Tvrtko Ursulin wrote: > > > > > > On 12/03/2025 13:49, Philipp Stanner wrote: > > > > > > [snip] > > > > > > > > > > There I see a UAF. Do you have an idea what that might > > > > > > > be? I > > > > > > > would only > > > > > > > expect memory leaks and the test assertions failing. > > > > > > > > > > > > It is expected all hell to break loose in aspirational > > > > > > mode. > > > > > > There > > > > > > the > > > > > > mock scheduler shutdown relies solely on drm_sched_fini. So > > > > > > all > > > > > > tests > > > > > > which do not fully idle the mock scheduler themselves, > > > > > > whether > > > > > > implicity > > > > > > or explicitly, will explode in various ways. > > > > > > > > > > To clarify I made it like that because that is what I > > > > > understood > > > > > was > > > > > your vision for drm_sched_fini. That it would be able to > > > > > clean > > > > > everything up etc. > > > > > > > > > > If you meant something different we need to adjust it. > > > > > Although > > > > > it > > > > > could > > > > > be done later as well. > > > > > > > > My primary intention was to use those tests to see whether my > > > > memory > > > > leak fix works or not. > > > > > > Okay here I don't know exactly what are you changing and how you > > > are > > > testing to comment. > > > > > > > Thus I'd need something like in v4, where the tests run as > > > > intended > > > > but > > > > leak the jobs in the pending_list. Would indeed be neat if they > > > > also > > > > check for list_len(pending_list) so that you don't have to run > > > > kmemleak > > > > manually. > > > > > > > > And since we shouldn't have such leaks in production tests the > > > > idea > > > > of > > > > optionally enabling them arose. > > > > > > > > More generally speaking, I think that the tests should behave > > > > as we > > > > expect drivers to behave *currently*, while we have the option > > > > for > > > > the > > > > tests to purposefully misbehave. In my case this means: just > > > > leak > > > > the > > > > jobs and optionally tell about the leaks. > > > > > > > > When used as intended, the scheduler currently AFAIK doesn't > > > > have > > > > UAF > > > > or nullptrderefs, so they shouldn't occur in tests that use it > > > > as > > > > intended. Now the leaks are special because we never intended > > > > them. > > > > > > > > So what surprised me is that, when compared to v4, this v7 here > > > > actually now also has UAF. Besides not cleaning up leaks, what > > > > are > > > > you > > > > doing in aspiritional mode? > > > > > > > > Let's talk it through before you invest your time into v8 > > > > > > 1) > > > > > > "Normal" mode is fine, do you agree? No leaks, no UAF. Ie. it is > > > exercising the scheduler how drivers are supposed to use it > > > today. > > > > Yes, it is fine. > > > > > > > > 2) > > > > > > Aspirational mode I added on your request and you can define how > > > you > > > want it to work. > > > > My personal intention is simple: have the jobs leaked like in v4, > > without UAF or faults and the like. > > > > > > > > Lets have a look in drm_mock_sched_fini() and see that it is > > > pretty > > > straightforward. First I'll pre-process and annotate the normal > > > version: > > > > > > void drm_mock_sched_fini(struct drm_mock_scheduler *sched) > > > { > > > struct drm_mock_sched_job *job, *next; > > > struct kunit *test = sched->test; > > > unsigned long flags; > > > LIST_HEAD(signal); > > > > > > // Stop the scheduler workqueues so nothing further gets > > > scheduled > > > drm_sched_wqueue_stop(&sched->base); > > > > > > // Idle the mock scheduler in-flight jobs: > > > > > > // First move them to a local list so any in parallel > > > "hardware" > > > activity does not see them any more > > > > > > spin_lock_irqsave(&sched->lock, flags); > > > list_for_each_entry_safe(job, next, &sched->job_list, > > > link) > > > list_move_tail(&job->link, &signal); > > > spin_unlock_irqrestore(&sched->lock, flags); > > > > > > // Now for jobs which "hardware" hasn't processed yet we > > > do: > > > // > > > // 1. Cancel the completion timer > > > // 2. Mark the job as complete (signal the hw fence, > > > wake up > > > waiters) > > > // 3. Release the hardware fence (list had a reference) > > > // 4. Free the job itself > > > > > > list_for_each_entry(job, &signal, link) { > > > hrtimer_cancel(&job->timer); > > > drm_mock_sched_job_complete(job); > > > dma_fence_put(&job->hw_fence); > > > drm_sched_job_cleanup(&job->base); > > > } > > > > > > // Finally drm_sched_fini > > > drm_sched_fini(&sched->base); > > > } > > > > > > In aspirational mode it becomes this: > > > > > > void drm_mock_sched_fini(struct drm_mock_scheduler *sched) > > > { > > > struct kunit *test = sched->test; > > > > > > drm_sched_fini(&sched->base); > > > > > > KUNIT_ASSERT_TRUE(test, list_empty(&sched->job_list)); > > > KUNIT_ASSERT_TRUE(test, list_empty(&sched- > > > > base.pending_list)); > > > } > > > > > > So relies on drm_sched_fini to cleanup _everything_. Including > > > the > > > driver (mock scheduler) state. Which is what I understood you > > > wanted. > > > > Thx for the detailed explanation. _Everything_ seems a lot, though. > > Even in this version, you still stop submission to your entities, > > don't > > you? > > > > IOW: is drm_sched_fini(), in aspirational mode, responsible for > > more > > than preventing the leaks (and for what it is already doing in > > mainline > > linux)? > > Yes it is, it would need to idle and free all driver state. Again, > this > is I understood you wanted to work towards, but as this thread will > conclude as dropping it for now it doesn't matter. > > > > For example drm_sched_basic_entity_cleanup test case will exit > > > (kill > > > the > > > entities and tear down the scheduler) with half of the submitted > > > jobs > > > still in flight. That's one example of what will be caught here > > > by > > > the > > > asserts and also UAF since kunit test will just exit and free up > > > memory > > > regardless. > > > > Hmm. OK, so kunit does the free. But who does the access? > > drm_sched_fini() still stops all the work items. > > In-flight jobs. Completion timers will fire, simulating hardware > interrupts firing with a real driver. Ah well, no, that I hadn't in mind ;) I'll simply use your v4 to test my leak fixes. That'll be sufficient for now. > > > > If this isn't what you had in mind you can either a) tell me what > > > you > > > want maybe I can tweak it*, or b) we can drop the aspirational > > > patch > > > for > > > now. > > > > > > So, here's the bad news. I had a chat with Danilo yesterday who > > reminded me of the fact that there are testers out there (like the > > Intel bots) which do random kconfigs - so avoiding that with > > another > > kconfig option does not solve the problem. > > > > Thus, I think the way to go is to indeed drop aspirational mode and > > whoever works on Schrödinger-Problems (like myself) has to tweak > > the > > tests locally. > > > > That's probably even more productive process-wise, otherwise we'd > > have > > to remove broken aspects of (or the entire) aspirational mode every > > time that we solved a known problem. > > > > Unless you have a better proposal, let's drop that mode. > > > > > > Sorry for the zig-zag, but that wasn't on my / our radar > > No worries. Can you review the v7 as is or you want me to send v8? I > am > pretty sure aspirational mode patch is standalone and could be simply > dropped when merging. Send a v8 please, that's easier. Don't forget Christian's ACK for the appropriate ones Thank you P. > > Regards, > > Tvrtko > > > > *) > > > For example it would be possible to avoid UAFs by moving away > > > from > > > kunit > > > managed allocations and just leak memory. Plus, it would also be > > > required to cancel the timers or so. > > > > > > For running in the VM or UML cases, which I thought were typical, > > > it > > > wouldn't be a benefit, but if you are worried about running on > > > the > > > host > > > kernel and expect no crashes, then that aspect would need to > > > change. > > > > > > Regards, > > > > > > Tvrtko > > > > > >