Forgot to add Andrey as scheduler maintainer.
-Daniel

On Fri, 7 Oct 2022 at 10:16, Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
>
> On Fri, 7 Oct 2022 at 01:45, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@xxxxxxxxx> wrote:
> > >
> > >
> > > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
> >
> > As far as I can tell, that's the line
> >
> >         struct drm_gpu_scheduler *sched = s_fence->sched;
> >
> > where 's_fence' is NULL. The code is
> >
> >    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
> >    5:   41 54                   push   %r12
> >    7:   55                      push   %rbp
> >    8:   53                      push   %rbx
> >    9:   48 89 fb                mov    %rdi,%rbx
> >    c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp   <-- trapping instruction
> >   13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
> >   1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
> >
> > and that next 'lock decl' instruction would have been the
> >
> >         atomic_dec(&sched->hw_rq_count);
> >
> > at the top of drm_sched_job_done().
> >
> > Now, as to *why* you'd have a NULL s_fence, it would seem that
> > drm_sched_job_cleanup() was called with an active job. Looking at that
> > code, it does
> >
> >         if (kref_read(&job->s_fence->finished.refcount)) {
> >                 /* drm_sched_job_arm() has been called */
> >                 dma_fence_put(&job->s_fence->finished);
> >         ...
> >
> > but then it does
> >
> >         job->s_fence = NULL;
> >
> > anyway, despite the job still being active. The logic of that kind of
> > "fake refcount" escapes me. The above looks fundamentally racy, not to
> > say pointless and wrong (a refcount is a _count_, not a flag, so there
> > could be multiple references to it, what says that you can just
> > decrement one of them and say "I'm done").
>
> Just figured I'll clarify this, because it's indeed a bit wtf and the
> comment doesn't explain much.
> drm_sched_job_cleanup can be called both
> when a real job is being cleaned up (which holds a full reference on
> job->s_fence and needs to drop it) and to simplify the error path in job
> construction (and the "is this refcount initialized already" check signals
> what exactly needs to be cleaned up or not). So no race, because the
> only time this check takes the other branch is when job construction has
> failed before the job struct is visible to any other thread.
>
> But yeah the comment could actually explain what's going on here :-)
>
> And yeah the patch Dave reverted screws up the cascade of references
> that ensures this all stays alive until drm_sched_job_cleanup is
> called on active jobs, so looks all reasonable to me. Some KUnit tests
> maybe to exercise these corners? Not the first time pure scheduler
> code blew up, so probably worth the effort.
> -Daniel
>
> >
> > Now, _why_ any of that happens, I have no idea. I'm just looking at
> > the immediate "that pointer is NULL" thing, and reacting to what looks
> > like a completely bogus refcount pattern.
> >
> > But that odd refcount pattern isn't new, so it's presumably some user
> > on the amd gpu side that changed.
> >
> > The problem hasn't happened again for me, but that's not saying a lot,
> > since it was very random to begin with.
> >
> >               Linus
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
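
To make the "is this refcount initialized already" point concrete, here
is a rough userspace sketch of the pattern. It is not the scheduler
code: every toy_* name is hypothetical, the refcount is a plain int
rather than a kref, and it assumes a single thread. It only illustrates
why one cleanup function can serve both the error path (fence allocated
but never armed, refcount still 0) and the normal path (armed, so the
job drops its one reference and other holders keep the fence alive).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_fence {
    int refcount;               /* stays 0 until the job is armed */
};

struct toy_job {
    struct toy_fence *s_fence;
};

int toy_fences_live;            /* fences allocated and not yet freed */

/* Construction, step 1: allocate the fence but leave refcount at 0. */
int toy_job_init(struct toy_job *job)
{
    job->s_fence = calloc(1, sizeof(*job->s_fence));
    if (!job->s_fence)
        return -1;
    toy_fences_live++;
    return 0;
}

/* Construction, step 2: point of no return, the job now owns one reference. */
void toy_job_arm(struct toy_job *job)
{
    job->s_fence->refcount = 1;
}

void toy_fence_get(struct toy_fence *f)
{
    f->refcount++;
}

void toy_fence_put(struct toy_fence *f)
{
    if (--f->refcount == 0) {
        free(f);
        toy_fences_live--;
    }
}

/* One cleanup for both cases, keyed on whether the refcount was ever set. */
void toy_job_cleanup(struct toy_job *job)
{
    if (job->s_fence->refcount) {
        /* Armed: drop only the job's own reference. */
        toy_fence_put(job->s_fence);
    } else {
        /* Never armed: no other thread has seen the fence, free it directly. */
        free(job->s_fence);
        toy_fences_live--;
    }
    job->s_fence = NULL;
}
```

In this sketch the refcount test is only safe because the zero case can
happen solely before the job is published. Calling toy_job_cleanup() on
a job that something else still expects to complete leaves that code
looking at a NULL s_fence, which is the shape of the oops above.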