Re: [git pull] drm for 6.1-rc1

Christian König <christian.koenig@xxxxxxx> · Fri, 7 Oct 2022 08:11:26 +0200

On 07.10.22 at 04:45, Dave Airlie wrote:
On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied@xxxxxxxxx> wrote:

[ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
[ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
As far as I can tell, that's the line

         struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

    0: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)
    5: 41 54                push   %r12
    7: 55                    push   %rbp
    8: 53                    push   %rbx
    9: 48 89 fb              mov    %rdi,%rbx
    c:* 48 8b af 88 00 00 00 mov    0x88(%rdi),%rbp <-- trapping instruction
   13: f0 ff 8d f0 00 00 00 lock decl 0xf0(%rbp)
   1a: 48 8b 85 80 01 00 00 mov    0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

         atomic_dec(&sched->hw_rq_count);

at the top of drm_sched_job_done().
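
The faulting address in the oops (0x88) equals the displacement in the trapping `mov 0x88(%rdi),%rbp`, which is what a member load through a NULL base pointer looks like. A minimal sketch of that arithmetic follows. The struct layout and the names `fake_fence` and `member_fault_address` are hypothetical, invented for illustration; this is not the real struct drm_sched_fence layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real drm_sched_fence.
 * The padding is sized so that 'sched' lands at offset 0x88. */
struct fake_fence {
    unsigned char pad[0x88];
    void *sched;
};

/* Compile-time check that the illustrative layout matches the oops offset. */
static_assert(offsetof(struct fake_fence, sched) == 0x88,
              "sched must sit at offset 0x88 in this sketch");

/* Address a load of base->sched would touch, computed without
 * dereferencing anything: base address plus member offset. */
static uintptr_t member_fault_address(uintptr_t base)
{
    return base + offsetof(struct fake_fence, sched);
}
```

With a base of 0 (a NULL pointer) the computed address is the member offset itself, which is why a small non-zero fault address usually points at "NULL struct pointer, member at that offset".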

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

         if (kref_read(&job->s_fence->finished.refcount)) {
                 /* drm_sched_job_arm() has been called */
                 dma_fence_put(&job->s_fence->finished);
         ...

but then it does

         job->s_fence = NULL;

anyway, despite the job still being active. The logic of that kind of
"fake refcount" escapes me. The above looks fundamentally racy, not to
say pointless and wrong (a refcount is a _count_, not a flag, so there
could be multiple references to it, what says that you can just
decrement one of them and say "I'm done").
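
To make the "count, not a flag" point concrete, here is a minimal userspace sketch using C11 atomics. The names `obj`, `obj_get` and `obj_put` are hypothetical stand-ins, not the kernel's kref or dma_fence API, and this is not a proposed fix for the scheduler code. It only shows that a non-zero count says nothing about who else holds a reference, and that dropping one reference does not mean the object is finished.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object for illustration; not kernel API. */
struct obj {
    atomic_int refcount;
    bool released;      /* set once the last reference is dropped */
};

/* The creator starts out holding exactly one reference. */
static void obj_init(struct obj *o)
{
    atomic_init(&o->refcount, 1);
    o->released = false;
}

/* Every additional holder takes its own reference. */
static struct obj *obj_get(struct obj *o)
{
    atomic_fetch_add(&o->refcount, 1);
    return o;
}

/* Returns true only for the caller that dropped the final reference.
 * atomic_fetch_sub returns the value *before* the decrement. */
static bool obj_put(struct obj *o)
{
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        o->released = true;
        return true;
    }
    return false;
}
```

In this sketch, a holder that calls obj_put() and gets false back must assume the object is still in use elsewhere; treating "refcount was non-zero" as "I owned the only reference" is the flag-style reading criticized above.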

Now, _why_ any of that happens, I have no idea. I'm just looking at
the immediate "that pointer is NULL" thing, and reacting to what looks
like a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user
on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.
I chased down the culprit to a drm sched patch, I'll send you a pull
with a revert in it.

commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Author: Arvind Yadav <Arvind.Yadav@xxxxxxx>
Date:   Wed Sep 14 22:13:20 2022 +0530

     drm/sched: Use parent fence instead of finished

     Using the parent fence instead of the finished fence
     to get the job status. This change is to avoid GPU
     scheduler timeout error which can cause GPU reset.

     Signed-off-by: Arvind Yadav <Arvind.Yadav@xxxxxxx>
     Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
     Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
     Signed-off-by: Christian König <christian.koenig@xxxxxxx>

I'll let Arvind and Christian maybe work out what is going wrong there.

That's a known issue that Arvind has been investigating for a while.

Any idea how you triggered it on boot? We have only been able to trigger
it very sporadically.

Reverting the patch for now sounds like a good idea to me; it's only a
cleanup anyway.

Thanks,
Christian.

Dave.

                  Linus