Re: Kernel panic during drm/nouveau init 5.3.0-rc7-next-20190903

Daniel Vetter <daniel@xxxxxxxx> · Fri, 20 Sep 2019 21:48:17 +0200



Adding Gerd and Christian.
-Daniel

On Fri, Sep 20, 2019 at 9:44 PM Alexander Kapshuk
<alexander.kapshuk@xxxxxxxxx> wrote:
>
> On Sat, Sep 07, 2019 at 12:05:34PM +0300, Alexander Kapshuk wrote:
> > To Whom It May Concern
> >
> > Every kernel I have built since 5.3.0-rc2-next-20190730 and up to
> > 5.3.0-rc7-next-20190903 has resulted in the kernel panic described below.
> >
> > The panic occurs early on in the boot process, so no records of it get
> > written on disk. I resourted to taking photos and videos to get the info
> > for debugging.
> >
> > [Kernel panic]
> > Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
> >
> > Kernel panic - Not syncing: Attempted to kill init! exitcode=0x0000000b.
> >
> > Top of call stack:
> > __drm_fb_helper_initial_config_and_unlock
> > drm_fb_helper_initial_config
> >
> > <scripts/decodecode <~/tmp/panic_code.txt
> > Code: 00 48 83 bb f0 00 00 00 00 74 16 48 83 c3 18 b9 17 00 00 00 31 c0 48 89 df f3 48 ab 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb e1 <0f> 0b 0f 1f 40 00 55 48 89 e5 41 54 49 89 d4 53 48 89 f3 e8 7e ff
> > All code
> > ========
> >    0: 00 48 83                add    %cl,-0x7d(%rax)
> >    3: bb f0 00 00 00          mov    $0xf0,%ebx
> >    8: 00 74 16 48             add    %dh,0x48(%rsi,%rdx,1)
> >    c: 83 c3 18                add    $0x18,%ebx
> >    f: b9 17 00 00 00          mov    $0x17,%ecx
> >   14: 31 c0                   xor    %eax,%eax
> >   16: 48 89 df                mov    %rbx,%rdi
> >   19: f3 48 ab                rep stos %rax,%es:(%rdi)
> >   1c: 5b                      pop    %rbx
> >   1d: 41 5c                   pop    %r12
> >   1f: 5d                      pop    %rbp
> >   20: c3                      retq
> >   21: 4c 89 a3 f0 00 00 00    mov    %r12,0xf0(%rbx)
> >   28: eb e1                   jmp    0xb
> >   2a:*        0f 0b                   ud2             <-- trapping instruction
> >   2c: 0f 1f 40 00             nopl   0x0(%rax)
> >   30: 55                      push   %rbp
> >   31: 48 89 e5                mov    %rsp,%rbp
> >   34: 41 54                   push   %r12
> >   36: 49 89 d4                mov    %rdx,%r12
> >   39: 53                      push   %rbx
> >   3a: 48 89 f3                mov    %rsi,%rbx
> >   3d: e8                      .byte 0xe8
> >   3e: 7e ff                   jle    0x3f
> >
> > Code starting with the faulting instruction
> > ===========================================
> >    0: 0f 0b                   ud2
> >    2: 0f 1f 40 00             nopl   0x0(%rax)
> >    6: 55                      push   %rbp
> >    7: 48 89 e5                mov    %rsp,%rbp
> >    a: 41 54                   push   %r12
> >    c: 49 89 d4                mov    %rdx,%r12
> >    f: 53                      push   %rbx
> >   10: 48 89 f3                mov    %rsi,%rbx
> >   13: e8                      .byte 0xe8
> >   14: 7e ff                   jle    0x15
> >
> > The panic occurs after the 'Driver supports precise vblank timestamp
> > query.' line gets printed to console:
> > [    2.858970] Linux agpgart interface v0.103
> > [    2.859308] nouveau 0000:01:00.0: NVIDIA G84 (084300a2)
> > [    2.968950] nouveau 0000:01:00.0: bios: version 60.84.68.00.19
> > [    2.989923] nouveau 0000:01:00.0: bios: M0203T not found
> > [    2.990010] nouveau 0000:01:00.0: bios: M0203E not matched!
> > [    2.990096] nouveau 0000:01:00.0: fb: 512 MiB DDR2
> > [    3.062362] [TTM] Zone  kernel: Available graphics memory: 2015014 KiB
> > [    3.062494] [TTM] Initializing pool allocator
> > [    3.062581] [TTM] Initializing DMA pool allocator
> > [    3.062683] nouveau 0000:01:00.0: DRM: VRAM: 512 MiB
> > [    3.062769] nouveau 0000:01:00.0: DRM: GART: 1048576 MiB
> > [    3.062859] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> > [    3.062944] nouveau 0000:01:00.0: DRM: DCB version 4.0
> > [    3.063030] nouveau 0000:01:00.0: DRM: DCB outp 00: 02000300 00000028
> > [    3.063117] nouveau 0000:01:00.0: DRM: DCB outp 01: 01000302 00000030
> > [    3.063203] nouveau 0000:01:00.0: DRM: DCB outp 02: 04011310 00000028
> > [    3.063290] nouveau 0000:01:00.0: DRM: DCB outp 03: 02011312 00c000b0
> > [    3.063377] nouveau 0000:01:00.0: DRM: DCB conn 00: 1030
> > [    3.063462] nouveau 0000:01:00.0: DRM: DCB conn 01: 2130
> > [    3.065982] nouveau 0000:01:00.0: DRM: MM: using CRYPT for buffer copies
> > [    3.066622] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> > [    3.066754] [drm] Driver supports precise vblank timestamp query.
> >
> > I was not able to capture the value of RIP for this crash.
> >
> > With drm_kms_helper.fbdev_emulation=0 enabled, as documented in
> > the commentary to function drm_fb_helper_initial_config defined in
> > drivers/gpu/drm/drm_fb_helper.c, I get the following output:
> >
> > RIP: 0010: _raw_spin_lock+0x7/0x20
> > Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
> >
> > <scripts/decodecode <~/tmp/panic_code.txt
> > Code: ba ff 00 00 00 f0 0f b1 17 75 01 c3 55 48 89 e5 e8 23 a2 6d ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 01 c3 55 89 c6 40 89 e5 e8 e7 8f 6d ff 5d c3 0f 1f
> > All code
> > ========
> >    0: ba ff 00 00 00          mov    $0xff,%edx
> >    5: f0 0f b1 17             lock cmpxchg %edx,(%rdi)
> >    9: 75 01                   jne    0xc
> >    b: c3                      retq
> >    c: 55                      push   %rbp
> >    d: 48 89 e5                mov    %rsp,%rbp
> >   10: e8 23 a2 6d ff          callq  0xffffffffff6da238
> >   15: 5d                      pop    %rbp
> >   16: c3                      retq
> >   17: 66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
> >   1e: 00 00 00 00
> >   22: 90                      nop
> >   23: 31 c0                   xor    %eax,%eax
> >   25: ba 01 00 00 00          mov    $0x1,%edx
> >   2a:*        f0 0f b1 17             lock cmpxchg %edx,(%rdi)                <-- trapping instruction
> >   2e: 75 01                   jne    0x31
> >   30: c3                      retq
> >   31: 55                      push   %rbp
> >   32: 89 c6                   mov    %eax,%esi
> >   34: 40 89 e5                rex mov %esp,%ebp
> >   37: e8 e7 8f 6d ff          callq  0xffffffffff6d9023
> >   3c: 5d                      pop    %rbp
> >   3d: c3                      retq
> >   3e: 0f                      .byte 0xf
> >   3f: 1f                      (bad)
> >
> > Code starting with the faulting instruction
> > ===========================================
> >    0: f0 0f b1 17             lock cmpxchg %edx,(%rdi)
> >    4: 75 01                   jne    0x7
> >    6: c3                      retq
> >    7: 55                      push   %rbp
> >    8: 89 c6                   mov    %eax,%esi
> >    a: 40 89 e5                rex mov %esp,%ebp
> >    d: e8 e7 8f 6d ff          callq  0xffffffffff6d8ff9
> >   12: 5d                      pop    %rbp
> >   13: c3                      retq
> >   14: 0f                      .byte 0xf
> >   15: 1f                      (bad)
> >
> > (gdb) list *(_raw_spin_lock+0x7)
> > 0xffffffff81a13b27 is in _raw_spin_lock (./arch/x86/include/asm/atomic.h:200).
> > 195   }
> > 196
> > 197   #define arch_atomic_try_cmpxchg arch_atomic_try_cmpxchg
> > 198   static __always_inline bool arch_atomic_try_cmpxchg(atomic_t *v, int *old, int new)
> > 199   {
> > 200           return try_cmpxchg(&v->counter, old, new);
> > 201   }
> > 202
> > 203   static inline int arch_atomic_xchg(atomic_t *v, int new)
> > 204   {
> >
> > (gdb) disassemble _raw_spin_lock+0x7
> > Dump of assembler code for function _raw_spin_lock:
> >    0xffffffff81a13b20 <+0>:   xor    %eax,%eax
> >    0xffffffff81a13b22 <+2>:   mov    $0x1,%edx
> >    0xffffffff81a13b27 <+7>:   lock cmpxchg %edx,(%rdi)
> >    0xffffffff81a13b2b <+11>:  jne    0xffffffff81a13b2e <_raw_spin_lock+14>
> >    0xffffffff81a13b2d <+13>:  retq
> >    0xffffffff81a13b2e <+14>:  push   %rbp
> >    0xffffffff81a13b2f <+15>:  mov    %eax,%esi
> >    0xffffffff81a13b31 <+17>:  mov    %rsp,%rbp
> >    0xffffffff81a13b34 <+20>:  callq  0xffffffff810ecb20 <queued_spin_lock_slowpath>
> >    0xffffffff81a13b39 <+25>:  pop    %rbp
> >    0xffffffff81a13b3a <+26>:  retq
> > End of assembler dump.
> >
> > Any pointers on how to proceed with this would be appreciated.
>
> 'Git bisect' has identified the following commits as being 'bad'.
>
> b96f3e7c8069b749a40ca3a33c97835d57dd45d2 is the first bad commit
> commit b96f3e7c8069b749a40ca3a33c97835d57dd45d2
> Author: Gerd Hoffmann <kraxel@xxxxxxxxxx>
> Date:   Mon Aug 5 16:01:10 2019 +0200
>
>     drm/ttm: use gem vma_node
>
>     Drop vma_node from ttm_buffer_object, use the gem struct
>     (base.vma_node) instead.
>
>     Signed-off-by: Gerd Hoffmann <kraxel@xxxxxxxxxx>
>     Reviewed-by: Christian König <christian.koenig@xxxxxxx>
>     Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-9-kraxel@xxxxxxxxxx
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 2 +-
>  drivers/gpu/drm/drm_gem_vram_helper.c      | 2 +-
>  drivers/gpu/drm/nouveau/nouveau_display.c  | 2 +-
>  drivers/gpu/drm/nouveau/nouveau_gem.c      | 2 +-
>  drivers/gpu/drm/qxl/qxl_object.h           | 2 +-
>  drivers/gpu/drm/radeon/radeon_object.h     | 2 +-
>  drivers/gpu/drm/ttm/ttm_bo.c               | 8 ++++----
>  drivers/gpu/drm/ttm/ttm_bo_util.c          | 2 +-
>  drivers/gpu/drm/ttm/ttm_bo_vm.c            | 9 +++++----
>  drivers/gpu/drm/virtio/virtgpu_drv.h       | 2 +-
>  drivers/gpu/drm/virtio/virtgpu_prime.c     | 3 ---
>  drivers/gpu/drm/vmwgfx/vmwgfx_bo.c         | 4 ++--
>  drivers/gpu/drm/vmwgfx/vmwgfx_surface.c    | 4 ++--
>  include/drm/ttm/ttm_bo_api.h               | 4 ----
>  14 files changed, 21 insertions(+), 27 deletions(-)
>
> I nominated commit '[1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715] drm/ttm:
> use gem reservation object' as being 'good' initially, based on the
> fact that kernel 5.3.0-rc1-00364-g1e053b10ba60 did boot. But the GUI
> applications displayed black artifacts across the screen.
>
> I then edited the git-bisect log file where I nominated
> commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as being
> 'bad' and ran 'git bisect replay' on it. This blamed commit
> 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 as the first bad commit.
>
> 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715 is the first bad commit
> commit 1e053b10ba60eae6a3f9de64cbc74bdf6cb0e715
> Author: Gerd Hoffmann <kraxel@xxxxxxxxxx>
> Date:   Mon Aug 5 16:01:09 2019 +0200
>
>     drm/ttm: use gem reservation object
>
>     Drop ttm_resv from ttm_buffer_object, use the gem reservation object
>     (base._resv) instead.
>
>     Signed-off-by: Gerd Hoffmann <kraxel@xxxxxxxxxx>
>     Reviewed-by: Christian König <christian.koenig@xxxxxxx>
>     Link: http://patchwork.freedesktop.org/patch/msgid/20190805140119.7337-8-kraxel@xxxxxxxxxx
>
>  drivers/gpu/drm/ttm/ttm_bo.c      | 39 +++++++++++++++++++++++----------------
>  drivers/gpu/drm/ttm/ttm_bo_util.c |  2 +-
>  include/drm/ttm/ttm_bo_api.h      |  1 -
>  3 files changed, 24 insertions(+), 18 deletions(-)
>
>
> In the process of bisection, I nominated the following kernels as being
> 'bad'. They also booted fine, but the xserver would fail to start. I
> have attached the error messages generated by xorg.
>
> # kernel boots; Xorg won't start. See Xorg_err.log attached.
> 5.3.0-rc3-01537-g6a3068065fa4
> 5.3.0-rc3-00782-gb0383c0653c4
> 5.3.0-rc1-00391-g54fc01b775fe
> 5.3.0-rc1-00366-g2e3c9ec4d151
> 5.3.0-rc1-00365-gb96f3e7c8069
>
> Today, I upgraded the kernel to 5.3.0-next-20190919, which booted fine
> with no Xorg regressions to report.
>
> Just wondering if the earlier kernels would not boot for me because of
> the changes introduced by the 'bad' commits being perhaps incomplete?
>
> Thanks to all of you for the tips on how proceed with bisection.


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch