Re: [RFC PATCH v2 0/4] Improve stack unwind on ppc64

Aditya Gupta <adityag@xxxxxxxxxxxxx> · Fri, 8 Sep 2023 14:50:48 +0530

Hello lijiang,

On Tue, Sep 05, 2023 at 10:09:44AM +0800, lijiang wrote:
> On Mon, Sep 4, 2023 at 7:38 PM Aditya Gupta <adityag@xxxxxxxxxxxxx> wrote:
> 
> >
> > ...
> >
> > I did not do it, but logically that might help on x86_64.
> > I have put a diagram in patch #2.
> >
> 
> Thank you for the explanation, Aditya. That sounds good if it's doable
> on other arches.
> 

Yes, and if someone attempts it, we are happy to assist with our findings.

> 
> >
> >
> > Oh, sure I will test this, meanwhile are you sure 'info rv' was the command
> > you ran, I guess that's not a valid gdb command, can you please check.
> >
> >
> Ah, I misunderstood, I would like to display the result of a specified
> local variable such as "rv"(not all local variables).
> For now I get the result of "info locals" command as below:
> 
> gdb> frame 7
> #7  proc_reg_write (file=<optimized out>, buf=<optimized out>,
> count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:352
> 352                     rv = pde_write(pde, file, buf, count, ppos);
> gdb> info locals
> pde = 0xc00000000556fcc0
> rv = -5
> gdb>
> 
> But the "info variables" command still hangs:
> gdb> info variables
>  -- MORE --  forward: <SPACE>, <ENTER> or j  backward: b or k  quit:
> q...skipping...
>  -- MORE --  forward: <SPACE>, <ENTER> or j  backward: b or k  quit:
> q...skipping...
> 
> Anyway I'm not sure if this can support other "info" subcommands. Just
> tried it.
> 

Oh okay, got it.

Since gdb mode had very limited functionality, this patch series focuses mainly
on getting function arguments (bt), local variables (info locals) and registers
(info registers) to print, and on fixing gdb mode in general.

We will fix the other 'info' subcommands in future patches; the plan is to fix
gdb mode step by step.

> 
> 
> > >
> > > Known Issues:
> > > > =============
> > > >
> > > > 1. In gdb mode, 'info threads' might hang for few seconds, and print
> > > >    only 2 threads
> > > >
> > >
> > > Hmm, it only prints 2 threads, and one of which is unavailable on
> > > my side.
> > > Can you try to dig into the details?
> > >
> >
> > Yes, the long time is due to gdb trying lot of unwinders, to unwind frame
> > in each thread.
> > This happens in gdb mode, and does not affect the default crash mode in
> > any way.
> > And without this patch set also, gdb mode didn't recognise actual threads,
> > and would just print n threads (n being number of cpus), since that's what
> > was added in crash_target_init
> >
> > I have tried to fix this earlier, but failed to. The following is my
> > speculation, but this might have been caused due to crash explicitly having
> > registered threads in crash_target_init, and since not all threads will be
> > alive, gdb is trying to unwind them and failing, so it goes and tries all
> > unwinders, and still fail
> >
> >
> It seems worth looking into the details.
> 
> The current thread may be a panic thread by default. That would be more
> valuable if it supports showing the results of local variables or
> arguments(furthermore backtrace) in a panic thread.

Yes, defaulting to the panic thread does seem more useful, and with these
patches, function arguments and 'info locals' work.

Currently we just select thread 0 by default in gdb (in
'crash_target_init').

Setting the current thread to the panic thread by default was actually the
remaining TODO. I have completed it and have the patch ready, but since I am
busy with other patches, I am not able to work on it right now.
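
To illustrate the idea (this is not the code from the patch, only a small
sketch in Python): in the sessions quoted in this thread, gdb thread N
corresponds to CPU N-1 (thread 1 is CPU 0, thread 6 is CPU 5). Assuming that
mapping holds in general, and that the panic thread is the one on the panic
CPU, the default selection amounts to something like the following. The
function names are hypothetical and do not exist in crash or gdb.

```python
from typing import Optional


def gdb_thread_id_for_cpu(cpu: int) -> int:
    """Map a CPU number to a gdb thread id (assumed: thread id = cpu + 1)."""
    return cpu + 1


def default_gdb_thread(num_cpus: int, panic_cpu: Optional[int]) -> int:
    """Pick the thread gdb mode should start on.

    Prefer the thread of the panic CPU; fall back to the first thread
    (CPU 0), which matches the current default, when the panic CPU is
    unknown or out of range (for example on a live system).
    """
    if panic_cpu is not None and 0 <= panic_cpu < num_cpus:
        return gdb_thread_id_for_cpu(panic_cpu)
    return gdb_thread_id_for_cpu(0)
```

With 8 CPUs and a panic on CPU 5, this selects thread 6; with no known panic
CPU it keeps today's behaviour and selects the first thread.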

> 
>  # ./crash vmlinux /var/crash/127.0.0.1-2023-09-03-23\:17\:53/vmcore
> crash 8.0.3++
> ...
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> 
>       KERNEL: vmlinux
>     DUMPFILE: /var/crash/127.0.0.1-2023-09-03-23:17:53/vmcore
>         CPUS: 8
>         DATE: Sun Sep  3 23:17:17 EDT 2023
>       UPTIME: 2 days, 18:28:40
> LOAD AVERAGE: 0.61, 0.22, 0.08
>        TASKS: 173
>     NODENAME: ibm-p9z-16-lp9.khw3.lab.eng.bos.redhat.com
>      RELEASE: 6.5.0+
>      VERSION: #1 SMP Fri Sep  1 04:07:47 EDT 2023
>      MACHINE: ppc64le  (2800 Mhz)
>       MEMORY: 16 GB
>        PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>          PID: 1453
>      COMMAND: "bash"
>         TASK: c00000000845b180  [THREAD_INFO: c00000000845b180]
>          CPU: 0
>        STATE: TASK_RUNNING (PANIC)
> 
> crash> bt
> PID: 1453     TASK: c00000000845b180  CPU: 0    COMMAND: "bash"
>  R0:  c00000000014e018    R1:  c00000003dbd7930    R2:  c0000000015a2500
>  R3:  c00000003dbd7928    R4:  c00000000845b180    R5:  0000000000000020
>  R6:  c000000002e32500    R7:  0000000000000000    R8:  0000000000000001
>  R9:  c00000000b6e1000    R10: 0000000000000000    R11: 0000000000000001
>  R12: 0000000000000000    R13: c000000002f60000    R14: 0000000000000000
>  R15: 0000000000000000    R16: 0000000000000000    R17: 0000000000000000
>  R18: 0000000000000000    R19: 0000000000000000    R20: 0000000000000000
>  R21: 0000000000000000    R22: 0000000000000000    R23: 0000000000000000
>  R24: 0000000000000000    R25: c000000001110920    R26: 0000000000000000
>  R27: c00000000276e510    R28: c000000002d1a660    R29: c000000002d1a698
>  R30: c000000002c62500    R31: c00000003dbd7958
>  NIP: c0000000002843f8    MSR: 8000000000009033    OR3: 0000000000000000
>  CTR: 000000000074d2f4    LR:  c00000000014e018    XER: 0000000020040005
>  CCR: 0000000028422282    MQ:  0000000000000001    DAR: c000000002d1a660
>  DSISR: c000000002d1a698     Syscall Result: 0000000000000001
>  [NIP  : __crash_kexec+248]
>  [LR   : panic+412]
>  #0 [c00000003dbd7930] __crash_kexec at c0000000002843f8
>  #1 [c00000003dbd7af0] panic at c00000000014e018
>  #2 [c00000003dbd7b90] sysrq_handle_crash at c0000000009b8978
>  #3 [c00000003dbd7bf0] __handle_sysrq at c0000000009b946c
>  #4 [c00000003dbd7c90] write_sysrq_trigger at c0000000009b9ce8
>  #5 [c00000003dbd7cd0] proc_reg_write at c0000000006919fc
>  #6 [c00000003dbd7d00] vfs_write at c0000000005b7cb8
>  #7 [c00000003dbd7dc0] ksys_write at c0000000005b83a4
>  #8 [c00000003dbd7e10] system_call_exception at c000000000031454
>  #9 [c00000003dbd7e50] system_call_vectored_common at c00000000000cedc
> crash> set gdb on
> gdb: on
> gdb> info thread
>   Id   Target Id         Frame
> * 1    CPU 0             <unavailable> in ?? ()
>   2    CPU 1
> gdb> bt
> #0  <unavailable> in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> gdb> thread 1
> [Switching to thread 1 (CPU 0)]
> #0  <unavailable> in ?? ()
> gdb> bt
> #0  <unavailable> in ?? ()
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> 
> 
> Although it only displays two threads, still supports switching to other
> threads:
> 
> gdb> thread 6
> [Switching to thread 6 (CPU 5)]
> #0  0xc0000000002843f8 in crash_setup_regs (oldregs=<optimized out>,
> newregs=0xc00000003dbd7958) at ./arch/powerpc/include/asm/kexec.h:69
> 69                      ppc_save_regs(newregs);
> gdb>
> 

Yes, it supports switching to any thread that has not exited, including the
panic thread.

> 
> For live debugging, I only see one thread, and I can not switch to other
> threads. They have different behaviors.
> 
> gdb> info thread
>   Id   Target Id         Frame
> * 1    CPU 0             <unavailable> in ?? ()
> gdb> thread 6
> gdb: gdb request failed: thread 6
> gdb>
> 

The 'info threads' issue is known. The inability to switch to other threads
may be due to gdb's limited support for live system debugging, even with
crash's help, and is worth exploring further.

Thanks for trying the gdb mode with these patches though :)

Thanks,
Aditya Gupta
