[Crash-utility] Re: [PATCH v3 0/5] Improve stack unwind on ppc64

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 12/15/23 04:38, Aditya Gupta wrote:

Hi Lianbo,

On Thu, Dec 14, 2023 at 02:49:54PM +0800, Lianbo Jiang wrote:
On 12/13/23 22:45, Aditya Gupta wrote:

Hi Lianbo,

On Wed, Dec 13, 2023 at 03:20:40PM +0800, Lianbo Jiang wrote:
Hi, Aditya

Thank you for the v3.

I got a core dump after applying the patch[5], but I did not see the
backtrace from the core dump file. Could you please check it again?

# ./crash

crash 8.0.4++
Copyright (C) 2002-2022  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
Copyright (C) 2015, 2021  VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64le-unknown-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

Segmentation fault (core dumped)
Thanks for trying it.

Was it 'crash-utility' which itself crashed due to the patches ?
Seems 'yes'. I did not see the same issue before applying the patchset.
That is wierd, it should not cause any segmentation fault with the
patches, can you please share the steps to reproduce this ? I will fix
it.
Sorry, I should describe the failure more details. It's easy to be
reproduced on my side.

Step 1: applying these five patches

Step 2: make lzo

Step 3-1: for live debugging, crash tool will fail to load and get the
segfault.

FYI:

If the current feature doesn't support for live debugging, need to mention
that somewhere, at least this should not fail to load.

Thanks, fixed it now. Yes, this feature doesn't support live debug, since
registers and other threads both are missing (atleast on ppc64).

Thank you for the explanation, Aditya.

On ppc64, the pt_regs is NULL, which was accessed and caused the segmentation fault.
Fixed it by skipping ppc64_get_cpu_reg call in case of live debug (return
FALSE), and even if it's a vmcore still pt_regs is NULL somehow, skipped
accessing that pt_regs and print a warning that registers are not available for
the respective cpu.

Sounds good.

For live debugging, also add a note in the patch log , and to say that it doesn't support the feature for now(if it's hard to  be implemented).


I will send it in V4.
Ok, Thanks.
Step 3-2: for kdump case(vmcore)

crash> gdb bt
#0  0xc000000000281298 in crash_setup_regs (gdb: invalid kernel virtual
address: fffffffffffffffb  type: "gdb_readmem callback"
gdb: invalid kernel virtual address: fffffffffffffff7  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff3  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffffb  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff7  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff3  type: "gdb_readmem
callback"
oldregs=<optimized out>, newregs=0xc00000000c0f7908) at
./arch/powerpc/include/asm/kexec.h:69
#1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:975
#2  0xfffffffffffffffb in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
crash> set gdb on
gdb: on
gdb> bt
#0  0xc000000000281298 in crash_setup_regs (oldregs=<optimized out>,
newregs=0xc00000000c0f7908) at ./arch/powerpc/include/asm/kexec.h:69
#1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:975
#2  0xfffffffffffffffb in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
gdb> info threads
   Id   Target Id         Frame
   1    CPU 0             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
   2    CPU 1             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
   3    CPU 2             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
   4    CPU 3             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
   5    CPU 4             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
   6    CPU 5             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
   7    CPU 6             plpar_hcall_norets_notrace () at
arch/powerpc/platforms/pseries/hvCall.S:112
* 8    CPU 7             0xc000000000281298 in crash_setup_regs (gdb:
invalid kernel virtual address: fffffffffffffffb  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff7  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff3  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffffb  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff7  type: "gdb_readmem
callback"
gdb: invalid kernel virtual address: fffffffffffffff3  type: "gdb_readmem
callback"
oldregs=<optimized out>, newregs=0xc00000000c0f7908) at
./arch/powerpc/include/asm/kexec.h:69
gdb> info locals
No locals.
gdb> info args
oldregs = <optimized out>
newregs = 0xc00000000c0f7908
gdb>
Okay, I just recalled this is a known issue with the kernel code to save
registers, and I fixed it in upstream linux, with the commit:

Sorry, I missed this information. Thank you, Aditya.

commit b684c09f09e7a6af3794d4233ef785819e72db79
Author: Aditya Gupta <adityag@xxxxxxxxxxxxx>
Date:   Thu Jun 15 14:40:47 2023 +0530

     powerpc: update ppc_save_regs to save current r1 in pt_regs

I did the test based on the latest kernel(with the above commit), and it works now.

crash> bt
PID: 4339     TASK: c000000049071880  CPU: 1    COMMAND: "bash"
 R0:  c000000000151658    R1:  c000000059dd3930    R2: c0000000015c3400
 R3:  c000000059dd3928    R4:  c000000049071880    R5: 0000000000000020
 R6:  c000000002e33400    R7:  0000000000000000    R8: 0000000000000001
 R9:  c00000005a95a800    R10: 0000000000000000    R11: 0000000000000001
 R12: 0000000000000000    R13: c00000000f7cf480    R14: 0000000000000000
 R15: 0000000000000000    R16: 0000000000000000    R17: 0000000000000000
 R18: 0000000000000000    R19: 0000000000000000    R20: 0000000000000000
 R21: 0000000000000000    R22: 0000000000000000    R23: 0000000000000000
 R24: 0000000000000000    R25: c000000001121490    R26: 0000000000000000
 R27: c00000000276d860    R28: c000000002d1a660    R29: c000000002d1a698
 R30: c000000002c63400    R31: c000000059dd3958
 NIP: c00000000028b488    MSR: 8000000000009033    OR3: 0000000000000000
 CTR: 00000000006db41c    LR:  c000000000151658    XER: 0000000020040004
 CCR: 0000000028422282    MQ:  0000000000000001    DAR: c000000002d1a660
 DSISR: c000000002d1a698     Syscall Result: 0000000000000001
 [NIP  : __crash_kexec+248]
 [LR   : panic+412]
 #0 [c000000059dd3930] __crash_kexec at c00000000028b488
 #1 [c000000059dd3af0] panic at c000000000151658
 #2 [c000000059dd3b90] sysrq_handle_crash at c0000000009d8df8
 #3 [c000000059dd3bf0] __handle_sysrq at c0000000009d9a30
 #4 [c000000059dd3c90] write_sysrq_trigger at c0000000009da1d8
 #5 [c000000059dd3cd0] proc_reg_write at c0000000006b1bcc
 #6 [c000000059dd3d00] vfs_write at c0000000005d1aa8
 #7 [c000000059dd3dc0] ksys_write at c0000000005d2184
 #8 [c000000059dd3e10] system_call_exception at c0000000000315e4
 #9 [c000000059dd3e50] system_call_vectored_common at c00000000000cedc

crash> gdb bt
#0  0xc00000000028b488 in crash_setup_regs (oldregs=<optimized out>, newregs=0xc000000059dd3958) at ./arch/powerpc/include/asm/kexec.h:69
#1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:1047
#2  0xc000000000151658 in panic (fmt=0xc0000000014628c0 "sysrq triggered crash\n") at kernel/panic.c:363 #3  0xc0000000009d8df8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:154 #4  0xc0000000009d9a30 in __handle_sysrq (key=key@entry=99 'c', check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:601 #5  0xc0000000009da1d8 in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1162 #6  0xc0000000006b1bcc in pde_write (ppos=<optimized out>, count=<optimized out>, buf=<optimized out>, file=<optimized out>, pde=0xc000000003602a00) at fs/proc/inode.c:337 #7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:349 #8  0xc0000000005d1aa8 in vfs_write (file=file@entry=0xc00000005f747800, buf=buf@entry=0x100039754b0 <error: Cannot access memory at address 0x100039754b0>, count=count@entry=2, pos=pos@entry=0xc000000059dd3de0) at fs/read_write.c:582 #9  0xc0000000005d2184 in ksys_write (fd=<optimized out>, buf=0x100039754b0 <error: Cannot access memory at address 0x100039754b0>, count=2) at fs/read_write.c:637 #10 0xc0000000000315e4 in system_call_exception (regs=0xc000000059dd3e80, r0=<optimized out>) at arch/powerpc/kernel/syscall.c:153 #11 0xc00000000000cedc in system_call_vectored_common () at arch/powerpc/kernel/interrupt_64.S:198
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
crash>

Anyway, the "gdb bt" command still got the following information:

"Backtrace stopped: previous frame inner to this frame (corrupt stack?)"

This could indicate an unwind failure due to stack corruption? But looks like a correct back trace, only slightly different from the crash bt command. This information seems strange.

I do not observe any other issues for the time being, it works well as expected, I will continue to do some tests based on your next post(with your fixes), and also let's reback to the code.


Please bear with me for the long explaination.

Basically before this commit, the coredump used to have `nip` (program counter)
pointing to the latest function call, BUT `r1` (stack pointer) pointing at
the 2nd latest function call.

And due to this, actually crash's backtrace itself is slightly wrong. crash
actually skips printing one topmost frame, and shows wrong instruction pointer
for top frame

In case of kdump, it skips 'crash_setup_regs''s activation frame, but that is
not very noticeable since it's inlined in '__crash_kexec', I will use 'fadump's'
example since it's easier to notice the issue there:

Interesting, good findings.

Coming back to gdb mode.
This is gdb's backtrace, for a fadump crash caused by 'echo c >
/proc/sysrq-trigger':

crash> gdb bt
#0  0xc0000000000533f0 in crash_fadump (regs=0x0, str=0xc000000002bfc510 <buf> "sysrq triggered crash") at arch/powerpc/kernel/fadump.c:734
#1  0xc00000000727fba0 in ?? ()

This is crash's bt and registers, for the same crash:

crash> bt
PID: 32170    TASK: c00000000f82e500  CPU: 0    COMMAND: "bash"
  R0:  c0000000000532d4    R1:  """c00000000727fa90"""    R2:  c000000002b13000
  ...
  NIP: c0000000000533f0    MSR: 8000000000001033    OR3: 0000000000000000
  CTR: 0000000000000000    LR:  c00000000002d430    XER: 0000000020040004
  [NIP  : crash_fadump+560]
  [LR   : ppc_panic_event+96]
  #0 [c00000000727fa30] crash_fadump at c00000000005333c
  #1 ["""c00000000727fa90"""] ppc_panic_event at c00000000002d430
  #2 [c00000000727fac0] atomic_notifier_call_chain at c000000000186d08
  #3 [c00000000727fb00] panic at c0000000001492dc
  #4 [c00000000727fba0] sysrq_handle_crash at c00000000094ad98
  #5 [c00000000727fc00] __handle_sysrq at c00000000094b8bc
  #6 [c00000000727fca0] write_sysrq_trigger at c00000000094c148
  #7 [c00000000727fce0] proc_reg_write at c00000000063c78c
  #8 [c00000000727fd10] vfs_write at c00000000056a0e4
  #9 [c00000000727fd60] ksys_write at c00000000056a694
crash> info reg r1 pc
r1             0xc00000000727fa90  13835058055402224272
pc             0xc0000000000533f0  0xc0000000000533f0 <crash_fadump+560>

Notice that the `pc` register is pointing to `crash_fadump`, but `r1` (ie. the
stack pointer) is in the function activation frame of `ppc_panic_event`.

GDB uses the `pc` register to get the function name, but reads other registers
according to debuginfo of the function it got from `pc`.
Since `pc` is in `crash_fadump`, it takes it's debuginfo, and let's say the
debuginfo says to find register LR at +46 offset from `r1`, thinking `r1` will
be the stack pointer of `crash_fadump`.
But instead since `r1` is inside `ppc_panic_event`, it accesses
`ppc_panic_event`'s frame, and due to this mismatch, it mostly reads some
invalid value for the registers, and the unwinding fails.

This is not due to gdb or the patch series, and instead was an issue with
storing registers in the kernel itself

Crash is not affected by this, since it simply reads the stack as an
array, reading 0th offset from SP to get caller's SP (backchain), and 16th
offset from SP to get LR. While gdb is trying to use some other function's
(`crash_fadump`) debuginfo to some other function's (`ppc_panic_event`)
frame while unwinding
Got it. Thanks a lot.
Sorry if it's confusing, I can explain the case of 'kdump' crash also if
needed.

And now about the invalid kernel virtual addresses, these are approaches
to handle it in my case:

1. Suppress those warnings when called from gdb:
    One way is to suppress those messages while fetching registers, but this
    might be counterproductive if it hides any other invalid accesses
2. Checking if this issue of NIP and SP mismatch is there, and print message in
gdb:
    I don't think we can even do that, since it's the kernel which gave us the
    wrong values, to gdb it's just a memory address, which it
    assumes is pointing to stack of some function, but all other values are
    random values (even considering that the stack has predefined format of
    where registers will be).

Crash tool only reads out the data from a core dump file, won't change(or modify) any data. Let's keep the original data, but we can give some warnings to inform users so that they know what exactly is going on, which could be helpful to debug(or identify issues).

Currently I am inclined to 2 if we have any ideas, or simply leave it as is,
since the registers are invalid so gdb mode will anyways not work.

Sounds reasonable. Let's read the code in detail based on your next post.

Thanks.

Lianbo


Thanks,
Aditya Gupta

Thanks.

Lianbo

# gdb /tmp/core.126506
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-12
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
"0x7fffe9bd9b38s": not in executable format: file format not recognized
(gdb) file crash
Reading symbols from crash...
(gdb) bt
No stack.

The patches are not intended to apply to gdb as such, but to provide the feature
to have backtrace in gdb mode inside crash-utility.

But the message by gdb seems to say it couldn't read the dump file:

"0x7fffe9bd9b38s": not in executable format: file format not recognized
I will try to cause a crash with upstream kernel and see if anything breaks.
Will let you know.

Thanks,
Aditya Gupta

Thanks.

Lianbo

On 12/4/23 22:59, Aditya Gupta wrote:
The Problem:
============

Currently crash is unable to show function arguments and local variables, as
gdb can do. And functionality for moving between frames ('up'/'down') is not
working in crash.

Crash has 'gdb passthroughs' for things gdb can do, but the gdb passthroughs
'bt', 'frame', 'info locals', 'up', 'down' are not working either, due to
gdb not getting the register values from `crash_target::fetch_registers`,
which then uses `machdep->get_cpu_reg`, which is not implemented for PPC64

Proposed Solution:
==================

Fix the gdb passthroughs by implementing "machdep->get_cpu_reg" for PPC64.
This way, "gdb mode in crash" will support this feature for both ELF and
kdump-compressed vmcore formats, while "gdb" would only have supported ELF
format

This way other features of 'gdb', such as seeing
backtraces/registers/variables/arguments/local variables, moving up and
down stack frames, can be used with any ppc64 vmcore, irrespective of
being ELF format or kdump-compressed format.

Implications on Architectures:
====================================

No architecture other than PPC64 has been affected, other than in case of
'frame' command

As mentioned in patch #2, since frame will not be prohibited, so it will print:

	crash> frame
	#0  <unavailable> in ?? ()

Instead of before prohibited message:

	crash> frame
	crash: prohibited gdb command: frame

Major change will be in 'gdb mode' on PPC64, that it will print the frames, and
local variables, instead of failing with errors showing no frame, or showing
that couldn't get PC, it will be able to give all this information.

Testing:
========

Git tree with this patch series applied:
https://github.com/adi-g15-ibm/crash/tree/stack-unwind-3

To test various gdb passthroughs:

	gdb> set
	gdb> set gdb on
	gdb> thread
	gdb> bt
	gdb> info threads
	gdb> info threads
	gdb> info locals
	gdb> info variables irq_rover_lock
	gdb> info args
	gdb> thread 2
	gdb> set gdb off
	gdb> set
	gdb> set -c 6
	gdb> gdb thread
	gdb> bt
	gdb> gdb bt
	gdb> frame
	gdb> up
	gdb> down
	gdb> info locals

Known Issues:
=============

1. In gdb mode, 'bt' might fail to show backtrace in few vmcores collected
      from older kernels. This is a known issue due to register mismatch, and
      its fix has been merged upstream:

Commit: https://github.com/torvalds/linux/commit/b684c09f09e7a6af3794d4233ef785819e72db79

Fixing GDB passthroughs on other architectures
==============================================

Much of the work for making gdb passthroughs like 'gdb bt', 'gdb
thread', 'gdb info locals' etc. has been done by the patches introducing
'machdep->get_cpu_reg' and this series fixing some issues in that.

Other architectures should be able to fix these gdb functionalities by
simply implementing 'machdep->get_cpu_reg (cpu, regno, ...)'.

The reasoning behind that has been explained with a diagram in commit
description of patch #1

I will assist with my findings/observations fixing it on ppc64 whenever needed.

Additional Notes:
=================

Sorry, it took a long time to send this version. Tried fixing 'info
threads' but wasn't able to. Gave it time again, and was able to fix it
this time after multiple days of debugging.

Some other things from last version review:

* 'info rv' not working:
     It's not supported in gdb, instead we need to use 'info locals rv' or
     'info variables rv'

* 'info variables' command hangs... and prints nothing after hanging for long
     It likely hangs due to a lot of symbols being there, and it's trying to
     get all gdb's output and page it, so Control+C messes it up, but if we pass
     a regex filter to limit the output, eg. info variables rq, then it doesn't
     hang, and prints the variables/symbols.
     Even with gdb, ie. simply running 'gdb vmlinux vmcore' also hangs due
     to the lot of symbols

* making crashing thread as default in gdb:
     This is implemented now, along with synchronising crash & gdb contexts, in
     patch #3

* 'info threads' not working:
     This turned to be due to a bug in gdb_interface. I fixed 'info
     threads' in 2 patches, to simplify it, first for the gdb_interface,
     and another patch for setting the context correctly in crash

* other info commands:
     I tested all the info commands, in crash along with this patch.
     Most of those that fail in crash are due to gdb itself not supporting
     them with vmcores, and other than that is the 'info pretty' command,
     which might not be needed in crash anyways

* live debugging showing only one thread:
     I tried it with crash, crash shows only the current thread, ie.
     itself, so it does not have information of registers for the other
     CPUs. Similarly gdb does not support live kernel debugging (without
     connecting to a gdbstub/QEMU etc.).
     If you need I can make it show the current thread id correctly for
     the one thread, but I don't think it might help much with live
     debugging

Hope, I set the context, thanks for the reviews, I replied and worked
on your suggestions, but got stuck there due to 'info threads'

Changelog:
==========

V3:
+ default gdb thread will be the crashing thread, instead of being
     thread '0'
+ synchronise crash cpu and gdb thread context
+ fix bug in gdb_interface, that replaced gdb's output stream, losing
     output in some cases, such as info threads and extra output in info
     variables
+ fix 'info threads'

RFC V2:
     - removed patch implementing 'frame', 'up', 'down' in crash
     - updated the cover letter by removing the mention of those commands other
	than the respective gdb passthrough

Aditya Gupta (5):
     ppc64: correct gdb passthroughs by implementing machdep->get_cpu_reg
     remove 'frame' from prohibited commands list
     synchronise cpu context changes between crash/gdb
     fix gdb_interface: restore gdb's output streams at end of
       gdb_interface
     fix 'info threads' command

    crash_target.c  |  44 ++++++++++++++++
    defs.h          | 130 +++++++++++++++++++++++++++++++++++++++++++++++-
    gdb-10.2.patch  | 110 +++++++++++++++++++++++++++++++++++++++-
    gdb_interface.c |   2 +-
    kernel.c        |  47 +++++++++++++++--
    ppc64.c         |  95 +++++++++++++++++++++++++++++++++--
    task.c          |  14 ++++++
    tools.c         |   2 +-
    8 files changed, 434 insertions(+), 10 deletions(-)

--
Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki




[Index of Archives]     [Fedora Development]     [Fedora Desktop]     [Fedora SELinux]     [Yosemite News]     [KDE Users]     [Fedora Tools]

 

Powered by Linux