[Crash-utility] Re: [PATCH v3 0/5] Improve stack unwind on ppc64

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Tao,

On Wed, Dec 13, 2023 at 09:03:37PM +0800, Tao Liu wrote:
> Hi Aditya,
> 
> I encountered a problem for analyze the ppc64 vmcore after applied all
> patches in the patchset:
> 
> crash> gdb bt
> #0  0xc000000000279d98 in crash_setup_regs (gdb: invalid kernel
> virtual address: fffffffffffffffb  type: "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff7  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff3  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffffb  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff7  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff3  type:
> "gdb_readmem callback"
> oldregs=<optimized out>, newregs=0xc000000012e87968) at
> ./arch/powerpc/include/asm/kexec.h:69
> #1  __crash_kexec (regs=<optimized out>) at kernel/kexec_core.c:975
> #2  0xfffffffffffffffb in ?? ()
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> crash> gdb info threads
>   Id   Target Id         Frame
>   1    CPU 0             plpar_hcall_norets_notrace () at
> arch/powerpc/platforms/pseries/hvCall.S:112
> * 2    CPU 1             0xc000000000279d98 in crash_setup_regs (gdb:
> invalid kernel virtual address: fffffffffffffffb  type: "gdb_readmem
> callback"
> gdb: invalid kernel virtual address: fffffffffffffff7  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff3  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffffb  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff7  type:
> "gdb_readmem callback"
> gdb: invalid kernel virtual address: fffffffffffffff3  type:
> "gdb_readmem callback"
> oldregs=<optimized out>, newregs=0xc000000012e87968) at
> ./arch/powerpc/include/asm/kexec.h:69
> 
> Seems the crash stack unwinding gave a wrong value to gdb. I tried for
> some time to find out the root cause but got unlucky. Hope you can
> help me out. I can give you the vmcore to analyze this issue in
> another mail. Thanks in advance!

This is due to a kernel bug actually. It was backported in stable kernel
versions, will try to ask RHEL to include those fixes if possible.

Rest of the mail is exactly same as in reply to Lianbo, please skip if already read.

Thanks,
Aditya Gupta

---

Okay, I just recalled this is a known issue with the kernel code to save
registers, and I fixed it in upstream linux, with the commit:

commit b684c09f09e7a6af3794d4233ef785819e72db79
Author: Aditya Gupta <adityag@xxxxxxxxxxxxx>
Date:   Thu Jun 15 14:40:47 2023 +0530

    powerpc: update ppc_save_regs to save current r1 in pt_regs

Please bear with me for the long explaination.

Basically before this commit, the coredump used to have `nip` (program counter)
pointing to the latest function call, BUT `r1` (stack pointer) pointing at
the 2nd latest function call.

And due to this, actually crash's backtrace itself is slightly wrong. crash
actually skips printing one topmost frame, and shows wrong instruction pointer
for top frame

In case of kdump, it skips 'crash_setup_regs''s activation frame, but that is
not very noticeable since it's inlined in '__crash_kexec', I will use 'fadump's'
example since it's easier to notice the issue there:

Coming back to gdb mode.
This is gdb's backtrace, for a fadump crash caused by 'echo c >
/proc/sysrq-trigger':

crash> gdb bt
#0  0xc0000000000533f0 in crash_fadump (regs=0x0, str=0xc000000002bfc510 <buf> "sysrq triggered crash") at arch/powerpc/kernel/fadump.c:734
#1  0xc00000000727fba0 in ?? ()

This is crash's bt and registers, for the same crash:

crash> bt
PID: 32170    TASK: c00000000f82e500  CPU: 0    COMMAND: "bash"
 R0:  c0000000000532d4    R1:  """c00000000727fa90"""    R2:  c000000002b13000
 ...
 NIP: c0000000000533f0    MSR: 8000000000001033    OR3: 0000000000000000
 CTR: 0000000000000000    LR:  c00000000002d430    XER: 0000000020040004
 [NIP  : crash_fadump+560]
 [LR   : ppc_panic_event+96]
 #0 [c00000000727fa30] crash_fadump at c00000000005333c
 #1 ["""c00000000727fa90"""] ppc_panic_event at c00000000002d430
 #2 [c00000000727fac0] atomic_notifier_call_chain at c000000000186d08
 #3 [c00000000727fb00] panic at c0000000001492dc
 #4 [c00000000727fba0] sysrq_handle_crash at c00000000094ad98
 #5 [c00000000727fc00] __handle_sysrq at c00000000094b8bc
 #6 [c00000000727fca0] write_sysrq_trigger at c00000000094c148
 #7 [c00000000727fce0] proc_reg_write at c00000000063c78c
 #8 [c00000000727fd10] vfs_write at c00000000056a0e4
 #9 [c00000000727fd60] ksys_write at c00000000056a694
crash> info reg r1 pc
r1             0xc00000000727fa90  13835058055402224272
pc             0xc0000000000533f0  0xc0000000000533f0 <crash_fadump+560>

Notice that the `pc` register is pointing to `crash_fadump`, but `r1` (ie. the
stack pointer) is in the function activation frame of `ppc_panic_event`.

GDB uses the `pc` register to get the function name, but reads other registers
according to debuginfo of the function it got from `pc`.
Since `pc` is in `crash_fadump`, it takes it's debuginfo, and let's say the
debuginfo says to find register LR at +46 offset from `r1`, thinking `r1` will
be the stack pointer of `crash_fadump`.
But instead since `r1` is inside `ppc_panic_event`, it accesses
`ppc_panic_event`'s frame, and due to this mismatch, it mostly reads some
invalid value for the registers, and the unwinding fails.

This is not due to gdb or the patch series, and instead was an issue with
storing registers in the kernel itself

Crash is not affected by this, since it simply reads the stack as an
array, reading 0th offset from SP to get caller's SP (backchain), and 16th
offset from SP to get LR. While gdb is trying to use some other function's
(`crash_fadump`) debuginfo to some other function's (`ppc_panic_event`)
frame while unwinding

Sorry if it's confusing, I can explain the case of 'kdump' crash also if
needed.

And now about the invalid kernel virtual addresses, these are approaches
to handle it in my case:

1. Suppress those warnings when called from gdb:
   One way is to suppress those messages while fetching registers, but this
   might be counterproductive if it hides any other invalid accesses
2. Checking if this issue of NIP and SP mismatch is there, and print message in
gdb:
   I don't think we can even do that, since it's the kernel which gave us the
   wrong values, to gdb it's just a memory address, which it
   assumes is pointing to stack of some function, but all other values are
   random values (even considering that the stack has predefined format of
   where registers will be).

Currently I am inclined to 2 if we have any ideas, or simply leave it as is,
since the registers are invalid so gdb mode will anyways not work.

Thanks,
Aditya Gupta

> 
> Currently I have made the x86_64 stack unwinding work based on your
> patchset. And I plan to post it upstream once your patchsets get
> merged. In addition, is there a plan to support the stack unwinding
> for live debugging in ppc64 arch? I think it is a useful feature
> too...
> 
> Thanks,
> Tao Liu
> 
> 
> 
> 
> 
> On Tue, Dec 12, 2023 at 12:51 PM Aditya Gupta <adityag@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Dec 11, 2023 at 08:04:50PM +0800, Lianbo Jiang wrote:
> > > On 12/9/23 20:45, Aditya Gupta wrote:
> > >
> > > > Hi, just a ping. Any comments on the series ?
> > >
> > > Hi, Aditya
> > >
> > >
> > > Thank you for the update. I will have a look and do the tests this week. And
> > > give some feedback.
> >
> > Sure. Thanks Lianbo.
> >
> > - Aditya Gupta
> >
> > >
> > > Thanks.
> > >
> > > Lianbo
> > >
> > > >
> > > > On Mon, Dec 04, 2023 at 08:29:36PM +0530, Aditya Gupta wrote:
> > > > > The Problem:
> > > > > ============
> > > > >
> > > > > Currently crash is unable to show function arguments and local variables, as
> > > > > gdb can do. And functionality for moving between frames ('up'/'down') is not
> > > > > working in crash.
> > > > >
> > > > > Crash has 'gdb passthroughs' for things gdb can do, but the gdb passthroughs
> > > > > 'bt', 'frame', 'info locals', 'up', 'down' are not working either, due to
> > > > > gdb not getting the register values from `crash_target::fetch_registers`,
> > > > > which then uses `machdep->get_cpu_reg`, which is not implemented for PPC64
> > > > >
> > > > > Proposed Solution:
> > > > > ==================
> > > > >
> > > > > Fix the gdb passthroughs by implementing "machdep->get_cpu_reg" for PPC64.
> > > > > This way, "gdb mode in crash" will support this feature for both ELF and
> > > > > kdump-compressed vmcore formats, while "gdb" would only have supported ELF
> > > > > format
> > > > >
> > > > > This way other features of 'gdb', such as seeing
> > > > > backtraces/registers/variables/arguments/local variables, moving up and
> > > > > down stack frames, can be used with any ppc64 vmcore, irrespective of
> > > > > being ELF format or kdump-compressed format.
> > > > >
> > > > > Implications on Architectures:
> > > > > ====================================
> > > > >
> > > > > No architecture other than PPC64 has been affected, other than in case of
> > > > > 'frame' command
> > > > >
> > > > > As mentioned in patch #2, since frame will not be prohibited, so it will print:
> > > > >
> > > > >   crash> frame
> > > > >   #0  <unavailable> in ?? ()
> > > > >
> > > > > Instead of before prohibited message:
> > > > >
> > > > >   crash> frame
> > > > >   crash: prohibited gdb command: frame
> > > > >
> > > > > Major change will be in 'gdb mode' on PPC64, that it will print the frames, and
> > > > > local variables, instead of failing with errors showing no frame, or showing
> > > > > that couldn't get PC, it will be able to give all this information.
> > > > >
> > > > > Testing:
> > > > > ========
> > > > >
> > > > > Git tree with this patch series applied:
> > > > > https://github.com/adi-g15-ibm/crash/tree/stack-unwind-3
> > > > >
> > > > > To test various gdb passthroughs:
> > > > >
> > > > >   gdb> set
> > > > >   gdb> set gdb on
> > > > >   gdb> thread
> > > > >   gdb> bt
> > > > >   gdb> info threads
> > > > >   gdb> info threads
> > > > >   gdb> info locals
> > > > >   gdb> info variables irq_rover_lock
> > > > >   gdb> info args
> > > > >   gdb> thread 2
> > > > >   gdb> set gdb off
> > > > >   gdb> set
> > > > >   gdb> set -c 6
> > > > >   gdb> gdb thread
> > > > >   gdb> bt
> > > > >   gdb> gdb bt
> > > > >   gdb> frame
> > > > >   gdb> up
> > > > >   gdb> down
> > > > >   gdb> info locals
> > > > >
> > > > > Known Issues:
> > > > > =============
> > > > >
> > > > > 1. In gdb mode, 'bt' might fail to show backtrace in few vmcores collected
> > > > >     from older kernels. This is a known issue due to register mismatch, and
> > > > >     its fix has been merged upstream:
> > > > >
> > > > > Commit: https://github.com/torvalds/linux/commit/b684c09f09e7a6af3794d4233ef785819e72db79
> > > > >
> > > > > Fixing GDB passthroughs on other architectures
> > > > > ==============================================
> > > > >
> > > > > Much of the work for making gdb passthroughs like 'gdb bt', 'gdb
> > > > > thread', 'gdb info locals' etc. has been done by the patches introducing
> > > > > 'machdep->get_cpu_reg' and this series fixing some issues in that.
> > > > >
> > > > > Other architectures should be able to fix these gdb functionalities by
> > > > > simply implementing 'machdep->get_cpu_reg (cpu, regno, ...)'.
> > > > >
> > > > > The reasoning behind that has been explained with a diagram in commit
> > > > > description of patch #1
> > > > >
> > > > > I will assist with my findings/observations fixing it on ppc64 whenever needed.
> > > > >
> > > > > Additional Notes:
> > > > > =================
> > > > >
> > > > > Sorry, it took a long time to send this version. Tried fixing 'info
> > > > > threads' but wasn't able to. Gave it time again, and was able to fix it
> > > > > this time after multiple days of debugging.
> > > > >
> > > > > Some other things from last version review:
> > > > >
> > > > > * 'info rv' not working:
> > > > >    It's not supported in gdb, instead we need to use 'info locals rv' or
> > > > >    'info variables rv'
> > > > >
> > > > > * 'info variables' command hangs... and prints nothing after hanging for long
> > > > >    It likely hangs due to a lot of symbols being there, and it's trying to
> > > > >    get all gdb's output and page it, so Control+C messes it up, but if we pass
> > > > >    a regex filter to limit the output, eg. info variables rq, then it doesn't
> > > > >    hang, and prints the variables/symbols.
> > > > >    Even with gdb, ie. simply running 'gdb vmlinux vmcore' also hangs due
> > > > >    to the lot of symbols
> > > > >
> > > > > * making crashing thread as default in gdb:
> > > > >    This is implemented now, along with synchronising crash & gdb contexts, in
> > > > >    patch #3
> > > > >
> > > > > * 'info threads' not working:
> > > > >    This turned to be due to a bug in gdb_interface. I fixed 'info
> > > > >    threads' in 2 patches, to simplify it, first for the gdb_interface,
> > > > >    and another patch for setting the context correctly in crash
> > > > >
> > > > > * other info commands:
> > > > >    I tested all the info commands, in crash along with this patch.
> > > > >    Most of those that fail in crash are due to gdb itself not supporting
> > > > >    them with vmcores, and other than that is the 'info pretty' command,
> > > > >    which might not be needed in crash anyways
> > > > >
> > > > > * live debugging showing only one thread:
> > > > >    I tried it with crash, crash shows only the current thread, ie.
> > > > >    itself, so it does not have information of registers for the other
> > > > >    CPUs. Similarly gdb does not support live kernel debugging (without
> > > > >    connecting to a gdbstub/QEMU etc.).
> > > > >    If you need I can make it show the current thread id correctly for
> > > > >    the one thread, but I don't think it might help much with live
> > > > >    debugging
> > > > >
> > > > > Hope, I set the context, thanks for the reviews, I replied and worked
> > > > > on your suggestions, but got stuck there due to 'info threads'
> > > > >
> > > > > Changelog:
> > > > > ==========
> > > > >
> > > > > V3:
> > > > > + default gdb thread will be the crashing thread, instead of being
> > > > >    thread '0'
> > > > > + synchronise crash cpu and gdb thread context
> > > > > + fix bug in gdb_interface, that replaced gdb's output stream, losing
> > > > >    output in some cases, such as info threads and extra output in info
> > > > >    variables
> > > > > + fix 'info threads'
> > > > >
> > > > > RFC V2:
> > > > >    - removed patch implementing 'frame', 'up', 'down' in crash
> > > > >    - updated the cover letter by removing the mention of those commands other
> > > > >   than the respective gdb passthrough
> > > > >
> > > > > Aditya Gupta (5):
> > > > >    ppc64: correct gdb passthroughs by implementing machdep->get_cpu_reg
> > > > >    remove 'frame' from prohibited commands list
> > > > >    synchronise cpu context changes between crash/gdb
> > > > >    fix gdb_interface: restore gdb's output streams at end of
> > > > >      gdb_interface
> > > > >    fix 'info threads' command
> > > > >
> > > > >   crash_target.c  |  44 ++++++++++++++++
> > > > >   defs.h          | 130 +++++++++++++++++++++++++++++++++++++++++++++++-
> > > > >   gdb-10.2.patch  | 110 +++++++++++++++++++++++++++++++++++++++-
> > > > >   gdb_interface.c |   2 +-
> > > > >   kernel.c        |  47 +++++++++++++++--
> > > > >   ppc64.c         |  95 +++++++++++++++++++++++++++++++++--
> > > > >   task.c          |  14 ++++++
> > > > >   tools.c         |   2 +-
> > > > >   8 files changed, 434 insertions(+), 10 deletions(-)
> > > > >
> > > > > --
> > > > > 2.41.0
> > > > >
> > >
> > --
> > Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> > To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
> > Contribution Guidelines: https://github.com/crash-utility/crash/wiki
> 
--
Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki




[Index of Archives]     [Fedora Development]     [Fedora Desktop]     [Fedora SELinux]     [Yosemite News]     [KDE Users]     [Fedora Tools]

 

Powered by Linux