[Crash-utility] Re: [PATCH v7 00/15] gdb stack unwinding support for crash utility

lijiang <lijiang@xxxxxxxxxx> · Mon, 4 Nov 2024 16:09:40 +0800

Thank you for working on this feature, Aditya, Tao and Alex. Great job!

For [PATCH v7 02/15 - 15/15], I rearranged them with minor changes; see the following commits:

[1] https://github.com/crash-utility/crash/commit/21e0a345f97324b3472d573ed20ef098f0300fac
[2] https://github.com/crash-utility/crash/commit/c4db469af091edd1ea0897fbce41bc175375314b
[3] https://github.com/crash-utility/crash/commit/7c8a7dddda66b3d1043ba99516de57691033154a
[4] https://github.com/crash-utility/crash/commit/1fd80c623c205443fdd2a29b14c5230a09984147
[5] https://github.com/crash-utility/crash/commit/6dfda0d2235574cf80530ea92e0ddff270f9c039
[6] https://github.com/crash-utility/crash/commit/89ff1e45734457eb66905ef656775fcfd1b46aec
[7] https://github.com/crash-utility/crash/commit/968debd0d5979dd9ddca3af0766bad714dbd51e3

BTW: there are still some known issues with this patchset, but none of them are critical, so they can be fixed later.

Reminder: the current patchset has changed some function interfaces, which may affect crash extensions.

Thanks
Lianbo

On Wed, Sep 11, 2024 at 10:25 AM lijiang <lijiang@xxxxxxxxxx> wrote:
Hi, Tao

Thank you for the update.

The following patch fixes a regression, so I would prefer to discuss it as a separate patch:
[PATCH v7 01/15] Fix the regression of cpumask_t for xen hyper

In addition, I found another issue in my tests (on ppc64le): gdb bt can display the backtrace for the panic task, but after I switch to another task, gdb bt cannot display the backtrace:

crash> gdb bt
#0  0xc0000000002bde04 in crash_setup_regs (newregs=0xc00000003264b858, oldregs=0x0) at ./arch/powerpc/include/asm/kexec.h:133
#1  0xc0000000002be4f8 in __crash_kexec (regs=0x0) at kernel/crash_core.c:122
#2  0xc00000000016c254 in panic (fmt=0xc0000000015eef20 "sysrq triggered crash\n") at kernel/panic.c:373
#3  0xc000000000a708b8 in sysrq_handle_crash (key=<optimized out>) at drivers/tty/sysrq.c:154
#4  0xc000000000a713d4 in __handle_sysrq (key=key@entry=99 'c', check_mask=check_mask@entry=false) at drivers/tty/sysrq.c:612
#5  0xc000000000a71e94 in write_sysrq_trigger (file=<optimized out>, buf=<optimized out>, count=2, ppos=<optimized out>) at drivers/tty/sysrq.c:1181
#6  0xc00000000073260c in pde_write (pde=0xc00000000af9cc00, file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:334
#7  proc_reg_write (file=<optimized out>, buf=<optimized out>, count=<optimized out>, ppos=<optimized out>) at fs/proc/inode.c:346
#8  0xc00000000063c0e0 in vfs_write (file=0xc0000000092d2900, buf=0x10012536f60 <error: Cannot access memory at address 0x10012536f60>, count=2, pos=0xc00000003264bd30) at fs/read_write.c:588
#9  vfs_write (file=0xc0000000092d2900, buf=0x10012536f60 <error: Cannot access memory at address 0x10012536f60>, count=<optimized out>, pos=0xc00000003264bd30) at fs/read_write.c:570
#10 0xc00000000063c690 in ksys_write (fd=<optimized out>, buf=0x10012536f60 <error: Cannot access memory at address 0x10012536f60>, count=2) at fs/read_write.c:643
#11 0xc000000000031a28 in system_call_exception (regs=0xc00000003264be80, r0=<optimized out>) at arch/powerpc/kernel/syscall.c:153
#12 0xc00000000000d05c in system_call_vectored_common () at arch/powerpc/kernel/interrupt_64.S:198

crash> ps
      PID    PPID  CPU       TASK        ST  %MEM      VSZ      RSS  COMM
        0       0   0  c000000002bda980  RU   0.0        0        0  [swapper/0]
>       0       0   1  c000000003864c80  RU   0.0        0        0  [swapper/1]
...
     8017     923   0  c000000043a20000  IN   0.2    22528    16256  sshd-session
     8025    8017   6  c000000032271880  IN   0.1    22784    11840  sshd-session
>    8026    8025   0  c000000043a26600  RU   0.1     9664     6208  bash
...
    11645       2   3  c000000032264c80  ID   0.0        0        0  [kworker/u32:2]
    11738    6188   2  c00000003811b180  IN   0.1    43520     9408  pickup
    12326       2   0  c00000003226b280  ID   0.0        0        0  [kworker/0:1]
    13112    6089   2  c00000000c809900  IN   0.0     7232     3456  sleep

Let's take the "pickup" task as an example:

crash> set 11738
    PID: 11738
COMMAND: "pickup"
   TASK: c00000003811b180  [THREAD_INFO: c00000003811b180]
    CPU: 2
  STATE: TASK_INTERRUPTIBLE

crash> gdb bt
#0  0xc0000000a7f876a0 in ?? ()
gdb: gdb request failed: bt
crash> set gdb on
gdb: on
gdb> bt
#0  0xc0000000a7f876a0 in ?? ()
gdb> 

I ran the same test on x86_64 and aarch64, and it works as expected there. Could you help double-check on the ppc64 architecture?

x86_64:
crash> set 14599
    PID: 14599
COMMAND: "pickup"
   TASK: ffff8f57a0d7c180  [THREAD_INFO: ffff8f57a0d7c180]
    CPU: 41
  STATE: TASK_INTERRUPTIBLE 
crash> gdb bt
#0  0xffffffff8b3efe29 in context_switch (rq=0xffff8f6f1f835900, prev=0xffff8f57a0d7c180, next=0xffff8f5786720000, rf=0xffff9df22fea7b80) at kernel/sched/core.c:5208
#1  __schedule (sched_mode=sched_mode@entry=0) at kernel/sched/core.c:6549
#2  0xffffffff8b3f0217 in __schedule_loop (sched_mode=<optimized out>) at kernel/sched/core.c:6626
#3  schedule () at kernel/sched/core.c:6641
#4  0xffffffff8b3f6eef in schedule_hrtimeout_range_clock (expires=expires@entry=0xffff9df22fea7cb0, delta=<optimized out>, delta@entry=99999999, mode=mode@entry=HRTIMER_MODE_ABS, clock_id=clock_id@entry=1) at kernel/time/hrtimer.c:2293
#5  0xffffffff8b3f7003 in schedule_hrtimeout_range (expires=expires@entry=0xffff9df22fea7cb0, delta=delta@entry=99999999, mode=mode@entry=HRTIMER_MODE_ABS) at kernel/time/hrtimer.c:2340
#6  0xffffffff8aae301c in ep_poll (ep=0xffff8f5790d15d40, events=events@entry=0x7ffea91b6b90, maxevents=maxevents@entry=100, timeout=timeout@entry=0xffff9df22fea7d58) at fs/eventpoll.c:2062
#7  0xffffffff8aae3138 in do_epoll_wait (epfd=epfd@entry=8, events=events@entry=0x7ffea91b6b90, maxevents=maxevents@entry=100, to=0xffff9df22fea7d58) at fs/eventpoll.c:2464
#8  0xffffffff8aae44a1 in __do_sys_epoll_wait (epfd=<optimized out>, events=0x7ffea91b6b90, maxevents=<optimized out>, timeout=<optimized out>) at fs/eventpoll.c:2476
#9  __se_sys_epoll_wait (epfd=<optimized out>, events=<optimized out>, maxevents=<optimized out>, timeout=<optimized out>) at fs/eventpoll.c:2471
#10 __x64_sys_epoll_wait (regs=<optimized out>) at fs/eventpoll.c:2471
#11 0xffffffff8b3e293d in do_syscall_x64 (regs=0xffff9df22fea7f48, nr=232) at arch/x86/entry/common.c:52
#12 do_syscall_64 (regs=0xffff9df22fea7f48, nr=232) at arch/x86/entry/common.c:83
#13 0xffffffff8b40012f in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121
crash> 

aarch64:
crash> set 9338
    PID: 9338
COMMAND: "pickup"
   TASK: ffff0000c7b05400  [THREAD_INFO: ffff0000c7b05400]
    CPU: 3
  STATE: TASK_INTERRUPTIBLE 
crash> gdb bt
#0  __switch_to (prev=<unavailable>, prev@entry=0xffff0000c7b05400, next=next@entry=<unavailable>) at arch/arm64/kernel/process.c:555
#1  0xffffafc5b5ebd744 in context_switch (rq=0xffff00077bbd0ec0, prev=0xffff0000c7b05400, next=<unavailable>, rf=0xffff80008ac63a60) at kernel/sched/core.c:5208
#2  __schedule (sched_mode=sched_mode@entry=0) at kernel/sched/core.c:6549
#3  0xffffafc5b5ebdc2c in __schedule_loop (sched_mode=<optimized out>) at kernel/sched/core.c:6626
#4  schedule () at kernel/sched/core.c:6641
#5  0xffffafc5b5ec6030 in schedule_hrtimeout_range_clock (expires=expires@entry=0xffff80008ac63be8, delta=delta@entry=99999999, mode=mode@entry=HRTIMER_MODE_ABS, clock_id=clock_id@entry=1) at kernel/time/hrtimer.c:2293
#6  0xffffafc5b5ec618c in schedule_hrtimeout_range (expires=expires@entry=0xffff80008ac63be8, delta=delta@entry=99999999, mode=mode@entry=HRTIMER_MODE_ABS) at kernel/time/hrtimer.c:2340
#7  0xffffafc5b545d33c in ep_poll (ep=<unavailable>, events=events@entry=0xffffde5c3f68, maxevents=maxevents@entry=100, timeout=timeout@entry=0xffff80008ac63ce0) at fs/eventpoll.c:2062
#8  0xffffafc5b545d4e4 in do_epoll_wait (epfd=epfd@entry=8, events=events@entry=0xffffde5c3f68, maxevents=maxevents@entry=100, to=to@entry=0xffff80008ac63ce0) at fs/eventpoll.c:2464
#9  0xffffafc5b545d534 in do_epoll_pwait (epfd=epfd@entry=8, events=events@entry=0xffffde5c3f68, maxevents=maxevents@entry=100, to=to@entry=0xffff80008ac63ce0, sigsetsize=<optimized out>, sigmask=<optimized out>) at fs/eventpoll.c:2498
#10 0xffffafc5b545e7c8 in do_epoll_pwait (epfd=8, events=0xffffde5c3f68, maxevents=100, to=0xffff80008ac63ce0, sigmask=<optimized out>, sigsetsize=<optimized out>) at fs/eventpoll.c:2495
#11 __do_sys_epoll_pwait (epfd=8, events=0xffffde5c3f68, maxevents=100, timeout=<optimized out>, sigmask=<optimized out>, sigsetsize=<optimized out>) at fs/eventpoll.c:2511
#12 __se_sys_epoll_pwait (epfd=8, events=281474412330856, maxevents=100, timeout=<optimized out>, sigmask=<optimized out>, sigsetsize=<optimized out>) at fs/eventpoll.c:2505
#13 __arm64_sys_epoll_pwait (regs=<optimized out>) at fs/eventpoll.c:2505
#14 0xffffafc5b4fa99bc in __invoke_syscall (regs=0xffff80008ac63eb0, syscall_fn=<optimized out>) at arch/arm64/kernel/syscall.c:35
#15 invoke_syscall (regs=regs@entry=0xffff80008ac63eb0, scno=<optimized out>, sc_nr=sc_nr@entry=463, syscall_table=<optimized out>) at arch/arm64/kernel/syscall.c:49
#16 0xffffafc5b4fa9ac8 in el0_svc_common (sc_nr=463, syscall_table=<optimized out>, regs=0xffff80008ac63eb0, scno=<optimized out>) at arch/arm64/kernel/syscall.c:132
#17 do_el0_svc (regs=regs@entry=0xffff80008ac63eb0) at arch/arm64/kernel/syscall.c:151
#18 0xffffafc5b5eb6fa4 in el0_svc (regs=0xffff80008ac63eb0) at arch/arm64/kernel/entry-common.c:712
#19 0xffffafc5b5eb74c0 in el0t_64_sync_handler (regs=<optimized out>) at arch/arm64/kernel/entry-common.c:730
#20 0xffffafc5b4f91634 in el0t_64_sync () at arch/arm64/kernel/entry.S:598
crash>

BTW: the other changes look fine to me.

Thanks
Lianbo

On Wed, Sep 4, 2024 at 3:54 PM <devel-request@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

Date: Wed,  4 Sep 2024 19:49:25 +1200
From: Tao Liu <ltao@xxxxxxxxxx>
Subject: [PATCH v7 00/15] gdb stack unwinding support for crash utility
To: devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
Cc: Tao Liu <ltao@xxxxxxxxxx>
Message-ID: <20240904074940.21331-1-ltao@xxxxxxxxxx>
Content-Type: text/plain; charset=UTF-8

This patchset is a rebased/merged version of the following 3 patchsets:

1): [PATCH v10 0/5] Improve stack unwind on ppc64 [1]
2): [PATCH 0/5] x86_64 gdb stack unwinding support [2]
3): Clean up on top of one-thread-v2 [3]

A complete description of gdb stack unwinding support for crash can be
found in [1].

This patchset can be divided into the following 3 parts:

1) part1: preparations before stack unwinding support, fixing some
          bugs/regressions found when drafting this patchset.
2) part2: common part for all CPU archs, mainly dealing with the
          crash_target.c/gdb_interface.c files, in order to
          support different archs.
3) part3: arch specific, stack unwinding support for each of
          ppc64/x86_64/arm64/vmware.

=== part 3
arm64: Add gdb stack unwinding support
vmware_guestdump: Various format versions support
x86_64: Add gdb stack unwinding support
ppc64: correct gdb passthroughs by implementing machdep->get_current_task_reg

=== part 2
Conditionally output gdb stack unwinding stop reasons
Stop stack unwinding at non-kernel address
Print task pid/command instead of CPU index
Rename get_cpu_reg to get_current_task_reg
Let crash change gdb context
Leave only one gdb thread for crash
Remove 'frame' from prohibited commands list

=== part 1
Fix gdb_interface: restore gdb's output streams at end of gdb_interface
x86_64: Fix invalid input "=>" for bt command
Fix cpumask_t recursive dependence issue
Fix the regression of cpumask_t for xen hyper

===
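
As a rough illustration of the idea behind "Stop stack unwinding at non-kernel address" (this is not the code from that patch, which is not shown in this message), unwinding can be cut off at the first frame whose PC falls outside the kernel's address range. In the sketch below, sketch_is_kernel_addr(), sketch_unwind_depth() and the SKETCH_KERNEL_START bound are all hypothetical names and values chosen for illustration; a real implementation would presumably rely on crash's own per-architecture address checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lower bound of kernel virtual addresses; a real tool
 * would take this from machine-specific data, not a constant. */
#define SKETCH_KERNEL_START 0xc000000000000000ULL

/* Returns 1 if addr looks like a kernel address under the assumption above. */
static int sketch_is_kernel_addr(uint64_t addr)
{
    return addr >= SKETCH_KERNEL_START;
}

/*
 * Walk the first n frame PCs and return how many frames should be
 * reported: the walk stops at the first PC outside the kernel range.
 */
static size_t sketch_unwind_depth(const uint64_t *pcs, size_t n)
{
    size_t depth = 0;

    assert(pcs != NULL || n == 0);
    while (depth < n && sketch_is_kernel_addr(pcs[depth]))
        depth++;
    return depth;
}
```

With frame PCs of the form {kernel, kernel, user, kernel}, sketch_unwind_depth() returns 2, so only the first two frames would be reported.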

v7 -> v6:

1) Reorganised the patchset, dividing it into 3 parts instead of the
   previous 2.
2) Reworked the cpumask_t part, which addresses comment No.4 raised by
   Lianbo in [4].
3) Added conditional output for the failure message of gdb stack unwinding,
   see [PATCH 11/15] Conditionally output gdb stack unwinding stop reasons.
4) Redrafted the commit messages and updated some outdated info.
5) Merged "Let crash change gdb context" and "set_context(): check if
   context is already current" into one.

[4]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg01067.html

v6 -> v5:

1) Refactored patches 4 & 9, which changed the function signature of
   get_cpu_reg/get_current_task_reg, so that each patch compiles without
   error when applied in order.
2) Rebased the patchset on top of latest upstream:
   ("79b93ecb2e72ec Fix a "Bus error" issue caused by 'crash --osrelease' or
   crash loading")

v5 -> v4:

1) Plenty of code refactoring based on Lianbo's comments on v4.
2) Removed the magic number when dealing with regs bitmap, see [6].
3) Rebased the patchset on top of latest upstream:
   ("1c6da3eaff8207 arm64: Fix bt command show wrong stacktrace on ramdump source")
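
To illustrate what removing a magic number from register-bitmap handling can look like in general (a sketch only; the actual change is the one discussed in [6] and is not reproduced here), named register indices and a bit macro can replace literal masks. Every identifier below (sketch_reg, SKETCH_REG_BIT, sketch_set_reg, sketch_has_reg) is hypothetical and is not taken from crash.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices; real numbering is arch-specific. */
enum sketch_reg {
    SKETCH_REG_PC = 0,
    SKETCH_REG_SP,
    SKETCH_REG_FP,
    SKETCH_REG_COUNT
};

/* One bit per register: a set bit means the saved value is valid. */
typedef uint32_t sketch_regs_bitmap;

/* Named bit for register r, instead of a literal such as 0x2. */
#define SKETCH_REG_BIT(r) ((sketch_regs_bitmap)1u << (r))

/* Mark register r as valid in the bitmap. */
static void sketch_set_reg(sketch_regs_bitmap *map, enum sketch_reg r)
{
    assert(r < SKETCH_REG_COUNT);
    *map |= SKETCH_REG_BIT(r);
}

/* Returns 1 if register r is marked valid, 0 otherwise. */
static int sketch_has_reg(sketch_regs_bitmap map, enum sketch_reg r)
{
    assert(r < SKETCH_REG_COUNT);
    return (map & SKETCH_REG_BIT(r)) != 0;
}
```

A caller would then write sketch_set_reg(&map, SKETCH_REG_SP) rather than map |= 0x2, which keeps the meaning of each bit visible at the call site.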

v4 -> v3:

Fixed the author issue in [PATCH v3 06/16] Fix gdb_interface: restore gdb's
output streams at end of gdb_interface.

v3 -> v2:

1) Updated the CC list as pointed out in [4].
2) Fixed the compiling issues reported in [5].

v2 -> v1:

1) Added the patch: x86_64: Fix invalid input "=>" for bt command,
   thanks to Kazu for testing.
2) Modified the patch: x86_64: Add gdb stack unwinding support, adding
   pcp_save, spp_save and sp to restore the values, matching the original
   code logic.

[1]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00469.html
[2]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00488.html
[3]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00554.html
[4]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00681.html
[5]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00715.html
[6]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00819.html

Aditya Gupta (3):
  Fix gdb_interface: restore gdb's output streams at end of
    gdb_interface
  Remove 'frame' from prohibited commands list
  ppc64: correct gdb passthroughs by implementing
    machdep->get_current_task_reg

Alexey Makhalov (1):
  vmware_guestdump: Various format versions support

Tao Liu (11):
  Fix the regression of cpumask_t for xen hyper
  Fix cpumask_t recursive dependence issue
  x86_64: Fix invalid input "=>" for bt command
  Leave only one gdb thread for crash
  Let crash change gdb context
  Rename get_cpu_reg to get_current_task_reg
  Print task pid/command instead of CPU index
  Stop stack unwinding at non-kernel address
  Conditionally output gdb stack unwinding stop reasons
  x86_64: Add gdb stack unwinding support
  arm64: Add gdb stack unwinding support

 arm64.c            | 120 +++++++++++++++--
 crash_target.c     |  71 ++++++----
 defs.h             | 194 ++++++++++++++++++++++++++-
 gdb-10.2.patch     |  96 ++++++++++++++
 gdb_interface.c    |  39 ++----
 kernel.c           |  63 +++++++--
 ppc64.c            | 174 +++++++++++++++++++++++-
 symbols.c          |  15 +++
 task.c             |  34 +++--
 tools.c            |  16 ++-
 unwind_x86_64.h    |   4 -
 vmware_guestdump.c | 321 +++++++++++++++++++++++++++++++-------------
 x86_64.c           | 323 ++++++++++++++++++++++++++++++++++++++++-----
 13 files changed, 1247 insertions(+), 223 deletions(-)

-- 
2.40.1
