[Crash-utility] Re: [PATCH 2/5] Enable crash to change gdb thread context

Tao Liu <ltao@xxxxxxxxxx> · Fri, 15 Mar 2024 10:15:20 +0800

Hi Alexey,

On Fri, Mar 15, 2024 at 10:13 AM Tao Liu <ltao@xxxxxxxxxx> wrote:
>
> Hi Alexey,
>
> On Fri, Mar 15, 2024 at 6:55 AM Alexey Makhalov
> <alexey.makhalov@xxxxxxxxxxxx> wrote:
> >
> > Please find a patch in "crash_target: Support for GDB debugging of all tasks" mail thread. Use the latest version 3 of the patch. Thanks, --Alexey
> >
> > On Wed, Mar 13, 2024 at 10:41 PM Aditya Gupta <adityag@xxxxxxxxxxxxx> wrote:
> >>
> >> Hi Alexey,
> >>
> >> On Wed, Mar 13, 2024 at 12:42:22PM -0700, Alexey Makhalov wrote:
> >> > Hi folks,
> >> >
> >> > I'm working on an improvement in crash_target for gdb to backtrace and
> >> > debug (including local frame variables) all tasks, not just active
> >> > tasks/CPUs
> >> > I will send the patch shortly.
> >>
> >> Thanks for the interest in this. Just in case, with both patch series
> >> combined, it should be possible to backtrace and run info locals, etc,
> >> on arbitrary tasks also. Interested in your patch though :)
> >>
> Yeah, with Aditya's "Improve stack unwind on ppc64" and my "x86_64 gdb
> stack unwinding support" patch series, arbitrary tasks' stack
> unwinding/local variables can be viewed by gdb. So there might be some
> work we have repeated, and I'd like to take reference of your patch
> and work together for this feature. In addition, I didn't find the
> "crash_target: Support for GDB debugging of all tasks" patch, could
> you please just share a link to it?

Found it, sorry for that.

Thanks,
Tao Liu

>
> Thanks,
> Tao Liu
>
> >> Thanks,
> >> Aditya Gupta
> >>
> >> >
> >> > Thanks,
> >> > --Alexey
> >> >
> >> > On Wed, Mar 13, 2024 at 2:34 AM Aditya Gupta <adityag@xxxxxxxxxxxxx> wrote:
> >> >
> >> > > Hi Tao,
> >> > > On Wed, Mar 13, 2024 at 11:23:56AM +0800, Tao Liu wrote:
> >> > > > Hi Aditya,
> >> > > >
> >> > > > On Tue, Mar 12, 2024 at 5:12 PM Aditya Gupta <adityag@xxxxxxxxxxxxx>
> >> > > wrote:
> >> > > > >
> >> > > > > Hi Tao,
> >> > > > >
> >> > > > > > <...>
> >> > > > > >
> >> > > > > > +crash_target *target = NULL;
> >> > > > > > +
> >> > > > > >  void
> >> > > > > >  crash_target_init (void)
> >> > > > > >  {
> >> > > > > >    int nr_cpus = crash_get_nr_cpus();
> >> > > > > > -  crash_target *target = new crash_target ();
> >> > > > > > +  target = new crash_target ();
> >> > > > > >
> >> > > > > >    /* Own the target until it is successfully pushed.  */
> >> > > > > >    target_ops_up target_holder (target);
> >> > > > > > @@ -137,29 +139,35 @@ crash_target_init (void)
> >> > > > > >    reinit_frame_cache ();
> >> > > > > >  }
> >> > > > > >
> >> > > > > > -/*
> >> > > > > > - * Change gdb's thread context to the thread on given CPU
> >> > > > > > - **/
> >> > > > > >  extern "C" int
> >> > > > > > -gdb_change_cpu_context(unsigned int cpu)
> >> > > > > > +gdb_change_thread_context (ulong task)
> >> > > > > >  {
> >> > > > > > -  ptid_t ptid = ptid_t(CRASH_INFERIOR_PID, 0, cpu);
> >> > > > > > -  inferior *inf = current_inferior ();
> >> > > > > > -  thread_info *tp = find_thread_ptid (inf, ptid);
> >> > > > > > -
> >> > > > > > -  if (tp == nullptr)
> >> > > > > > +  int tried = 0;
> >> > > > > > +  inferior* inf = current_inferior();
> >> > > > > > +  int cpu = crash_set_thread(task);
> >> > > > > > +  if (cpu < 0)
> >> > > > > >      return FALSE;
> >> > > > > >
> >> > > > > > -  /* Making sure that crash's context is same */
> >> > > > > > -  set_cpu(cpu, FALSE);
> >> > > > > > -
> >> > > > > > -  /* Switch to the thread */
> >> > > > > > -  switch_to_thread(tp);
> >> > > > > > -
> >> > > > > > -  /* Fetch/Refresh thread's registers */
> >> > > > > > -  gdb_refresh_regcache(cpu);
> >> > > > > > +  ptid_t ptid = ptid_t(CRASH_INFERIOR_PID, 0, cpu);
> >> > > > > >
> >> > > > > > -  return TRUE;
> >> > > > > > +retry:
> >> > > > > > +   thread_info *tp = find_thread_ptid (inf, ptid);
> >> > > > > > +   if (tp == nullptr && !tried) {
> >> > > > > > +     thread_info *thread = add_thread_silent(target,
> >> > > > > > +                             ptid_t(CRASH_INFERIOR_PID, 0, cpu));
> >> > > > >
> >> > > > > A minor comment.
> >> > > > > Is it possible to do it without needing 'target' as global ? I see
> >> > > > > there's 'inf->current_top_target()', but it's return type if
> >> > > target_ops,
> >> > > > > don't know if it can be used.
> >> > > > >
> >> > > > Thanks for your comments, however I guess it is not workable for the
> >> > > > code change. What we want is to add a thread to the target. I found
> >> > > > one function "target.c:new_thread", which can be used as
> >> > > > "new_thread(current_inferior(), ptid_t(CRASH_INFERIOR_PID, 0, cpu));"
> >> > > > to replace the "add_thread_silent()" code, except we need to make
> >> > > > "new_thread" to be non-static.
> >> > > >
> >> > > > I prefer to keep the code as current, because I don't see the side
> >> > > > effect of making target as global variable. What do you think?
> >> > >
> >> > > Then there doesn't seem to be an alternative, then I guess we will have
> >> > > to keep the global variable, seems okay to me.
> >> > >
> >> > > >
> >> > > > > > <...>
> >> > > > > >
> >> > > > > >  /*
> >> > > > > >   *  Collect the irq_desc[] entry along with its associated handler
> >> > > and
> >> > > > > > diff --git a/task.c b/task.c
> >> > > > > > index a405b05..ef79f53 100644
> >> > > > > > --- a/task.c
> >> > > > > > +++ b/task.c
> >> > > > > > @@ -715,7 +715,8 @@ task_init(void)
> >> > > > > >        * crash_target::fetch_registers, so CPU 0's registers are
> >> > > shown as
> >> > > > > >        * <unavailable> in gdb mode
> >> > > > > >        * */
> >> > > > > > -     gdb_refresh_regcache(0);
> >> > > > > > +     for (int i = 0; i < get_cpus_online(); i++)
> >> > > > > > +             gdb_refresh_regcache(i);
> >> > > > >
> >> > > > > Does x86 require all regcache(s) to be refreshed at init time ?
> >> > > > >
> >> > > > > Till v4 of my ppc series, I also was doing refresh of all CPUs, but
> >> > > it's
> >> > > > > better to do it when required, and there was a review in v4, patch 5
> >> > > > > from Lianbo, quoting it here:
> >> > > > >
> >> > > > > > In addition, I haven't tested how far it takes the time in the
> >> > > > > > multi-core environment, is it possible to implement a smaller
> >> > > > > > granularity refresh operation? For example: When we need switch to
> >> > > > > > another cpu, execute the refresh operation for corresponding
> >> > > regcache.
> >> > > > >
> >> > > > > CPU 0's refresh is required due to gdb having already tried to
> >> > > > > initialised registers before we initialised crash_target
> >> > > > >
> >> > > > > This might add a small noticeable delay on vmcores from big systems.
> >> > > > >
> >> > > > > Adds around 4.6 seconds for a ppc64 vmcore with 64 CPUs, in crash init
> >> > > > > time, while reading it on a fast 48 cpu Power8 system.
> >> > > > > This overhead is less for systems with less CPUs, eg. ~0.43 seconds, on
> >> > > > > a vmcore of 8 CPUs x86_64 system.
> >> > > >
> >> > > > Yes, thanks for the testing results. There do have performance
> >> > > > downgrades of refreshing all cpu caches. But this will give correct
> >> > > > results when user type "info threads" after showing "crash>":
> >> > > >
> >> > > > With all cpu cache refreshed:
> >> > > > crash> info threads
> >> > > >   Id   Target Id         Frame
> >> > > >   1    CPU 0             native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   2    CPU 1             native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   3    CPU 2             <unavailable> in ?? ()
> >> > > >   4    CPU 3             <unavailable> in ?? ()
> >> > > >   5    CPU 4             native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   6    CPU 5             <unavailable> in ?? ()
> >> > > >   7    CPU 6             <unavailable> in ?? ()
> >> > > > * 8    CPU 7             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   9    CPU 8             native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   10   CPU 9             <unavailable> in ?? ()
> >> > > >   11   CPU 10            <unavailable> in ?? ()
> >> > > >   12   CPU 11            native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   13   CPU 12            <unavailable> in ?? ()
> >> > > >   14   CPU 13            native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   15   CPU 14            <unavailable> in ?? ()
> >> > > >   16   CPU 15            <unavailable> in ?? ()
> >> > > >   17   CPU 16            <unavailable> in ?? ()
> >> > > >   18   CPU 17            native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   19   CPU 18            native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   20   CPU 19            <unavailable> in ?? ()
> >> > > >   21   CPU 20            <unavailable> in ?? ()
> >> > > >   22   CPU 21            <unavailable> in ?? ()
> >> > > >   23   CPU 22            native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   24   CPU 23            <unavailable> in ?? ()
> >> > > > crash> q
> >> > > >
> >> > > > With only cpu0 refreshed:
> >> > > > crash> info threads
> >> > > >   Id   Target Id         Frame
> >> > > >   1    CPU 0             native_safe_halt () at
> >> > > > arch/x86/include/asm/irqflags.h:54
> >> > > >   2    CPU 1             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   3    CPU 2             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   4    CPU 3             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   5    CPU 4             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   6    CPU 5             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   7    CPU 6             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > > * 8    CPU 7             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   9    CPU 8             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   10   CPU 9             blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   11   CPU 10            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   12   CPU 11            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   13   CPU 12            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   14   CPU 13            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   15   CPU 14            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   16   CPU 15            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   17   CPU 16            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   18   CPU 17            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   19   CPU 18            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   20   CPU 19            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   21   CPU 20            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   22   CPU 21            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   23   CPU 22            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >   24   CPU 23            blk_mq_rq_timed_out (req=0xffff880fdb246000,
> >> > > > reserved=reserved@entry=false) at block/blk-mq.c:640
> >> > > >
> >> > > > As we can see, other cpus[1-6, 8-23] just take the reg cache of
> >> > > > cpu[7], which is incorrect. And if users go further like, "thread 20"
> >> > > > and "gdb bt", it will also give incorrect stack traces.
> >> > > >
> >> > > > The cpu cache will only get refreshed once user type "set <pid>", so
> >> > > > the cpu cache will be refreshed by the <pid> task's context.
> >> > > >
> >> > > > I doubt a user will understand all the details and constraints, so I'm
> >> > > > afraid the user will be confused by the faulty output. But I also have
> >> > > > no objection if the performance is the priority. Basically it is a
> >> > > > balance of pays and gains. In addition, since cmd "info" and "thread"
> >> > > > is a command provided by gdb, currently I don't know how to hack
> >> > > > those, so cpu cache can be refreshed when "info threads" or "thread
> >> > > > <num>" have been invoked.
> >> > > >
> >> > > > Do you have any thoughts?
> >> > >
> >> > > I also had faced that issue initially, ie. the other CPUs using up same
> >> > > regcache, if all are not refreshed.
> >> > > While iterating through all threads, gdb switches it's context
> >> > > temporarily, while crash's context remained same, thus causing gdb to
> >> > > get same registers for all threads other than 0.
> >> > >
> >> > > This was solved in patch #3 (synchronise cpu context changes between
> >> > > crash/gdb)
> >> > > in the ppc's 'Improve stack unwind on ppc64' series, by syncing gdb's
> >> > > context with crash.
> >> > >
> >> > > Can this change in thread.c in gdb-10.2.patch in patch #2 be reverted ?
> >> > > That will fix it.
> >> > >
> >> > > ```
> >> > > ---- gdb-10.2/gdb/thread.c.orig
> >> > > -+++ gdb-10.2/gdb/thread.c
> >> > > -@@ -60,7 +60,7 @@ static thread_info *current_thread_;
> >> > > -
> >> > > - #ifdef CRASH_MERGE
> >> > > - /* Function to set cpu, defined by crash-utility */
> >> > > --extern "C" void set_cpu (int);
> >> > > -+extern "C" void set_cpu(int cpu, int print_context);
> >> > > - #endif
> >> > > -
> >> > > - /* RAII type used to increase / decrease the refcount of each thread
> >> > > -@@ -1098,6 +1098,9 @@ print_thread_info_1 (struct ui_out *uiout, const
> >> > > char *requested_threads,
> >> > > -       uiout->table_body ();
> >> > > -       }
> >> > > -
> >> > > -+#ifdef CRASH_MERGE
> >> > > -+      int current_cpu = current_thread ? current_thread->ptid.tid() : 0;
> >> > > -+#endif
> >> > > -     for (inferior *inf : all_inferiors ())
> >> > > -       for (thread_info *tp : inf->threads ())
> >> > > -       {
> >> > > -@@ -1113,6 +1116,11 @@ print_thread_info_1 (struct ui_out *uiout, const
> >> > > char *requested_threads,
> >> > > -
> >> > > -         ui_out_emit_tuple tuple_emitter (uiout, NULL);
> >> > > -
> >> > > -+        /* Switch to the thread's CPU in crash */
> >> > > -+#ifdef CRASH_MERGE
> >> > > -+        set_cpu(tp->ptid.tid(), FALSE);
> >> > > -+#endif
> >> > > -+
> >> > > ```
> >> > >
> >> > > 'info threads' keeps changing the gdb context temporarily, while it
> >> > > prints each thread's frame and other information.
> >> > > Correspondingly this changes ensured that crash's context also syncs
> >> > > according to change in gdb's context.
> >> > >
> >> > > Then, when user does 'info threads', it will refresh all regcache since
> >> > > it has to go through each thread.
> >> > >
> >> > > In that case that issue of the same frame being printed for all threads
> >> > > should not happen.
> >> > >
> >> > > That should remove the need to refresh all regcache at init time, and
> >> > > leave it to later when user needs it. Refreshing all regcache at init
> >> > > time is the sure shot way that all regcache are correct, but I guess
> >> > > everyone will anyways do a 'info threads', but we can delay the overhead
> >> > > to later. What do you think ?
> >> > >
> >> > > Thanks,
> >> > > Aditya Gupta
> >> > >
> >> > > >
> >> > > > Thanks,
> >> > > > Tao Liu
> >> > > >
> >> > > >
> >> > > >
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Aditya Gupta
> >> > > > >
> >> > > > > >
> >> > > > > >       tt->flags |= TASK_INIT_DONE;
> >> > > > > >  }
> >> > > > > > @@ -5315,7 +5316,7 @@ set_context(ulong task, ulong pid, uint
> >> > > update_gdb_thread)
> >> > > > > >
> >> > > > > >               /* change the selected thread in gdb, according to
> >> > > current context */
> >> > > > > >               if (update_gdb_thread)
> >> > > > > > -                     return gdb_change_cpu_context(tc->processor);
> >> > > > > > +                     return gdb_change_thread_context(tc->task);
> >> > > > > >               else
> >> > > > > >                       return TRUE;
> >> > > > > >       } else {
> >> > > > > > --
> >> > > > > > 2.40.1
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > --
> >> > > Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> >> > > To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> >> > > https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
> >> > > Contribution Guidelines: https://github.com/crash-utility/crash/wiki
> >> > >
--
Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki