Hi Aditya,

On Wed, Mar 13, 2024 at 5:33 PM Aditya Gupta <adityag@xxxxxxxxxxxxx> wrote:
>
> Hi Tao,
> On Wed, Mar 13, 2024 at 11:23:56AM +0800, Tao Liu wrote:
> > Hi Aditya,
> >
> > On Tue, Mar 12, 2024 at 5:12 PM Aditya Gupta <adityag@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi Tao,
> > >
> > > > <...>
> > > >
> > > > +crash_target *target = NULL;
> > > > +
> > > >  void
> > > >  crash_target_init (void)
> > > >  {
> > > >    int nr_cpus = crash_get_nr_cpus();
> > > > -  crash_target *target = new crash_target ();
> > > > +  target = new crash_target ();
> > > >
> > > >    /* Own the target until it is successfully pushed. */
> > > >    target_ops_up target_holder (target);
> > > > @@ -137,29 +139,35 @@ crash_target_init (void)
> > > >    reinit_frame_cache ();
> > > >  }
> > > >
> > > > -/*
> > > > - * Change gdb's thread context to the thread on given CPU
> > > > - **/
> > > >  extern "C" int
> > > > -gdb_change_cpu_context(unsigned int cpu)
> > > > +gdb_change_thread_context (ulong task)
> > > >  {
> > > > -  ptid_t ptid = ptid_t(CRASH_INFERIOR_PID, 0, cpu);
> > > > -  inferior *inf = current_inferior ();
> > > > -  thread_info *tp = find_thread_ptid (inf, ptid);
> > > > -
> > > > -  if (tp == nullptr)
> > > > +  int tried = 0;
> > > > +  inferior* inf = current_inferior();
> > > > +  int cpu = crash_set_thread(task);
> > > > +  if (cpu < 0)
> > > >      return FALSE;
> > > >
> > > > -  /* Making sure that crash's context is same */
> > > > -  set_cpu(cpu, FALSE);
> > > > -
> > > > -  /* Switch to the thread */
> > > > -  switch_to_thread(tp);
> > > > -
> > > > -  /* Fetch/Refresh thread's registers */
> > > > -  gdb_refresh_regcache(cpu);
> > > > +  ptid_t ptid = ptid_t(CRASH_INFERIOR_PID, 0, cpu);
> > > >
> > > > -  return TRUE;
> > > > +retry:
> > > > +  thread_info *tp = find_thread_ptid (inf, ptid);
> > > > +  if (tp == nullptr && !tried) {
> > > > +    thread_info *thread = add_thread_silent(target,
> > > > +                            ptid_t(CRASH_INFERIOR_PID, 0, cpu));
> > >
> > > A minor comment.
> > > Is it possible to do it without needing 'target' as a global? I see there's
> > > 'inf->current_top_target()', but its return type is target_ops, so I don't
> > > know if it can be used.
> > >
> > Thanks for your comments, however I guess that is not workable for this
> > code change. What we want is to add a thread to the target. I found one
> > function, "target.c:new_thread", which could be used as
> > "new_thread(current_inferior(), ptid_t(CRASH_INFERIOR_PID, 0, cpu));"
> > to replace the "add_thread_silent()" code, except we would need to make
> > "new_thread" non-static (see the sketch further below).
> >
> > I prefer to keep the code as it is, because I don't see any side effect
> > of making target a global variable. What do you think?
>
> Then there doesn't seem to be an alternative, so I guess we will have to
> keep the global variable; seems okay to me.
>
> > >
> > > > <...>
> > > >
> > > >  /*
> > > >   * Collect the irq_desc[] entry along with its associated handler and
> > > > diff --git a/task.c b/task.c
> > > > index a405b05..ef79f53 100644
> > > > --- a/task.c
> > > > +++ b/task.c
> > > > @@ -715,7 +715,8 @@ task_init(void)
> > > >           * crash_target::fetch_registers, so CPU 0's registers are shown as
> > > >           * <unavailable> in gdb mode
> > > >           * */
> > > > -        gdb_refresh_regcache(0);
> > > > +        for (int i = 0; i < get_cpus_online(); i++)
> > > > +                gdb_refresh_regcache(i);
> > >
> > > Does x86 require all regcache(s) to be refreshed at init time ?
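(For reference, the new_thread() variant discussed above would look roughly like the following. This is an untested sketch: it assumes gdb's static helper were exported with the signature implied by the call quoted above, which is exactly the "make new_thread non-static" prerequisite already mentioned.)

```
/* Untested sketch: instead of keeping the global 'target' pointer for
 * add_thread_silent(), reuse gdb's own new_thread() helper.  It is
 * currently static inside gdb, so it would first need a visible
 * declaration along these lines (signature assumed from the call below): */
thread_info *new_thread (inferior *inf, ptid_t ptid);

/* ...then, inside gdb_change_thread_context(), the add-thread fallback
 * could become: */
thread_info *thread = new_thread (current_inferior (),
                                  ptid_t (CRASH_INFERIOR_PID, 0, cpu));
```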
> > >
> > > Till v4 of my ppc series, I was also refreshing all CPUs, but it's better to do it when required; there was a review comment on v4, patch 5 from Lianbo, quoting it here:
> > >
> > > > In addition, I haven't tested how far it takes the time in the multi-core environment, is it possible to implement a smaller granularity refresh operation? For example: When we need switch to another cpu, execute the refresh operation for corresponding regcache.
> > >
> > > CPU 0's refresh is required because gdb has already tried to initialise registers before we initialised crash_target.
> > >
> > > This might add a small but noticeable delay on vmcores from big systems.
> > >
> > > It adds around 4.6 seconds to crash init time for a ppc64 vmcore with 64 CPUs, while reading it on a fast 48-cpu Power8 system.
> > > The overhead is smaller for systems with fewer CPUs, e.g. ~0.43 seconds on a vmcore of an 8-CPU x86_64 system.
> >
> > Yes, thanks for the testing results. There is a performance downgrade from refreshing all cpu caches. But it gives correct results when the user types "info threads" right after "crash>" is shown:
> >
> > With all cpu caches refreshed:
> > crash> info threads
> >   Id   Target Id     Frame
> >   1    CPU 0         native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   2    CPU 1         native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   3    CPU 2         <unavailable> in ?? ()
> >   4    CPU 3         <unavailable> in ?? ()
> >   5    CPU 4         native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   6    CPU 5         <unavailable> in ?? ()
> >   7    CPU 6         <unavailable> in ?? ()
> > * 8    CPU 7         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   9    CPU 8         native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   10   CPU 9         <unavailable> in ?? ()
> >   11   CPU 10        <unavailable> in ?? ()
> >   12   CPU 11        native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   13   CPU 12        <unavailable> in ?? ()
> >   14   CPU 13        native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   15   CPU 14        <unavailable> in ?? ()
> >   16   CPU 15        <unavailable> in ?? ()
> >   17   CPU 16        <unavailable> in ?? ()
> >   18   CPU 17        native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   19   CPU 18        native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   20   CPU 19        <unavailable> in ?? ()
> >   21   CPU 20        <unavailable> in ?? ()
> >   22   CPU 21        <unavailable> in ?? ()
> >   23   CPU 22        native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   24   CPU 23        <unavailable> in ?? ()
> > crash> q
> >
> > With only cpu0 refreshed:
> > crash> info threads
> >   Id   Target Id     Frame
> >   1    CPU 0         native_safe_halt () at arch/x86/include/asm/irqflags.h:54
> >   2    CPU 1         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   3    CPU 2         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   4    CPU 3         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   5    CPU 4         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   6    CPU 5         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   7    CPU 6         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> > * 8    CPU 7         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   9    CPU 8         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   10   CPU 9         blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   11   CPU 10        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   12   CPU 11        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   13   CPU 12        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   14   CPU 13        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   15   CPU 14        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   16   CPU 15        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   17   CPU 16        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   18   CPU 17        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   19   CPU 18        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   20   CPU 19        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   21   CPU 20        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   22   CPU 21        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   23   CPU 22        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >   24   CPU 23        blk_mq_rq_timed_out (req=0xffff880fdb246000, reserved=reserved@entry=false) at block/blk-mq.c:640
> >
> > As we can see, the other cpus [1-6, 8-23] just take the regcache of cpu [7], which is incorrect. And if users go further, e.g. "thread 20" and then "gdb bt", they will also get incorrect stack traces.
> >
> > The cpu cache only gets refreshed once the user types "set <pid>", so the cpu cache will be refreshed with the <pid> task's context.
> >
> > I doubt a user will understand all the details and constraints, so I'm afraid the user will be confused by the faulty output. But I also have no objection if performance is the priority. Basically it is a balance of costs and gains.
> > In addition, since "info" and "thread" are commands provided by gdb, I currently don't know how to hook into them so that the cpu cache can be refreshed when "info threads" or "thread <num>" is invoked.
> >
> > Do you have any thoughts?
>
> I had also faced that issue initially, i.e. the other CPUs using the same regcache if all are not refreshed.
> While iterating through all threads, gdb switches its context temporarily while crash's context remains the same, thus causing gdb to get the same registers for all threads other than 0.
>
> This was solved in patch #3 (synchronise cpu context changes between crash/gdb) of the ppc 'Improve stack unwind on ppc64' series, by syncing gdb's context with crash's.
>
> Can this change to thread.c in gdb-10.2.patch in patch #2 be reverted ?
> That will fix it.

Could you share your patch, based on your v10 and my v1 patch series, so I can get a clue how to do this? I tried but was unsuccessful. Since I have changed your #3 patch a bit in my v1 patch series, such as gdb_change_cpu_context() -> gdb_change_thread_context(), I doubt that's the reason for the failure.

What I did was keep "set_cpu" in thread.c:thread_command() as gdb-10.2.patch describes in your #3 patch. But only one thread gets refreshed when I invoke "thread X", and no regcache is refreshed when I invoke "info threads".

Thanks,
Tao Liu

>
> ```
> ---- gdb-10.2/gdb/thread.c.orig
> -+++ gdb-10.2/gdb/thread.c
> -@@ -60,7 +60,7 @@ static thread_info *current_thread_;
> -
> - #ifdef CRASH_MERGE
> - /* Function to set cpu, defined by crash-utility */
> --extern "C" void set_cpu (int);
> -+extern "C" void set_cpu(int cpu, int print_context);
> - #endif
> -
> - /* RAII type used to increase / decrease the refcount of each thread
> -@@ -1098,6 +1098,9 @@ print_thread_info_1 (struct ui_out *uiout, const char *requested_threads,
> -       uiout->table_body ();
> -     }
> -
> -+#ifdef CRASH_MERGE
> -+  int current_cpu = current_thread ? current_thread->ptid.tid() : 0;
> -+#endif
> -   for (inferior *inf : all_inferiors ())
> -     for (thread_info *tp : inf->threads ())
> -       {
> -@@ -1113,6 +1116,11 @@ print_thread_info_1 (struct ui_out *uiout, const char *requested_threads,
> -
> -       ui_out_emit_tuple tuple_emitter (uiout, NULL);
> -
> -+      /* Switch to the thread's CPU in crash */
> -+#ifdef CRASH_MERGE
> -+      set_cpu(tp->ptid.tid(), FALSE);
> -+#endif
> -+
> ```
>
> 'info threads' keeps changing the gdb context temporarily while it prints each thread's frame and other information.
> Correspondingly, this change ensured that crash's context also syncs with the change in gdb's context.
>
> Then, when the user does 'info threads', it will refresh all regcaches, since it has to go through each thread.
>
> In that case, the issue of the same frame being printed for all threads should not happen.
>
> That should remove the need to refresh all regcaches at init time, and leave it for later, when the user needs it. Refreshing all regcaches at init time is the sure-shot way to make sure all regcaches are correct, but I guess everyone will do an 'info threads' anyway, so we can delay the overhead until then. What do you think ?
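To make the "delay the overhead to later" idea concrete, here is a rough, untested sketch of what an on-demand per-CPU refresh could look like on the crash side. gdb_refresh_regcache() and get_cpus_online() are the existing helpers used in the hunks above; the regcache_refreshed[] bookkeeping and the helper name are hypothetical:

```
/*
 * Untested sketch of the "smaller granularity" refresh suggested above:
 * remember which CPUs already had their regcache populated, and refresh
 * a CPU only the first time its thread is actually selected, instead of
 * looping over every online CPU in task_init().
 * (regcache_refreshed[] is illustrative bookkeeping, not existing code.)
 */
static char regcache_refreshed[NR_CPUS];

static void
refresh_regcache_once(int cpu)
{
	if (cpu < 0 || cpu >= get_cpus_online())
		return;

	if (!regcache_refreshed[cpu]) {
		gdb_refresh_regcache(cpu);
		regcache_refreshed[cpu] = 1;
	}
}
```

With something like that, task_init() would only need refresh_regcache_once(0) for the CPU 0 case, and the set_cpu()/gdb_change_thread_context() paths could call it for the other CPUs the first time 'info threads' or 'set <pid>' actually reaches them.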
>
> Thanks,
> Aditya Gupta
>
> >
> > Thanks,
> > Tao Liu
> >
> >
> > >
> > > Thanks,
> > > Aditya Gupta
> > >
> > >
> > > >        tt->flags |= TASK_INIT_DONE;
> > > >  }
> > > > @@ -5315,7 +5316,7 @@ set_context(ulong task, ulong pid, uint update_gdb_thread)
> > > >
> > > >                /* change the selected thread in gdb, according to current context */
> > > >                if (update_gdb_thread)
> > > > -                      return gdb_change_cpu_context(tc->processor);
> > > > +                      return gdb_change_thread_context(tc->task);
> > > >                else
> > > >                        return TRUE;
> > > >        } else {
> > > > --
> > > > 2.40.1
> > > >
> > >
> >
>
--
Crash-utility mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxxxxxx
https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/
Contribution Guidelines: https://github.com/crash-utility/crash/wiki