Re: [PATCH] crash: Do not use bt -t flag in panic_search()

And the subsequent LIVE_DUMP check in panic_search() has also been queued for crash-7.1.3:

  https://github.com/crash-utility/crash/commit/9681db206bb356c9d7c662c847c41879656f541b

Dave
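
The lookup order described in the quoted thread below, with the two LIVE_DUMP checks in place, can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code from the commits above: struct ctx, its fields, and both function names are hypothetical stand-ins for crash's internal state, and the real panic_search() returns a task context pointer rather than a task address.

```c
#include <assert.h>
#include <stddef.h>

#define NO_TASK 0UL

/* Hypothetical stand-in for crash's state; each field models one lookup result. */
struct ctx {
    int live_dump;              /* models (pc->flags2 & LIVE_DUMP) */
    unsigned long header_task;  /* result of the header/"crashing_cpu" lookups */
    unsigned long active_task;  /* result of the "bt -r" scan of active cpus */
    unsigned long stack_task;   /* result of the "bt -t" scan of all stacks */
};

/* Models get_dumpfile_panic_task(): cheap lookups first, raw stack scan last. */
unsigned long dumpfile_panic_task_sketch(const struct ctx *c)
{
    assert(c != NULL);
    if (c->header_task != NO_TASK)
        return c->header_task;  /* e.g. the "crash" task of a snap.so dump */
    if (c->live_dump)
        return NO_TASK;         /* check sits just before the active-set scan */
    return c->active_task;
}

/* Models panic_search(): the exhaustive scan is the last resort. */
unsigned long panic_search_sketch(const struct ctx *c)
{
    unsigned long task = dumpfile_panic_task_sketch(c);

    if (task != NO_TASK)
        return task;
    if (c->live_dump)
        return NO_TASK;         /* caller defaults to the idle task on cpu 0 */
    return c->stack_task;
}
```

Placing the check after the header-based lookups keeps the snap.so case working, while still skipping both stack scans for live dumps that have no panic task.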
 
----- Original Message -----
> 
> The LIVE_DUMP check in get_dumpfile_panic_task() has been queued for
> crash-7.1.3:
>   
>   https://github.com/crash-utility/crash/commit/a640cbb1b566a7babd5ed6558c9726b2bbf7280c
> 
> Dave
> 
> 
> ----- Original Message -----
> > 
> > 
> > ----- Original Message -----
> > > On Mon, 10 Aug 2015 10:32:12 -0400 (EDT)
> > > Dave Anderson <anderson@xxxxxxxxxx> wrote:
> > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > 
> > > > > On Thu, 6 Aug 2015 11:25:29 -0400 (EDT)
> > > > > Dave Anderson <anderson@xxxxxxxxxx> wrote:
> > > > > 
> > > > > > > Re: your dumpfile where the erroneous "panic" address in a random user
> > > > > > > task's exception frame register set gets picked up by mistake.
> > > > > > 
> > > > > > Your original patch request modified the "bt" command used for the
> > > > > > kernel stack searches in panic_search().  But that piece of code
> > > > > > is the last-ditch effort for finding a panic task, which follows
> > > > > > this path:
> > > > > > 
> > > > > >   get_panic_context()
> > > > > >     panic_search()
> > > > > >       get_dumpfile_panic_task()
> > > > > > >         get_kdump_panic_task()       (requires kdump "crashing_cpu" symbol)
> > > > > > >         get_diskdump_panic_task()    (requires kdump "crashing_cpu" symbol)
> > > > > 
> > > > > On s390 we don't have the "crashing_cpu" symbol in the kernel.
> > > > > 
> > > > > > >         get_active_set_panic_task()  (bt -r raw stack dump of active cpus)
> > > > > >     ...
> > > > > >       
> > > > > > > Only if all of the above fail does panic_search() initiate the
> > > > > > exhaustive walkthrough of all kernel stacks for evidence.
> > > > > > 
> > > > > > Since you have gotten that far, I'm wondering whether your
> > > > > > target dumpfile with the faulty "panic" address is from an
> > > > > > s390x "live dump"?  In that case, there can never be any task
> > > > > > with any such evidence, making the backtrace search a waste of
> > > > > > time to begin with.
> > > > > 
> > > > > > The "problem" dump is an s390 stand-alone dump of a hanging system.
> > > > > > All CPUs were in "psw_idle" when the dump was generated:
> > > > > 
> > > > > PID: 0      TASK: c50f38            CPU: 0   COMMAND: "swapper/0"
> > > > >  LOWCORE INFO:
> > > > >   -psw      : 0x0706c00180000000 0x000000000084410e
> > > > >   -function : psw_idle at 84410e
> > > > > 
> > > > > [snip]
> > > > > 
> > > > >  #0 [00c1fe70] arch_cpu_idle at 104d4a
> > > > >  #1 [00c1fe90] cpu_startup_entry at 180430
> > > > >  #2 [00c1fee8] start_kernel at d1fb10
> > > > >  #3 [00c1ff60] _stext at 100020
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > And if so, I'm thinking that since s390x will have set the LIVE_DUMP
> > > > > > > flag, if get_dumpfile_panic_task() returns NO_TASK, then
> > > > > > panic_search() should just return a NULL to get_panic_context()
> > > > > > if it's a live dump, which will just default to the idle task on
> > > > > > cpu 0.
> > > > > 
> > > > > > Although it does not solve the above problem, it makes sense for
> > > > > > live dumps. What about the following patch?
> > > > > ---
> > > > > crash: do not search panic tasks for live dumps
> > > > > 
> > > > > > Always return "NO_TASK" if the "LIVE_DUMP" flag is set because live dumps
> > > > > > cannot have a panic task.
> > > > > 
> > > > > Signed-off-by: Michael Holzheu <holzheu@xxxxxxxxxxxxxxxxxx>
> > > > > ---
> > > > >  task.c |    5 ++++-
> > > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > > 
> > > > > --- a/task.c
> > > > > +++ b/task.c
> > > > > @@ -6726,7 +6726,10 @@ get_dumpfile_panic_task(void)
> > > > >  {
> > > > >  	ulong task;
> > > > >  
> > > > > -	if (NETDUMP_DUMPFILE()) {
> > > > > +	if (pc->flags2 & LIVE_DUMP) {
> > > > > +		/* No panic task because system itself created the dump */
> > > > > +		return NO_TASK;
> > > > > +	} else if (NETDUMP_DUMPFILE()) {
> > > > >  		task = pc->flags & REM_NETDUMP ?
> > > > >  			tt->panic_task : get_netdump_panic_task();
> > > > >  		if (task)
> > > > > 
> > > > 
> > > > That makes sense, but I'm going to move the LIVE_DUMP check farther down
> > > > in get_dumpfile_panic_task() to just before the get_active_set() call.
> > > > 
> > > 
> > > Makes sense. That was also my first idea.
> > > 
> > > > The reason for that is that another type of "LIVE_DUMP" comes from the
> > > > snap.so extension module, and in that case, get_kdump_panic_task() finds
> > > > and returns the "crash" task that was running the snap command on the
> > > > live system.
> > > > 
> > > > Clarify something else for me: are there actually two types of live dumps
> > > > that can be taken by an s390x?  There is the "zgetdump" facility, but is
> > > > there also another type that is taken by the firmware and/or the
> > > > hypervisor?
> > > 
> > > With the zgetdump tool we create live dumps from /dev/mem or /dev/crash.
> > > These dumps get the LIVE_DUMP flag indicating that data is not
> > > consistent.
> > > 
> > > Besides this, we have two other non-disruptive live dump features:
> > > 
> > >   - VMDUMP for z/VM guests
> > >   - Virsh dump for KVM guests
> > > 
> > > In contrast to the zgetdump method, here the guest system is stopped
> > > to get consistent snapshots. Therefore I think it is fine to *not* set
> > > the LIVE_DUMP flag.
> > > 
> > > Besides those live dump mechanisms (and kdump), we have our stand-alone
> > > dump tools for DASD and SCSI. These dump methods are also "Linux
> > > independent" and therefore can produce dumps without panic tasks.
> > > 
> > > You can read more on s390 dump in the documents below:
> > > 
> > >  * http://www.vm.ibm.com/education/lvc/LVC1219.pdf
> > >  * http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaaf/lnz_r_dt.html?cp=linuxonibm%2F0-4-0-1
> > > 
> > > Michael
> > 
> > OK, so from what I understand, there still can be s390x dumpfiles which have
> > no indication of the panic task or cpu (if there is one) in their headers,
> > and therefore may try the "bt -r" type search of the active tasks via
> > raw_stack_dump() in get_active_set_panic_task(), and if that fails, fall back
> > to the "bt -t" search of all tasks in panic_search().
> > 
> > In those cases, I suppose you could:
> > 
> >  (1) restrict the raw_stack_dump() parameters in get_active_set_panic_task()
> >      to exclude the user register dump at the top of the stack, and
> >  (2) plug in a MACHDEP_BT_TEXT handler for the s390x instead of using the
> >      generic version, and in that case, could prevent the search from
> >      entering the user-space register dump at the top of the stack, or
> > (2a) replace "bt -t" with just "bt" in panic_search() for s390x as you did
> >      in the original patch.
> > 
> > But (1) and (2) are not fool-proof, because even the kernel-only part of the
> > stack could simply contain "numbers" that by dumb luck fall into the
> > zero-based virtual address range of panic, crash_kexec, etc., and return a
> > false positive.  So I don't know how that can be made absolutely reliable.
> > 
> > But at least with dumpfiles that have the live dump magic number (and I'm
> > still not clear which of the 4 types do so), the simple LIVE_DUMP-check patch
> > covers them.  I'm not sure whether it's worth doing anything beyond that.
> > 
> > Dave
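
The false-positive concern in the message above can be illustrated with a small sketch. It only models the idea of matching raw stack words against a symbol's text range; it is not how crash implements "bt -t" or raw_stack_dump(). struct text_range, count_hits(), and any address values passed to it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical text range of a symbol such as panic(); not taken from a real kernel. */
struct text_range {
    unsigned long start;  /* first address of the function */
    unsigned long end;    /* one past the last address */
};

/*
 * Count the stack words with index in [lo, hi) whose value falls inside
 * the symbol's text range.  Any non-zero result would be treated as
 * "evidence" of a panic by a naive search.
 */
size_t count_hits(const unsigned long *stack, size_t lo, size_t hi,
                  struct text_range sym)
{
    size_t hits = 0;

    assert(stack != NULL && lo <= hi);
    for (size_t i = lo; i < hi; i++) {
        if (stack[i] >= sym.start && stack[i] < sym.end)
            hits++;
    }
    return hits;
}
```

Narrowing [lo, hi) to leave out the user register save area corresponds to options (1) and (2): it removes hits that come from user-space register values, but a plain data word in the kernel part of the stack that happens to fall in the range is still counted, which is why neither option is fool-proof.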

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility


