----- Original Message -----
> Hello Dave,
>
> On Mon, 2011-09-19 at 11:05 -0400, Dave Anderson wrote:
> > > WARNING: multiple active tasks have called die and/or panic
> > > WARNING: multiple active tasks have called die
> > >
> > > In task.c we call "foreach bt -t" and check if we find "die" on the stack. When
> > > doing this on s390 with the "-t" option normally we find multiple die() calls
> > > for one single task:
> > >
> > > crash> foreach bt -t | grep "die at"
> > > [ 9ca7f7f0] die at 100f26
> > > [ 9ca7f8f0] die at 100f26
> > > [ 9ca7f9b8] die at 100f26
> > > [ 9ca7fa40] die at 100ee6
> > > [ 9ca7fa90] die at 100f26
> > >
> > > The current code then assumes that multiple tasks have called die().
> > >
> > > This patch fixes this problem by an additional check that allows multiple
> > > occurrences of the die() call on the stack (with bt -t) for one task.
> >
> > Strange -- has this always happened on s390's?
>
> I don't think so, although I have seen that warning already several
> times in the past. But until now I did not have the time to look into
> this issue.
>
> > And I wonder why there are multiple instances on the stack?
>
> I think the reason is the -t option. It just finds multiple instances of
> addresses that point to the die() function on the stack. I don't know
> the exact reason, but the compiler can place whatever it wants to the
> stack.
>
> The current stack pointer is 9d7d3768 and the stack area is [9d7d3768 - 9d7d4000]:
>
> crash> bt -t | grep die
> [ 9d7d38b8] die at 100f26
> [ 9d7d3988] die at 100f26
> [ 9d7d3a40] die at 100ee6
> [ 9d7d3a90] die at 100f26
>
> crash> rd 9d7d38b8
> 9d7d38b8: 0000000000100f26 .......&
> crash> rd 9d7d3988
> 9d7d3988: 0000000000100f26 .......&
> crash> rd 9d7d3a40
> 9d7d3a40: 0000000000100ee6 ........
> crash> rd 9d7d3a90
> 9d7d3a90: 0000000000100f26 .......&

Given that the kernel cannot return from die() if panic_on_oops is set,
and given the s390's PAGE_OFFSET of 0, I wonder if those values of 100f26 and
100ee6 possibly just fall into die()'s address range by "dumb luck", but have
other uses/meanings? If you disassemble the die() function, you could at
least verify whether they are return addresses or not.

>
> > What does the actual backtrace look like?
>
> The "normal" backtrace looks like the following:
>
> crash> bt
> PID: 10 TASK: 9d7bdba0 CPU: 0 COMMAND: "kworker/0:1"
> LOWCORE INFO:
> -psw : 0x0400100180000000 0x0000000000114630
> -function : store_status at 114630
> -prefix : 0x7ff08000
> -cpu timer: 0x7fff15c0 0x0066b7fa
> -clock cmp: 0x0066b7fa 0000000000
> -general registers:
> 000000000000000000 0x00000000001060a0
> 0x0400000180000000 0x000000009cb1ec00
> 0x000000000011d48c 0x0000000000000040
> 000000000000000000 0x00000000009c8c68
> 0x000000009cb1ec00 0x000000000011d4ac
> 0x000000009cb1ec00 0x000000000011dc18
> 0x000000009cb1ec00 0x00000000005b9870
> 0x0000000000111d08 0x000000009d7d3768
> -access registers:
> 0x000003ff 0xfd3f76f0 0000000000 0000000000
> 0000000000 0000000000 0000000000 0000000000
> 0000000000 0000000000 0000000000 0000000000
> 0000000000 0000000000 0000000000 0000000000
> -control registers:
> 0x0000000004046e12 0x00000000009c2007
> 0x0000000000011140 000000000000000000
> 0x000000000000000a 0x0000000000011140
> 0x0000000051000000 0x00000000009c2007
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 0x00000000901bc1c7
> 0x00000000db000000 000000000000000000
> -floating point registers 0,2,4,6:
> 0x4048000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
> 000000000000000000 000000000000000000
>
> #0 [9d7d37a8] __machine_kexec at 11d4fa
> #1 [9d7d37f0] smp_switch_to_ipl_cpu at 116ebe
> #2 [9d7d3860] machine_kexec at 11d49c
> #3 [9d7d3890] crash_kexec at 19ab26
> #4 [9d7d3960] panic at 5af192
> #5 [9d7d3a08] die at 100f26
> #6 [9d7d3a70] do_no_context at 11e910
> #7 [9d7d3aa8] do_protection_exception at 5b551a
> #8 [9d7d3bc0] pgm_exit at 5b34b8
> PSW: 0404100180000000 0000000000402d04 (sysrq_handle_crash+16)
> GPRS: 0000000000010000 00000000009c8c74 0000000000000001 0000000000000000
> 00000000005af34e 00000000009c90e4 000000000091d3b0 0000000000a67960
> 070000000016b628 0000000000000001 0000000000959530 0000000000000063
> 00000000009596d0 0000000000606c60 000000000040309c 000000009d7d3d08
> #0 [9d7d3d70] process_one_work at 166abe
> #1 [9d7d3dd8] worker_thread at 1672da
> #2 [9d7d3e50] kthread at 1705b6
> #3 [9d7d3eb8] kernel_thread_starter at 5b2e3a
>
>
> > In any case, I guess the patch makes sense,
> > although I wonder why nobody else has ever reported this.
>
> I assume that everybody has just ignored the warning...
>
> > By any chance, given that this must be zdump-type dumpfile (?), does
> > the "dh_cpu_id" member in the header correlate to the panic cpu?
>
> Not necessarily. We have code that switches to the original boot CPU in
> case of panic. So the dumping CPU normally is not the CPU that called panic().
>
> > Or is there any other way that the panic'ing task can be ascertained from
> > "S390D" dumpfiles such that get_dumpfile_panic_task() can do the job?
>
> Hmmm, I don't think so. Probably the only way is to search die or panic
> on the stack. Perhaps we can do that without the -t option?

OK. The -t option is a fall-back if the dumpfile format doesn't offer
a better way, or if the arch-specific kernel kdump code doesn't have a
"crashing_cpu" variable. It's usually fool-proof, but again, with the
s390's zero PAGE_OFFSET, random numbers can be confused for kernel
virtual addresses.
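
For illustration only, the per-task idea behind the fix can be sketched outside of crash as a small Python script that reads saved "foreach bt -t" output and counts tasks rather than matching lines. This is not the code in task.c: the function name tasks_calling(), the regular expressions, and the second task in the sample (PID 11, TASK 9d7be000) are assumptions made up for the example. Only the PID 10 header and its four "die" lines come from the output quoted above.

```python
import re

# Header line that starts each task's backtrace, e.g.
# 'PID: 10 TASK: 9d7bdba0 CPU: 0 COMMAND: "kworker/0:1"'.
TASK_HEADER = re.compile(r"^PID:\s*(\d+)\s+TASK:\s*([0-9a-f]+)")
# Text-symbol line from "bt -t", e.g. "[ 9d7d38b8] die at 100f26".
SYMBOL_LINE = re.compile(r"\[\s*[0-9a-f]+\]\s+(\w+) at [0-9a-f]+")


def tasks_calling(output, symbols=("die", "panic")):
    """Return the set of task addresses whose stack text mentions a symbol.

    Hypothetical helper: repeated hits inside one task collapse into a
    single entry, so only the number of distinct tasks is reported.
    """
    hits = set()
    current = None  # task address of the backtrace being scanned
    for line in output.splitlines():
        header = TASK_HEADER.match(line.strip())
        if header:
            current = header.group(2)
            continue
        sym = SYMBOL_LINE.search(line)
        if sym and current is not None and sym.group(1) in symbols:
            hits.add(current)
    return hits


# First task: lines taken from the thread. Second task: invented filler.
SAMPLE = """\
PID: 10 TASK: 9d7bdba0 CPU: 0 COMMAND: "kworker/0:1"
[ 9d7d38b8] die at 100f26
[ 9d7d3988] die at 100f26
[ 9d7d3a40] die at 100ee6
[ 9d7d3a90] die at 100f26
PID: 11 TASK: 9d7be000 CPU: 1 COMMAND: "kworker/1:1"
[ 9d7e3e50] kthread at 1705b6
"""

if __name__ == "__main__":
    panic_tasks = tasks_calling(SAMPLE)
    if len(panic_tasks) > 1:
        print("WARNING: multiple active tasks have called die and/or panic")
    else:
        print("tasks with die/panic on the stack:", sorted(panic_tasks))
```

Counting distinct tasks means four stale "die" text addresses on one stack no longer look like four panicking tasks. It does not help with the other concern, that a random stack value can fall inside die()'s address range by chance.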
Anyway, your patch does close a hole with the fall-back scheme, so
it is queued for crash-5.1.9.

Thanks,
  Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility