----- Original Message -----
>
>
> On Tuesday 24 January 2017 11:53 PM, Dave Anderson wrote:
> >
> > ----- Original Message -----
> >>
> >> On Monday 23 January 2017 11:43 PM, Dave Anderson wrote:
> >>> ----- Original Message -----
> >>>> On Saturday 21 January 2017 02:00 AM, Dave Anderson wrote:
> >>>>> ----- Original Message -----
> >>>>>
> >>>>> ... [cut] ...
> >>>>>
> >>>>>>> Also, the exception frame doesn't even show the [bracketed] type of
> >>>>>>> exception that occurred -- it's just a register dump followed by the
> >>>>>>> remainder of the backtrace. Upon a quick glance, it's not obvious
> >>>>>>> that they are even active tasks. And traditionally, all of the other
> >>>>>>> architectures have always dumped a full trace.
> >>>>>>>
> >>>>>>> I'm not sure what the mechanism is for shutting down the non-active
> >>>>>>> FADUMP tasks, so that's why I asked if you could restrict this change
> >>>>>>> to just those types of dumps. (For that matter, is it even possible
> >>>>>>> to differentiate a real kdump from an FADUMP dumpfile -- aside from a
> >>>>>> Hi Dave,
> >>>>>>
> >>>>>> Differentiating a kdump and fadump dumpfile is not possible except
> >>>>>> that the stack search would invariably fail and ptregs are guaranteed
> >>>>>> to be saved by firmware in case of fadump. Posted v2 that doesn't
> >>>>>> change bt output for anything but active tasks in case of fadump..
> >>>>> Ok, so let me get this straight. The only difference I see with the v2
> >>>>> patch is that fadump non-panicking active tasks change from something
> >>>>> like this:
> >>>>>
> >>>>> PID: 0  TASK: c000000000e7f6d0  CPU: 0  COMMAND: "swapper"
> >>>>>  #0 [c000000000f2ba30] (null) at 3aae291c67 (unreliable)
> >>>>>  #1 [c000000000f2bae0] .tick_dev_program_event at c0000000000d16fc
> >>>>>  #2 [c000000000f2bb90] .__hrtimer_start_range_ns at c0000000000c4bcc
> >>>>>  #3 [c000000000f2bcb0] .tick_nohz_stop_sched_tick at c0000000000d2d30
> >>>>>  #4 [c000000000f2bdc0] .cpu_idle at c000000000015bf0
> >>>>>  #5 [c000000000f2be70] .rest_init at c000000000009de4
> >>>>>  #6 [c000000000f2bef0] .start_kernel at c000000000850eb4
> >>>>>  #7 [c000000000f2bf90] .start_here_common at c0000000000083d8
> >>>>>
> >>>>> to this:
> >>>>>
> >>>>> PID: 0  TASK: c000000000e7f6d0  CPU: 0  COMMAND: "swapper"
> >>>>>  #0 [c000000000f2bd50] (null) at 0 (unreliable)
> >>>>>  #1 [c000000000f2bdc0] .cpu_idle at c000000000015bf0
> >>>>>  #2 [c000000000f2be70] .rest_init at c000000000009de4
> >>>>>  #3 [c000000000f2bef0] .start_kernel at c000000000850eb4
> >>>>>  #4 [c000000000f2bf90] .start_here_common at c0000000000083d8
> >>>>>
> >>>>> But with your v1 patch, you also dumped the exception frame:
> >>>>>
> >>>>> PID: 0  TASK: c000000000e7f6d0  CPU: 0  COMMAND: "swapper"
> >>>>>  R0:  0000000000000000    R1:  c000000000f2bd50    R2:  c000000000f27628
> >>>>>  R3:  0000000000000000    R4:  0000000000000000    R5:  8000000002144400
> >>>>>  R6:  800000001314c4f8    R7:  0000000000000000    R8:  0000000000000000
> >>>>>  R9:  ffffffffffffffff    R10: 0000000000000000    R11: 80003fbff901700c
> >>>>>  R12: 0000000000000000    R13: c000000000ff2500    R14: 0000000001a3fa58
> >>>>>  R15: 00000000002230a8    R16: 0000000000223150    R17: 0000000000223144
> >>>>>  R18: 0000000000c8a098    R19: 0000000002b13a58    R20: 0000000000000000
> >>>>>  R21: 0000000002b135d8    R22: 0000000002b13530    R23: 0000000002280000
> >>>>>  R24: 0000000002b135f0    R25: c000000000fd5c48    R26: c0000000010942f0
> >>>>>  R27: c0000000010942f0    R28: c0000000005fd168    R29: 0000000000000008
> >>>>>  R30: c000000000eb1d68    R31: c000000000f28080
> >>>>>  NIP: c000000000055730    MSR: 8000000000009032    OR3: 0000000000000000
> >>>>>  CTR: 0000000000000000    LR:  c000000000057350    XER: 0000000000000000
> >>>>>  CCR: 0000000024000048    MQ:  0000000000000000    DAR: 000001000ad763b0
> >>>>>  DSISR: 0000000000000000     Syscall Result: 0000000000000000
> >>>>>  NIP [c000000000055730] .plpar_hcall_norets
> >>>>>  LR  [c000000000057350] .pseries_shared_idle_sleep
> >>>>>  #0 [c000000000f2bd50] (null) at 0 (unreliable)
> >>>>>  #1 [c000000000f2bdc0] .cpu_idle at c000000000015bf0
> >>>>>  #2 [c000000000f2be70] .rest_init at c000000000009de4
> >>>>>  #3 [c000000000f2bef0] .start_kernel at c000000000850eb4
> >>>>>  #4 [c000000000f2bf90] .start_here_common at c0000000000083d8
> >>>>>
> >>>>> Again, I don't understand how the non-panicking active tasks are
> >>>>> stopped by the fadump facility, but is it because you cannot
> >>>>> differentiate kdumps from fadumps that you don't show the exception
> >>>>> frame with the v2 patch?
> >>>> Hi Dave,
> >>>>
> >>>> The crashing cpu makes rtas call ibm,os-term to the firmware which
> >>>> saves the regs info of all online cpus. AFAIK, there is no exception
> >>>> frame marker (which we are using to detect one) set for this stack
> >>>> frames by the kernel. With v1, I was printing the registers without
> >>>> looking for exception frame marker, if the registers are saved..
> >>>>
> >>>>> Would it be possible to also show the exception frame type in brackets
> >>>>> and the register dump for those fadump non-panicking active tasks?
> >>>>>
> >>>> Hmmm.. Let me have a hard look at this.
> >>>> Will try and improve this..
> >>> Hari,
> >>>
> >>> I was tinkering around with ppc64_get_dumpfile_stack_frame() from your
> >>> v2 patch, and this seems to work:
> >>>
> >>>          else {
> >>>                  *ksp = pt_regs->gpr[1];
> >>>                  if (IS_KVADDR(*ksp)) {
> >>>                          readmem(*ksp+16, KVADDR, nip, sizeof(ulong),
> >>>                                  "Regs NIP value", FAULT_ON_ERROR);
> >>> +                        ppc64_print_regs(pt_regs);
> >>>                          return TRUE;
> >>>                  } else {
> >>>                          if (IN_TASK_VMA(bt_in->task, *ksp))
> >>>                                  fprintf(fp, "%0lx: Task is running in user space\n",
> >>>                                          bt_in->task);
> >>>                          else
> >>>                                  fprintf(fp, "%0lx: Invalid Stack Pointer %0lx\n",
> >>>                                          bt_in->task, *ksp);
> >>>                          *nip = pt_regs->nip;
> >>>                          ppc64_print_regs(pt_regs);
> >>>                          return FALSE;
> >>>                  }
> >>>          }
> >>>
> >>> And if the task were to have been running in userspace, it already dumps
> >>> the registers in the "else" section above.
> >>>
> >>> I see that the regs->trap is 0, so I understand now that there's nothing
> >>> to translate w/respect to the exception frame type, but a follow-up
> >>> translation of the NIP and LR would at least show that there was some
> >>> kind of hypercall involved. (Whether it can be firmly determined whether
> >>> FADUMP was responsible is another question)
> >>>
> >>>
> >> Hi Dave,
> >>
> >> I did think of it but I was wary considering two register prints like
> >> below, if there is an exception frame..
> >>
> >> PID: 2121  TASK: c0000001af90c600  CPU: 2  COMMAND: "sshd"
> >>  R0:  c0000000003e5280    R1:  c0000001ae047a30    R2:  c000000000fd5a00
> >>  R3:  0000000000000001    R4:  000000000000019e    R5:  000000000000000f
> >>  R6:  0000000000000004    R7:  c0000001ae047bb8    R8:  00000000000b3d9f
> >>  R9:  00000000000000f0    R10: 0000000000000678    R11: c0000000008e0f38
> >>  R12: c0000000003e6310    R13: c00000000b781200    R14: 0000000000000000
> >>  R15: 0000000000000000    R16: 000001000b7dad70    R17: 000000005dfd3c08
> >>  R18: 000000005dfd2838    R19: 00003ffff81eb620    R20: 000000005df74128
> >>  R21: 000001000b7d89a0    R22: 000000000000de4c    R23: 000000005df73b30
> >>  R24: 000000005dfd3c88    R25: 00003ffff81eb428    R26: c0000001ae047bb8
> >>  R27: c0000001b17f4d80    R28: c000000000c60580    R29: 000000000000019e
> >>  R30: 000000000000000f    R31: 000000000000090b
> >>  NIP: 00003fffb6ac8400    MSR: 800000000000d033    OR3: 0000000000000000
> >>  CTR: c0000000003e6310    LR:  c0000000003e493c    XER: 0000000020000000
> >>  CCR: 0000000024004824    MQ:  0000000000000000    DAR: 000001000b7e1640
> >>  DSISR: 0000000002000000     Syscall Result: 0000000000000000
> >>  #0 [c0000001ae047a30] (null) at c0000000fd783c00 (unreliable)
> >>  #1 [c0000001ae047a70] avc_has_perm at c0000000003e5280
> >>  #2 [c0000001ae047b60] sock_has_perm at c0000000003e6238
> >>  #3 [c0000001ae047be0] security_socket_sendmsg at c0000000003e28fc
> >>  #4 [c0000001ae047c30] sock_sendmsg at c00000000072d53c
> >>  #5 [c0000001ae047c60] sock_write_iter at c00000000072d644
> >>  #6 [c0000001ae047d00] __vfs_write at c0000000002ed97c
> >>  #7 [c0000001ae047d90] vfs_write at c0000000002ef328
> >>  #8 [c0000001ae047de0] sys_write at c0000000002f0f00
> >>  #9 [c0000001ae047e30] system_call at c00000000000b184
> >>  System Call [c00] exception frame:
> >>  R0:  0000000000000004    R1:  00003ffff81eb220    R2:  00003fffb6b99800
> >>  R3:  0000000000000003    R4:  000001000b80e3c0    R5:  0000000000000034
> >>  R6:  00003ffff81eb2e4    R7:  000000000000021e    R8:  0000000000000000
> >>  R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000
> >>  R12: 0000000000000000    R13: 00003fffb6497730    R14: 0000000000000000
> >>  R15: 0000000000000000    R16: 000001000b7dad70    R17: 000000005dfd3c08
> >>  R18: 000000005dfd2838    R19: 00003ffff81eb620    R20: 000000005df74128
> >>  R21: 000001000b7d89a0    R22: 000000000000de4c    R23: 000000005df73b30
> >>  R24: 000000005dfd3c88    R25: 00003ffff81eb428    R26: 00003ffff81eb430
> >>  R27: 00003ffff81eb420    R28: 00003ffff81eb424    R29: 00003ffff81eb2e4
> >>  R30: 000001000b80e3c0    R31: 0000000000000034
> >>  NIP: 00003fffb6ac8400    MSR: 800000000000d033    OR3: 0000000000000003
> >>  CTR: 0000000000000000    LR:  000000005df1c3e4    XER: 0000000000000000
> >>  CCR: 0000000044004824    MQ:  0000000000000001    DAR: 00003fffb729c590
> >>  DSISR: 000000000a000000     Syscall Result: 0000000000000000
> >>
> >>
> >> instead of this..
> >>
> >> PID: 2121  TASK: c0000001af90c600  CPU: 2  COMMAND: "sshd"
> >>  #0 [c0000001ae047a30] (null) at c0000000fd783c00 (unreliable)
> >>  #1 [c0000001ae047a70] avc_has_perm at c0000000003e5280
> >>  #2 [c0000001ae047b60] sock_has_perm at c0000000003e6238
> >>  #3 [c0000001ae047be0] security_socket_sendmsg at c0000000003e28fc
> >>  #4 [c0000001ae047c30] sock_sendmsg at c00000000072d53c
> >>  #5 [c0000001ae047c60] sock_write_iter at c00000000072d644
> >>  #6 [c0000001ae047d00] __vfs_write at c0000000002ed97c
> >>  #7 [c0000001ae047d90] vfs_write at c0000000002ef328
> >>  #8 [c0000001ae047de0] sys_write at c0000000002f0f00
> >>  #9 [c0000001ae047e30] system_call at c00000000000b184
> >>  System Call [c00] exception frame:
> >>  R0:  0000000000000004    R1:  00003ffff81eb220    R2:  00003fffb6b99800
> >>  R3:  0000000000000003    R4:  000001000b80e3c0    R5:  0000000000000034
> >>  R6:  00003ffff81eb2e4    R7:  000000000000021e    R8:  0000000000000000
> >>  R9:  0000000000000000    R10: 0000000000000000    R11: 0000000000000000
> >>  R12: 0000000000000000    R13: 00003fffb6497730    R14: 0000000000000000
> >>  R15: 0000000000000000    R16: 000001000b7dad70    R17: 000000005dfd3c08
> >>  R18: 000000005dfd2838    R19: 00003ffff81eb620    R20: 000000005df74128
> >>  R21: 000001000b7d89a0    R22: 000000000000de4c    R23: 000000005df73b30
> >>  R24: 000000005dfd3c88    R25: 00003ffff81eb428    R26: 00003ffff81eb430
> >>  R27: 00003ffff81eb420    R28: 00003ffff81eb424    R29: 00003ffff81eb2e4
> >>  R30: 000001000b80e3c0    R31: 0000000000000034
> >>  NIP: 00003fffb6ac8400    MSR: 800000000000d033    OR3: 0000000000000003
> >>  CTR: 0000000000000000    LR:  000000005df1c3e4    XER: 0000000000000000
> >>  CCR: 0000000044004824    MQ:  0000000000000001    DAR: 00003fffb729c590
> >>  DSISR: 000000000a000000     Syscall Result: 0000000000000000
> >>
> >>
> >> On second thought, that may not be bad after all??
> >> So, I am ok with the change you propose.
> > Hmmm, except that in the "sshd" sample showing the firmware-generated
> > eframe, and which the task was presumably running in kernel space when
> > firmware took over (?), it has a userspace NIP of 00003fffb6ac8400.
> > What's happening there?
> >
>
> IIUC, NIP 00003fffb6ac8400 must have caused the exception (system call
> in this case), and the backtrace shows the kernel call stack following
> the system call?
>
> Thanks
> Hari
>

All right, I'll check in your v2 patch along with the one-line addition
to display the exception frame. For now we'll skip the NIP and LR
translation, since in the case above, it's somewhat confusing.

Thanks,
  Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility