On 2024-03-15 15:21:19 [-0400], John B. Wyatt IV wrote: > Hello John, Hi John B., > The real-time team at Red Hat is discussing backporting the rt patchset in 6.6 > to RHEL 9/Stream 9. I decided to test v6.6.20-rt25 from stable-rt with > console_blast.sh. It reported similar NMIs from my testing of 6.7.0-rt6 with > that high cpu count server over uart 8250; which is expected since the patchset > is similar. > > One interesting thing is that 6.7.0-rt6 fully preemptive + realtime tuned profile > did not return any NMIs while 6.6 did with that same configuration. The thing is that console_blast.sh does this "show a backtrace on all CPUs, please" which triggers NMIs on all CPUs for backtrace. I can't imagine how you did obtain the backtraces without an NMI. Unless the tuned profile disables this somehow. > Another aspect I noticed during my testing. I did not set grub to > start with the realtime profile at boot for this machine. When I did set > it the second (and latter) NMI did not show for fully preemptive (the > 3rd set at the bottom of this email). > > Caller info was enabled. No modifications to the source code were made. > > I have not tested previous versions before 6.7.0-rt6 or 6.6.20-rt25; > with the exception of accidently testing 6.6.10-rt19. 6.6.10 also > reported NMIs during this test. If you wish to see these reports please > let me know. This NMI part has nothing todo with printk. If you need this clarified, I would need a reproducer. … > ----------------------------- > NMI Backtrace for 6.6.20-rt25 no forced preemption with tuned realtime profile > ----------------------------- > > [ T2614] Kernel panic - not syncing: sysrq triggered crash > [ C56] NMI backtrace for cpu 56 > [ C56] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.9409.D04.2212261349 12/26/2022 > [ C56] RIP: 0010:io_serial_out (arch/x86/kernel/early_printk.c:105) > [ C56] Code: 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 0f b6 8f c1 00 00 00 89 d0 0f b7 57 08 d3 e6 01 f2 ee <c3> cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 > All code > ======= … > 12: 90 nop > 13: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > 18: 0f b6 8f c1 00 00 00 movzbl 0xc1(%rdi),%ecx > 1f: 89 d0 mov %edx,%eax > 21: 0f b7 57 08 movzwl 0x8(%rdi),%edx > 25: d3 e6 shl %cl,%esi > 27: 01 f2 add %esi,%edx > 29: ee out %al,(%dx) > 2a:* c3 ret <-- trapping instruction where is this output from? The `ret' opcode usually does not cause a trap. My guess is that the machine has been interrupted by an external user at this position. Side note: This is using early_printk, correct? … > [ C56] Call Trace: > [ C56] <NMI> > [ C56] ? nmi_cpu_backtrace (lib/nmi_backtrace.c:115) > [ C56] ? nmi_cpu_backtrace_handler (arch/x86/kernel/apic/hw_nmi.c:47 (discriminator 1)) > [ C56] ? nmi_handle (arch/x86/kernel/nmi.c:149) > [ C56] ? io_serial_out (arch/x86/kernel/early_printk.c:105) > [ C56] ? default_do_nmi (arch/x86/kernel/nmi.c:347) > [ C56] ? exc_nmi (arch/x86/kernel/nmi.c:538) > [ C56] ? end_repeat_nmi (arch/x86/entry/entry_64.S:1458) > [ C56] ? io_serial_out (arch/x86/kernel/early_printk.c:105) > [ C56] ? io_serial_out (arch/x86/kernel/early_printk.c:105) > [ C56] ? io_serial_out (arch/x86/kernel/early_printk.c:105) > [ C56] </NMI> This looks okay. The NMI did the backtrace as expected. > [ C56] <TASK> > [ C56] serial8250_console_putchar (./include/linux/serial_core.h:704 drivers/tty/serial/8250/8250_port.c:3347) > [ C56] ? __pfx_serial8250_console_putchar (drivers/tty/serial/8250/8250_port.c:3343) > [ C56] uart_console_write (drivers/tty/serial/serial_core.c:2134) > [ C56] serial8250_console_write_atomic (drivers/tty/serial/8250/8250_port.c:3628) > [ C56] nbcon_emit_next_record (kernel/printk/nbcon.c:940) > [ C56] __nbcon_atomic_flush_all (kernel/printk/nbcon.c:1192 (discriminator 1) kernel/printk/nbcon.c:1326 (discriminator 1)) > [ C56] vprintk_emit (kernel/printk/printk.c:2414) > [ C56] _printk (kernel/printk/printk.c:2474) > [ C56] panic (./arch/x86/include/asm/bitops.h:207 ./arch/x86/include/asm/bitops.h:239 ./include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/panic.c:528 kernel/panic.c:339) > [ C56] ? _printk (kernel/printk/printk.c:2474) > [ C56] sysrq_handle_crash (drivers/tty/sysrq.c:154) > [ C56] __handle_sysrq (drivers/tty/sysrq.c:601) > [ C56] write_sysrq_trigger (drivers/tty/sysrq.c:1165) > [ C56] proc_reg_write (fs/proc/inode.c:340 fs/proc/inode.c:352) > [ C56] ? preempt_count_add (./include/linux/ftrace.h:974 (discriminator 1) kernel/sched/core.c:5847 (discriminator 1) kernel/sched/core.c:5844 (discriminator 1) kernel/sched/core.c:5872 (discriminator 1)) > [ C56] vfs_write (fs/read_write.c:582) > [ C56] ksys_write (fs/read_write.c:637) > [ C56] do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:81) > [ C56] ? do_dup2 (fs/file.c:1142) > [ C56] ? syscall_exit_to_user_mode (kernel/entry/common.c:299) > [ C56] ? do_syscall_64 (arch/x86/entry/common.c:88) > [ C56] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561) > [ C56] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) According to this, someone issued a `crash' via sysrq. Why? … > [ T2614] ---[ end Kernel panic - not syncing: sysrq triggered crash ]--- yes exactly. … > NMI Backtrace for 6.6.20-rt25 no forced preemption with tuned throughput-performance profile > ----------------------------- This and the following backtrace shows the same picture: The CPU is crashing due to proc/sysrq request and does CPU-backtraces via NMI and polls in early_printk, waiting for the UART to become idle (probably). I don't see an issue here so far. … > -- > Sincerly, > John Wyatt > Software Engineer, Core Kernel Sebastian