[+ kexec@xxxxxxxxxxxxxxxxxxx]

The debugging progress so far...

Waiting up to 5 minutes for the other CPUs to stop in crash_smp_send_stop() made no difference.

With the "dev" branch of this tree [1], it is possible to print messages from purgatory by passing something like "--port=0x602B0000 --port-lsr=0x602B0000,0x80" to kexec. However, even enable_dcache() in setup_arch() hangs seemingly forever on this machine (it works fine on another arm64 server, a Cortex-A72). After removing only enable_dcache() / disable_dcache() from setup_arch() etc., while keeping the printf() lines, it did print out:

I'm in purgatory
purgatory: entry=0000000090080000
purgatory: dtb=0000000092d50000
purgatory: D-cache Enabled before SHA verification
purgatory: D-cache Disabled after SHA verification

So, this confirms that it must hang somewhere in arm64/kernel/head.S (.stext) or in the early part of start_kernel() before earlycon is initialized. It also confirms that passing nr_cpus=64 in the first kernel again makes everything work fine with this new kexec.

Since enable_dcache() hangs as well, I suspect this has something to do with enabling the MMU (i.e., .stext -> __primary_switch -> __enable_mmu) coupled with some sort of per-CPU data where the number of CPUs matters.

Right now, I think I need a way to print directly to the pl011 serial console while debugging that assembly code, something like CONFIG_DEBUG_LL for arm64, so it can be used to locate exactly where it hangs (a rough sketch of this is appended below the quoted mail). Otherwise, I am shooting in the dark.

[1] https://github.com/pratyushanand/kexec-tools

=== original email ===

On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash dump just hangs (on 4.20-rc6 as well as 4.18). It was confirmed that execution went as far as entering __cpu_soft_restart():

__crash_kexec
  machine_kexec
    cpu_soft_restart
      restart
        __cpu_soft_restart

Earlycon was enabled but there was no output from the 2nd kernel, so it was pretty much stuck in the assembly code in arm64/kernel/head.S or in the early part of start_kernel() before earlycon is initialized.

It turned out this has something to do with nr_cpus in the 1st kernel, although the 2nd kernel always has nr_cpus=1 [1]. It was tested with both crashkernel=512M and crashkernel=768M.

nr_cpus <= 96   GOOD (2nd kernel was up in 2-3 mins.)
nr_cpus=256     BAD  (2nd kernel was NOT up after 1 hour.)
nr_cpus=127     BAD  (2nd kernel was NOT up after 10 mins.)

I also tested with and without CONFIG_ARM64_VHE (i.e., el2_switch); it made no difference.

[1] KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 swiotlb=noforce reset_devices"

I am still figuring out a way to debug that assembly code down to where it actually hangs, but the server is hooked up to a conserver that cannot generate any sysrq, and I have no shell access to the conserver, so it seems difficult to use kgdb or kdb in this case.

CPU information:

# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           Cavium
Model:               1
Model name:          ThunderX2 99xx
Stepping:            0x1
BogoMIPS:            400.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
NUMA node1 CPU(s):   128-255
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
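
For the DEBUG_LL-style direct pl011 printing mentioned above, here is a minimal, untested sketch of a putc macro that could be dropped into arch/arm64/kernel/head.S. It assumes the UART physical base 0x602B0000 (taken from the kexec --port option above), the standard pl011 register layout (UARTDR at offset 0x00, UARTFR at offset 0x18, TXFF in bit 5), and that x16/x17 are free as scratch registers at the points where it is used; none of this has been verified on this board.

    /* Debugging aid only, not for merging. */
        .macro  early_putc, char
        movz    x17, #0x602B, lsl #16       // assumed pl011 physical base 0x602B0000
    1:  ldr     w16, [x17, #0x18]           // UARTFR (flag register)
        tbnz    w16, #5, 1b                 // wait while TXFF (TX FIFO full) is set
        mov     w16, #\char
        str     w16, [x17]                  // UARTDR (data register)
        .endm

Sprinkling "early_putc 'A'", "early_putc 'B'", ... between the steps of __primary_switch / __enable_mmu should narrow down the last point reached. Note that once __enable_mmu turns the MMU on, the physical UART address is no longer mapped, so markers placed after that point would need an early virtual mapping (or earlycon) instead.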