> From: Michal Hocko [mailto:mhocko@xxxxxxxxxx]
>
> On Thu 30-07-15 01:45:35, 河合英宏 / KAWAI,HIDEHIRO wrote:
> > Hi,
> >
> > > From: Michal Hocko [mailto:mhocko@xxxxxxxxxx]
> > >
> > > On Wed 29-07-15 09:09:18, 河合英宏 / KAWAI,HIDEHIRO wrote:
> [...]
> > > > #define nmi_panic(fmt, ...)                                        \
> > > >         do {                                                       \
> > > >                 if (atomic_cmpxchg(&panic_cpu, -1, raw_smp_processor_id()) \
> > > >                     == -1)                                         \
> > > >                         panic(fmt, ##__VA_ARGS__);                 \
> > > >         } while (0)
> > >
> > > This would allow to return from NMI too eagerly.
> >
> > Yes, but what's the problem?
>
> I believe that panic should be noreturn as much as possible and return
> only when we do not have any other options.

In my case, an external NMI is delivered only to CPU 0. This means the
other CPUs continue to run until they receive an NMI IPI. I think
returning from the NMI handler soon doesn't differ much from this.

> Moreover I would ask an
> opposite question, what is the problem to loop in NMI on other CPUs than
> the one which is performing crash_kexec? We will not save registers, so
> what?

Actually, we have to at least save registers, clean up VMX/SVM state,
and announce that the CPU has stopped. Just returning from the NMI
handler is simpler.

> > The root cause of your case hasn't been clarified yet.
> > I can't fix for an unclear issue because I don't know what's the right
> > solution.
> >
> > > When I was testing my
> > > previous approach (on 3.0 based kernel) I had basically the same thing
> > > (one NMI to process panic) and others to return. This led to a strange
> > > behavior when the NMI button triggered NMI on all (hundreds) CPUs.
> >
> > It's strange. Usually, NMI caused by NMI button is routed to only CPU 0
> > as an external NMI. External NMI for CPUs other than CPU 0 are masked
> > at boot time. Does it really happen?
>
> Could you point me to the code which does that, please? Maybe we are
> missing that in our 3.0 kernel. I was quite surprised to see this
> behavior as well.

Please see the snippet below.
void setup_local_APIC(void)
{
...
        /*
         * only the BP should see the LINT1 NMI signal, obviously.
         */
        if (!cpu)
                value = APIC_DM_NMI;
        else
                value = APIC_DM_NMI | APIC_LVT_MASKED;
        if (!lapic_is_integrated())             /* 82489DX */
                value |= APIC_LVT_LEVEL_TRIGGER;
        apic_write(APIC_LVT1, value);

The LINT1 pins of CPUs other than CPU 0 are masked here. However, at
least on some Hitachi servers, the NMI caused by the NMI button doesn't
seem to be delivered through LINT1. So my term `external NMI' may not
be correct.

> > Does the problem still happen on the latest kernel?
>
> I do not have machine accessible so I have to rely on the customer to
> test and the current vanilla might be an issue.

Sure.

> > What kind of NMI is delivered to each CPU?
>
> See the log below.
>
> > Traditionally, we should have assumed that NMI for crash dumping is
> > delivered to only one cpu. Otherwise, we should often fail to take
> > a proper crash dump.
>
> You might still get a panic on hardlockup which will happen on all CPUs
> from the NMI context so we have to be able to handle panic in NMI on
> many CPUs.

Are you talking about the case of a kernel panic while other CPUs are
locked up in NMI context? In that case, there is no way to do the
things the kdump procedure needs, including saving registers...

> > It seems that your case is another problem to be solved separately.
>
> I do not think so, quite contrary. If you want to solve the reentrancy
> then other CPUs might be spinning in NMI if there is a guarantee that at
> least one CPU can progress to finish crash_kexec().
>
> > > The
> > > crash kernel booted eventually but the log contained lockups when a
> > > CPU waited for an IPI to the CPU which was handling the NMI panic.
> >
> > Could you explain more precisely?
>
> [ 167.843761] Uhhuh. NMI received for unknown reason 3d on CPU 130.
> [ 167.843763] Do you have a strange power saving mode enabled?
> [... Mangled output ....]
> [ 167.856415] Dazed and confused, but trying to continue
> [ 167.856428] Dazed and confused, but trying to continue
> [ 167.856442] Dazed and confused, but trying to continue
> [...]
> [ 193.108440] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
> [...]
> [ 193.108586] Call Trace:
> [ 193.108595] [<ffffffff8109baeb>] smp_call_function_single+0x15b/0x170
> [ 193.108600] [<ffffffff8109bb4e>] smp_call_function_any+0x4e/0x110
> [ 193.108607] [<ffffffffa04a332c>] get_cur_val+0xbc/0x130 [acpi_cpufreq]
> [ 193.108630] [<ffffffffa04a3417>] get_cur_freq_on_cpu+0x77/0xf0 [acpi_cpufreq]
> [ 193.108638] [<ffffffff8137bc37>] cpufreq_update_policy+0x97/0x140
> [ 193.108646] [<ffffffffa00ca04b>] acpi_processor_notify+0x4b/0x145 [processor]
> [ 193.108654] [<ffffffff812d2eca>] acpi_ev_notify_dispatch+0x61/0x77
> [ 193.108659] [<ffffffff812c1785>] acpi_os_execute_deferred+0x21/0x2c
> [ 193.108667] [<ffffffff8107d03c>] process_one_work+0x16c/0x350
> [ 193.108673] [<ffffffff8107fd6a>] worker_thread+0x17a/0x410
> [ 193.108679] [<ffffffff81084136>] kthread+0x96/0xa0
> [ 193.108688] [<ffffffff8146df64>] kernel_thread_helper+0x4/0x10
> [...]
> [ 221.068390] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
> [...]
> [ 227.991235] INFO: rcu_sched_state detected stalls on CPUs/tasks: { 130} (detected by 56, t=15002 jiffies)
> [ 227.991247] sending NMI to all CPUs:
> [ 227.991251] NMI backtrace for cpu 0
> [ 229.074091] INFO: rcu_bh_state detected stalls on CPUs/tasks: { 130} (detected by 105, t=15013 jiffies)
> [ 0.000000] Initializing cgroup subsys cpuset
> [ 0.000000] Initializing cgroup subsys cpu
> [ 0.000000] Linux version 3.0.101-0.47.55.9.8853.0.TEST-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch
> revision 152973] (SUSE Linux) ) #1 SMP Thu May 28 08:25:11 UTC 2015 (dc083ee)
> [ 0.000000] Command line: root=/dev/system/lvroot resume=/dev/system/lvswap intel_idle.max_cstate=0
> processor.max_cstate=0 elevator=deadline nmi_watchdog=1 console=tty0 console=ttyS1,115200 elevator=deadline sysrq=yes
> reset_devices irqpoll maxcpus=1 disable_cpu_apicid=0 noefi acpi_rsdp=0xba7a4014 crashkernel=1024M-:512M memmap=exactmap
> memmap=576K@64K memmap=523684K@393216K elfcorehdr=916900K memmap=32768K#3018748K memmap=3736K#3051516K
> memmap=262144K$3145728K
>
> I can provide the full log but it is quite mangled. I guess the
> CPU130 was the only one allowed to proceed with the panic while others
> returned from the unknown NMI handling. It took a lot of time until
> CPU130 managed to boot the crash kernel with soft lockups and RCU stalls
> reports. CPU0 is most probably locked up waiting for CPU130 to
> acknowledge the IPI which will not happen apparently.

There is a timeout of 1000ms in nmi_shootdown_cpus(), so I don't know
why CPU 130 waits so long. I'll think about it for a while.

> Maybe this is not possible in the current kernels for some reason but it
> tells me that returning from panic is quite fragile so I would like to
> prevent from it as much as possible.
>
> > > Anyway, I do not think this is really necessary to solve the panic
> > > reentrancy issue.
> > > If the missing saved state is a real problem then it
> > > should be handled separately - maybe it can be achieved without an IPI
> > > and directly from the panic context if we are in NMI.
> >
> > What I would like to do via this patchset is to solve race issues
> > among NMI, panic() and crash_kexec().
>
> Yes I fully support you in this ;) I just believe that spinning in NMI
> vs. saving registers is a separate issue.

OK, but I'm going to address it in this series because the issue is
caused by simultaneous panic and NMIs.

>
> > So, I don't think we should fix that separately, although I would need
> > to reword some descriptions and titles.
>
> I can have them tested.

Thanks a lot!

Regards,
Kawai
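P.S. For illustration only, here is a user-space sketch of the "first
CPU wins" gate used by the nmi_panic() macro quoted above. It assumes
C11 <stdatomic.h> in place of the kernel's atomic_cmpxchg(), and
try_claim_panic(), demo_nmi_race() and PANIC_CPU_INVALID are
hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed sentinel: -1 means no CPU has started panic() yet. */
#define PANIC_CPU_INVALID (-1)

/*
 * Returns true only for the first caller, mirroring the
 * atomic_cmpxchg(&panic_cpu, -1, cpu) == -1 test in the macro.
 * Every later caller sees a non-sentinel owner and gets false,
 * which in the macro corresponds to returning from the NMI handler.
 */
static bool try_claim_panic(atomic_int *owner, int cpu)
{
	int expected = PANIC_CPU_INVALID;

	return atomic_compare_exchange_strong(owner, &expected, cpu);
}

/* Two CPUs take an NMI; only the first one proceeds to panic(). */
static void demo_nmi_race(void)
{
	atomic_int owner;

	atomic_init(&owner, PANIC_CPU_INVALID);
	assert(try_claim_panic(&owner, 130));	/* CPU 130 wins */
	assert(!try_claim_panic(&owner, 0));	/* CPU 0 would return */
	assert(atomic_load(&owner) == 130);	/* owner is unchanged */
}
```

The point of the sketch is only that the compare-and-exchange lets
exactly one CPU through; whether the losing CPUs should return or spin
is the question discussed in this thread.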