Alok Kataria <akataria at vmware.com> writes: > Hi Eric, > > Thanks for taking a look. > > On Mon, 2010-10-11 at 11:07 -0700, Eric W. Biederman wrote: >> Alok Kataria <akataria at vmware.com> writes: >> >> > Copying a few more maintainers on the thread... >> > On Fri, 2010-10-08 at 13:34 -0700, Alok Kataria wrote: >> >> Before starting the new kernel kexec calls machine_shutdown to stop all >> >> the cpus, which internally calls native_smp_send_stop. AFAIU, kexec >> >> expects that all the cpus are now halted after that call returns. >> >> Now, looking at the code for native_smp_send_stop, it assumes that all >> >> the processors have processed the REBOOT ipi in 1 second after the IPI >> >> was sent. >> >> native_smp_send_stop() >> >> --------------------------------------------------------- >> >> apic->send_IPI_allbutself(REBOOT_VECTOR); >> >> >> >> /* Don't wait longer than a second */ >> >> wait = USEC_PER_SEC; >> >> while (num_online_cpus() > 1 && wait--) >> >> udelay(1); >> >> --------------------------------------------------------- >> >> >> >> It just returns after that 1 second irrespective of whether all cpus >> >> were halted or not. This brings up a issue in the kexec case, since we >> >> can have the BSP starting the new kernel and AP's still processing the >> >> REBOOT IPI simultaneously. >> >> >> >> Many distribution kernels use kexec to load the newly installed kernel >> >> during the installation phase, in virtualized environment with the host >> >> heavily overcommitted, we have seen some instances when vcpu fails to >> >> process the IPI in the allotted 1 sec and as a result the AP's end up >> >> accessing uninitialized state (the BSP has already gone ahead with >> >> setting up the new state) and causing GPF's. >> >> >> >> IMO, kexec expects machine_shutdown to return only after all cpus are >> >> stopped. >> >> >> >> The patch below should fix the issue, comments ?? >> >> >> >> -- >> >> machine_shutdown now takes a parameter "wait", if it is true, it waits >> >> until all the cpus are halted. All the callers except kexec still >> >> fallback to the earlier version of the shutdown call, where it just >> >> waited for max 1 sec before returning. >> >> The approach seems reasonable. However on every path except for the >> panic path I would wait for all of the cpus to stop, as that is what >> those code paths are expecting. Until last year smp_send_stop always >> waited until all of the cpus stopped. So I think changing >> machine_shutdown should not need to happen. >> >> This patch cannot be used as is because it changes the parameters to >> smp_send_stop() and there are more than just x86 implementations out >> there. >> >> Let me propose a slightly different tactic. We need to separate >> the panic path from the normal path to avoid confusion. At the >> generic level smp_send_stop is exclusively used for the panic >> path. So we should keep that, and rename the x86 non-panic path >> function something else. >> >> Can you rename the x86 smp_send_stop function say stop_all_other_cpus(). >> >> At which point we could implement smp_send_stop as: >> stop_all_other_cpus(0); >> >> And everywhere else would call stop_all_other_cpus(1) waiting for >> the cpus to actually stop. > > Done, I have added a stop_other_cpus function which calls > smp_ops.stop_other_cpus(1) > >> >> I really think we want to wait for the cpus to stop in all cases except >> for panic/kdump where we expect things to be broken and we are doing >> our best to make things work anyway. > Now it does, except in the panic case. > > Jeremy, I have modified the xen_reboot bits too so that it now waits > until all cpus are actually stopped, please take a look. > > -- > > x86 smp_ops now has a new op, stop_other_cpus which takes a parameter "wait" > this allows the caller to specify if it wants to stop untill all the cpus > have processed the stop IPI. This is required specifically for the kexec case > where we should wait for all the cpus to be stopped before starting the new > kernel. > We now wait for the cpus to stop in all cases except for panic/kdump where > we expect things to be broken and we are doing our best to make things > work anyway. One little nit. > Signed-off-by: Alok N Kataria <akataria at vmware.com> > Cc: Eric W. Biederman <ebiederm at xmission.com> > Cc: Jeremy Fitzhardinge <jeremy at xensource.com> > > Index: linux-2.6/arch/x86/include/asm/smp.h > =================================================================== > --- linux-2.6.orig/arch/x86/include/asm/smp.h 2010-10-11 11:56:40.000000000 -0700 > +++ linux-2.6/arch/x86/include/asm/smp.h 2010-10-11 12:00:22.000000000 -0700 > @@ -50,7 +50,7 @@ struct smp_ops { > void (*smp_prepare_cpus)(unsigned max_cpus); > void (*smp_cpus_done)(unsigned max_cpus); > > - void (*smp_send_stop)(void); > + void (*stop_other_cpus)(int wait); > void (*smp_send_reschedule)(int cpu); > > int (*cpu_up)(unsigned cpu); > @@ -73,7 +73,12 @@ extern struct smp_ops smp_ops; > > static inline void smp_send_stop(void) > { > - smp_ops.smp_send_stop(); > + smp_ops.stop_other_cpus(0); > +} > + > +static inline void stop_other_cpus(void) > +{ > + smp_ops.stop_other_cpus(1); > } > > static inline void smp_prepare_boot_cpu(void) > Index: linux-2.6/arch/x86/kernel/reboot.c > =================================================================== > --- linux-2.6.orig/arch/x86/kernel/reboot.c 2010-10-11 11:56:40.000000000 -0700 > +++ linux-2.6/arch/x86/kernel/reboot.c 2010-10-11 12:05:02.000000000 -0700 > @@ -641,7 +641,7 @@ void native_machine_shutdown(void) > /* O.K Now that I'm on the appropriate processor, > * stop all of the others. > */ > - smp_send_stop(); > + stop_other_cpus(); > #endif > > lapic_shutdown(); > Index: linux-2.6/arch/x86/kernel/smp.c > =================================================================== > --- linux-2.6.orig/arch/x86/kernel/smp.c 2010-10-11 11:56:40.000000000 -0700 > +++ linux-2.6/arch/x86/kernel/smp.c 2010-10-11 12:16:42.000000000 -0700 > @@ -159,10 +159,10 @@ asmlinkage void smp_reboot_interrupt(voi > irq_exit(); > } > > -static void native_smp_send_stop(void) > +static void native_stop_other_cpus(int wait) > { > unsigned long flags; > - unsigned long wait; > + unsigned long timeout; > > if (reboot_force) > return; > @@ -179,9 +179,12 @@ static void native_smp_send_stop(void) > if (num_online_cpus() > 1) { > apic->send_IPI_allbutself(REBOOT_VECTOR); > > - /* Don't wait longer than a second */ > - wait = USEC_PER_SEC; > - while (num_online_cpus() > 1 && wait--) > + /* > + * Don't wait longer than a second if the caller > + * didn't ask us to wait. > + */ > + timeout = USEC_PER_SEC; > + while (num_online_cpus() > 1 && (wait || timeout--)) > udelay(1); > } > > @@ -227,7 +230,7 @@ struct smp_ops smp_ops = { > .smp_prepare_cpus = native_smp_prepare_cpus, > .smp_cpus_done = native_smp_cpus_done, > > - .smp_send_stop = native_smp_send_stop, > + .stop_other_cpus = native_stop_other_cpus, > .smp_send_reschedule = native_smp_send_reschedule, > > .cpu_up = native_cpu_up, > Index: linux-2.6/arch/x86/xen/enlighten.c > =================================================================== > --- linux-2.6.orig/arch/x86/xen/enlighten.c 2010-10-11 11:56:40.000000000 -0700 > +++ linux-2.6/arch/x86/xen/enlighten.c 2010-10-11 11:58:53.000000000 -0700 > @@ -1018,7 +1018,7 @@ static void xen_reboot(int reason) > struct sched_shutdown r = { .reason = reason }; > > #ifdef CONFIG_SMP > - smp_send_stop(); > + stop_other_cpus(); > #endif > > if (HYPERVISOR_sched_op(SCHEDOP_shutdown, &r)) > Index: linux-2.6/arch/x86/xen/smp.c > =================================================================== > --- linux-2.6.orig/arch/x86/xen/smp.c 2010-10-11 11:56:40.000000000 -0700 > +++ linux-2.6/arch/x86/xen/smp.c 2010-10-11 11:58:53.000000000 -0700 > @@ -400,9 +400,9 @@ static void stop_self(void *v) > BUG(); > } > > -static void xen_smp_send_stop(void) > +static void xen_stop_other_cpus(int wait) > { > - smp_call_function(stop_self, NULL, 0); > + smp_call_function(stop_self, NULL, wait); > } > > static void xen_smp_send_reschedule(int cpu) > @@ -470,7 +470,7 @@ static const struct smp_ops xen_smp_ops > .cpu_disable = xen_cpu_disable, > .play_dead = xen_play_dead, > > - .smp_send_stop = xen_smp_send_stop, > + .stop_other_cpus = xen_stop_other_cpus, > .smp_send_reschedule = xen_smp_send_reschedule, > > .send_call_func_ipi = xen_smp_send_call_function_ipi, > Index: linux-2.6/include/linux/smp.h > =================================================================== > --- linux-2.6.orig/include/linux/smp.h 2010-10-11 11:56:40.000000000 -0700 > +++ linux-2.6/include/linux/smp.h 2010-10-11 12:07:19.000000000 -0700 > @@ -116,7 +116,8 @@ extern unsigned int setup_max_cpus; > > #else /* !SMP */ > > -static inline void smp_send_stop(void) { } > +static inline void smp_send_stop() { } > +static inline void stop_other_cpus() { } You shouldn't change smp.h. For the moment stop_other_cpus() is x86 specific so we shouldn't impose it in other architectures. Also note removing the explicit void is a bad idea in C because that is equivalent to saying: smp_send_stop(...) { } Which allows any set of parameters you can think of. Eric