On Fri, Jan 20, 2012 at 03:16:58AM +0530, Srivatsa S. Bhat wrote:
> [Reinstating the original Cc list]
>
> On 01/19/2012 09:50 PM, Mel Gorman wrote:
>
> > On a different x86-64 machine with an Intel-specific MCE, I have
> > also noted that the value of num_online_cpus() can change while
> > stop_machine() is running.
>
> That is expected and intentional right? Meaning, it is during the
> stop_machine() thing itself that a CPU is actually taken offline.
> And at the same time, it is removed from the cpu_online_mask.
>

It's intentional sometimes and not others. The machine does halt
sometimes and stays there.

> On Intel boxes, essentially, the following gets executed on the dying
> CPU, as set up by the stop_machine stuff.
>
> __cpu_disable()
>     native_cpu_disable()
>         cpu_disable_common()
>             remove_cpu_from_maps()
>                 set_cpu_online(cpu, false)
>                                     ^^^^^^
> So, set_cpu_online will remove this CPU from the cpu_online_mask.
> And all this runs while still under the stop machine context.
> And this is exactly what we want right?
>

We don't want it to halt in stop_machine forever, waiting on
acknowledgements that are never received until the NMI handler fires.

> > This is sensitive to timing and part of
> > the problem seems to be due to cmci_rediscover() running without the
> > CPU hotplug mutex held. This is not related to the IPI mess and is
> > unrelated to memory pressure but is just to note that CPU hotplug in
> > general can be fragile in parts.
> >
>
> For the cmci_rediscover() part, I feel a simple get/put_online_cpus()
> around it should work.
>

Yeah, that's the first thing I tried too. It doesn't work though.

--
Mel Gorman
SUSE Labs