On Sun, 28 May 2006 02:06:03 +0100
Ralf Baechle <ralf@xxxxxxxxxxxxxx> wrote:

> On Sat, May 27, 2006 at 05:13:21PM -0400, Kumba wrote:
>
> > Finally managed to track down the git commit causing SGI IP32 (O2) systems
> > to lock up really early in the boot cycle, but I'm at a loss to understand
> > why.
> >
> > Effect:
> > It appears the system silently hangs somewhere in the void between function
> > calls when trying to invoke the memset() call in __alloc_bootmem_core() in
> > mm/bootmem.c. This puts the machine hardware in a state such that a simple
> > soft reset doesn't clear it -- the machine has to be cold booted to get it
> > to boot a working kernel again.
> >
> > Determined Cause:
> > It seems this commit:
> > 78eef01b0fae087c5fadbd85dd4fe2918c3a015f
> > [PATCH] on_each_cpu(): disable local interrupts
> >
> > Is the cause. I've verified this by reversing this one change on a
> > 2.6.17-rc4 tree, and it'll boot to a mini-userland (initramfs-based) and
> > appears to function normally.
> >
> >
> > But this is as far as I can trace this. I'm not sure what this change is
> > doing internally that's triggering this lockup on O2 systems. It doesn't
> > appear to affect Octane (IP30) systems or Origin (IP27). I haven't
> > test-ran it on IP22/IP28 hardware yet, so only IP32 is known to be
> > affected. Unsure about non-SGI MIPS hardware.
>
> on_each_cpu is re-enabling interrupts. This may crash the system if it
> happens before interrupt handlers have been installed.

on_each_cpu() calls smp_call_function(). It is not correct to call
smp_call_function() with local interrupts disabled, because it can lead
to deadlocks.

Therefore on_each_cpu() also must not be called with local interrupts
disabled.

Therefore on_each_cpu() may use local_irq_disable()/local_irq_enable().

> A while ago I've
> fixed all such calls but I may have missed some instances.
>
> Andrew, what was the reason for 78eef01b0fae087c5fadbd85dd4fe2918c3a015f ?
>

That change made the various calling environments consistent, as
described in the changelog.

If it's really, really not deadlocky to call smp_call_function() with
interrupts disabled at that time in the MIPS kernel bringup then I'd
suggest you should open-code an smp_call_function() and put a big comment
over it explaining why it's done this way, and why it isn't deadlocky.

<tries to remember what the deadlock is>

If CPU A is running smp_call_function() it's waiting for CPU B to run the
handler. But if CPU B is presently _also_ running smp_call_function(),
it's waiting for CPU A to run the handler. If either of those CPUs is
waiting for the other with local interrupts disabled, that CPU will never
respond to the other CPU's IPI and they'll deadlock.
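
For illustration only, here is a tiny userspace sketch of that mutual wait.
It is not kernel code: struct toy_cpu and toy_cross_call_completes() are
made-up names, and the model assumes each CPU keeps the interrupt state it
entered with for the whole run (it does not model locking or a caller
re-enabling interrupts later). Both CPUs raise an IPI on the other and then
spin until the other has run the handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU; every field is hypothetical, not a kernel structure. */
struct toy_cpu {
	bool irqs_enabled;  /* can this CPU take the other CPU's IPI? */
	bool ipi_pending;   /* an IPI from the other CPU is waiting */
	bool got_ack;       /* the other CPU has run our handler */
};

/*
 * Both CPUs enter a toy smp_call_function() at the same time: each sends an
 * IPI to the other and then spins until it is acknowledged. Returns true if
 * both calls complete within max_steps, false if at least one is still stuck.
 */
static bool toy_cross_call_completes(bool a_irqs_enabled, bool b_irqs_enabled,
				     int max_steps)
{
	struct toy_cpu cpu[2] = {
		{ .irqs_enabled = a_irqs_enabled },
		{ .irqs_enabled = b_irqs_enabled },
	};

	assert(max_steps > 0);

	/* Each CPU raises an IPI on the other one. */
	cpu[0].ipi_pending = true;
	cpu[1].ipi_pending = true;

	for (int step = 0; step < max_steps; step++) {
		for (int i = 0; i < 2; i++) {
			/* A spinning CPU services the IPI only if its interrupts are on. */
			if (cpu[i].ipi_pending && cpu[i].irqs_enabled) {
				cpu[i].ipi_pending = false;
				cpu[1 - i].got_ack = true;
			}
		}
		if (cpu[0].got_ack && cpu[1].got_ack)
			return true;
	}
	return false;
}
```

In this model the pair only finishes when both CPUs wait with interrupts
enabled; as soon as one of them spins with interrupts off, the other's
request is never answered, no matter how large max_steps is.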