It depends on the architecture and on whether the stock kernel.org kernel
has been modified. The kernel.org kernel does the following:

x86_64 has nmi_watchdog off by default
i386 has nmi_watchdog on by default
no other arches have an NMI watchdog that I am aware of.

The NMI watchdog simply prints out a backtrace when interrupts are off
for too long. This happens when a buggy driver or kernel code disables
interrupts on a processor and doesn't reenable them. The NMI watchdog
is then not fed, and it triggers a stack backtrace (instead of a total
lockup), which allows someone experienced in kernel development to find
the source of the offending lock and fix the kernel code.

I really doubt you will encounter this sort of failure if you are using
a commercial vendor kernel with the supplied drivers; this feature is
generally used during development of kernel code.

Some vendor kernels do special things when the NMI watchdog fires, like
taking a system memory dump and then rebooting, to allow the vendor to
debug the crash at a later time.

Regards
-steve

On Wed, 2005-12-21 at 16:50 -0200, Celso K. Webber wrote:
> Hi Lon,
>
> Thank you very much for your reply. I'll try your tips.
>
> Now another question: is it really necessary to pass the
> "nmi_watchdog=1" parameter to the kernel? Or is it enabled by default
> under RHELv3 or v4?
>
> Regards,
>
> Celso.
>
> Lon Hohberger wrote:
>
> >On Wed, 2005-12-21 at 16:25 -0200, Celso K. Webber wrote:
> >
> >>Has anyone had this issue before? Or am I missing a step in
> >>configuring the software watchdog feature?
> >>
> >>Another question for the Red Hat people on the list: does this "software
> >>watchdog" work OK? I ask because it's enabled by default when you add a
> >>new member to the cluster. The Cluster Suite v3 manual says nothing
> >>about this resource either.
> >
> >Yes, it works fine.
> >
> >A few things could be happening:
> >
> >(1) The NMI watchdog will reboot the machine if it detects an NMI hang.
> >This is only a few seconds.
> >
> >(2) The cluster is extremely paranoid because you are not using a
> >STONITH device (power controller), and it's detecting internal hangs.
> >Try increasing the failover time.
> >
> >(3) The cluster is not getting scheduled due to system load. See the
> >man page for cludb(8) about clumembd%rtp - both may help.
> >
> >-- Lon
> >
> > --
> > Linux-cluster@xxxxxxxxxx
> > https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
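
For anyone who wants to check what their own kernel is doing rather than rely on the per-arch defaults discussed above, here is a minimal sketch in Python. It is not from the thread; the helper names nmi_watchdog_param and nmi_count are hypothetical. It assumes a Linux system that exposes /proc/cmdline and /proc/interrupts, looks for an explicit nmi_watchdog= boot parameter, and sums the per-CPU counters on the NMI: line. An NMI count that keeps growing between runs usually suggests the watchdog is active, but treat that as a hint, not proof.

```python
import os


def nmi_watchdog_param(cmdline):
    """Return the value given to nmi_watchdog= on a kernel command line.

    Returns None when the parameter is absent, in which case the kernel
    falls back to its built-in default for the architecture.
    """
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            return token.split("=", 1)[1]
    return None


def nmi_count(interrupts_text):
    """Sum the per-CPU counters on the NMI: line of /proc/interrupts text.

    Returns None when there is no NMI: line at all.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            # Skip any trailing description words; keep numeric counters only.
            return sum(int(f) for f in fields[1:] if f.isdigit())
    return None


if __name__ == "__main__":
    # Assumption: these procfs files exist (Linux only), so check first.
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as f:
            print("nmi_watchdog parameter:", nmi_watchdog_param(f.read()))
    if os.path.exists("/proc/interrupts"):
        with open("/proc/interrupts") as f:
            print("NMI interrupts so far:", nmi_count(f.read()))
```

Running it twice a few seconds apart and comparing the NMI totals is the simplest way to use it.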