Hi! Remember the fs.sh status-check mayhem I reported a while ago? Now, the ghost-like load flux was there, but the system getting stuck wasn't (only) because of the excessive number of execs - it was, plain and simple, memory starvation. *sigh*

Anyway, now that I (or, to be exact, my servers) have enough memory, I've noticed that the inexplicable load flux hasn't gone anywhere. At a more-or-less regular 11-hour interval there is a four-hour-long peak in the load, shaped like an elf's pointy hat. (On an otherwise idle system the height of the peak is about 6.0. If there is load caused by something "real", the peak sits on top of the other load - it looks as if it just adds up linearly.) I'm seriously beginning to consider the possibility that there are elves in my kernel, since I can't see the peaks anywhere other than in the load averages: CPU usage, number of processes, IP/TCP/UDP traffic, IO load, paging activity - nothing reflects the load peaks. I also had a look at the process accounting statistics during a peak and outside a peak, but couldn't see any difference.

One suggestion my colleague had was that the peaks might be caused by the cluster somehow changing the 'lead' node - somewhere inside the kernel, at such a low level that it can't be noticed anywhere except in the load. That was because there is a phase difference between the peaks on the different nodes. It didn't sound very credible to me, but I'll ask anyway: could something like that be going on?

On the other hand, on the one node in the cluster that doesn't have rgmanager running (it's in the cluster only so that there isn't an even number of nodes), I'm not seeing these elves. And I have another cluster that had the elf hats before I added an exit 0 to its fs.sh status checks - it doesn't have them anymore. The difference between these two clusters is that the cluster with elves has a lot more active cluster services than the one without; that is, a lot more, say, ip.sh execs. I wonder if these, beyond a certain limit, could affect the load the same way the excess fs.sh execs did? Next, I think I'm going to put an exit 0 into the status checks of ip.sh (roughly the change sketched in the P.S. below) and see if the elves go away. Then I'm going to start wondering whether the cluster would even notice our server room falling apart... ;)

Any suggestions? At this point I'm no longer even certain that the problem lies within the cluster. On the other hand, since I see no difference at the process level between peak and no-peak times, the difference must (as far as I understand) be inside the kernel. So it can't be my application. So it must be the cluster, mustn't it?

Thanks.

--Janne

--
Janne Peltonen <janne.peltonen@xxxxxxxxxxx>
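
P.S. The "exit 0" change I mean is roughly the following - just a minimal sketch, assuming the agent gets the action name in $1 the way the rgmanager agents under /usr/share/cluster/ do (the exact case labels may differ between versions, and the real fs.sh/ip.sh obviously do a lot more than this):

  #!/bin/bash
  # Sketch of a short-circuited resource-agent status check
  # (not the real fs.sh/ip.sh).
  case "$1" in
      status|monitor)
          # Skip the real check and report success immediately, so the
          # periodic status polling from rgmanager costs next to nothing.
          exit 0
          ;;
      start|stop)
          # In the real agent the mount/umount (fs.sh) or IP add/remove
          # (ip.sh) logic lives here; I left everything except the
          # status path alone.
          ;;
  esac
  exit 0

Of course, with that in place rgmanager happily reports the service as healthy even when it isn't, so it's only useful for ruling the status checks in or out as the cause of the peaks, not as a permanent fix.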