So, when you run an NFS server and have a lot of clients simultaneously writing to the exported volumes, you should expect a high load average. In fact, on a fully loaded NFS server I'd expect to see a load average equal to the number of nfsd threads you have configured in /etc/sysconfig/nfs, plus whatever CPU load your system normally has.

Why is this? On Linux, the load average counts tasks that are either runnable or in uninterruptible sleep, which usually means blocked on I/O. When you're serving NFS, each nfsd thread spends most of its time waiting on disk or network I/O, so each busy thread adds one to the load average. This is very common and expected behavior. You'll often see the same thing on web servers when there are a lot of httpds waiting on I/O (for example, if they're hit by a denial-of-service attack that leaves half-open connections). The actual amount of CPU activity on these systems usually isn't that high; in most cases the CPU is just sitting around waiting for the remote end or the local disk to return data.

On Wed, Jun 26, 2013 at 3:59 PM, Doll, Margaret Ann <margaret_doll@xxxxxxxxx> wrote:
> The users' home directories are NFS-mounted on the compute nodes.
>
> On Wed, Jun 26, 2013 at 3:35 PM, Jonathan Billings <jsbillin@xxxxxxxxx> wrote:
> > Hello,
> >
> > Is your head node an NFS server, and are the jobs writing to the NFS share?
> >
> > On Wed, Jun 26, 2013 at 3:27 PM, Doll, Margaret Ann <margaret_doll@xxxxxxxxx> wrote:
> > > I have a computer cluster running Rocks 5.2, CentOS 6.
> > >
> > > The head node is overloaded. There are two CPUs on the head node.
> > > top - 14:27:49 up 1 day, 6:11, 6 users, load average: 13.65, 14.12, 13.92
> > > Tasks: 168 total, 3 running, 163 sleeping, 0 stopped, 2 zombie
> > > Cpu(s): 1.2%us, 1.9%sy, 0.0%ni, 0.0%id, 91.7%wa, 1.0%hi, 4.1%si, 0.0%st
> > > Mem:   2053088k total,  2001464k used,    51624k free,    74476k buffers
> > > Swap:  1020116k total,      388k used,  1019728k free,  1638076k cached
> > >
> > >   PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
> > >  2515 nobody  15   0  218m 3176 1048 S  2.3  0.2 8:46.23 gmetad
> > >  2967 root    15   0     0    0    0 S  2.0  0.0 0:20.31 nfsd
> > >  2970 root    15   0     0    0    0 R  1.0  0.0 0:20.60 nfsd
> > >  3110 nobody  15   0  198m  20m 3360 S  0.3  1.0 4:22.71 gmond
> > > 29788 mad     15   0 90736 2336 1084 S  0.3  0.1 0:02.91 sshd
> > >     1 root    15   0 10372  684  572 S  0.0  0.0 0:00.51 init
> > >     2 root    RT  -5     0    0    0 S  0.0  0.0 0:00.00 migration/0
> > >     3 root    34  19     0    0    0 S  0.0  0.0 0:00.00 ksoftirqd/0
> > >     4 root    RT  -5     0    0    0 S  0.0  0.0 0:00.00 watchdog/0
> > >
> > > I have everyone logged off of the head node. Four jobs are running on the
> > > compute nodes, but I believe they are non-parallel jobs, which cause no
> > > traffic on the head node. The load_avg on each of the compute nodes is
> > > less than 8. Each compute node has 8 CPUs.
> > >
> > > How can I find the problem? I have seen the zombies go as high as 2 on
> > > the head node; most of the time there are 0 zombies.
> > >
> > > I did reboot the head node, but the problem comes back fairly quickly.
> > > --
> > > redhat-list mailing list
> > > unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
> > > https://www.redhat.com/mailman/listinfo/redhat-list

--
Jonathan Billings <jsbillin@xxxxxxxxx>
College of Engineering - CAEN - Unix and Linux Support