Slowness on a head node

For the last week and a half we have been experiencing a slowdown on our cluster's head node. We are running Rocks 3 on a Red Hat OS, kernel 2.6.18-53.1.14.el5. The last downtime was 174 days ago, and we have run successfully with 160 out of 176 compute nodes running queued jobs. 16 of the
cores are reserved for interactive jobs.

No job appears to be running on the head node. Over half the memory is in use, but there is still plenty left. I know that large sftp transfers can slow the system down, but my users say
that their transfers are finished.

Where else should I look for the problem? There are no pending queue jobs currently, but there were
pending jobs during this last week and a half period.

top

Tasks: 142 total,   1 running, 141 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 54.5%id, 39.9%wa, 0.0%hi, 3.1%si, 0.0%st
Mem:   2054132k total,  1278532k used,   775600k free,   214924k buffers
Swap:  1020116k total,   799336k used,   220780k free,   801988k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2874 root      15   0     0    0    0 D    2  0.0  40:26.07 nfsd
 2871 root      15   0     0    0    0 S    1  0.0  40:07.30 nfsd
 2873 root      15   0     0    0    0 S    1  0.0  40:22.56 nfsd
 1697 root      10  -5     0    0    0 D    1  0.0  31:54.46 kjournald
 2872 root      15   0     0    0    0 D    1  0.0  40:50.30 nfsd
 2837 root      15   0 12712 1080  788 R    0  0.1   0:00.31 top
    1 root      15   0 10312   80   48 S    0  0.0   0:39.35 init
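
The snapshot above shows 39.9%wa together with several processes in state D (uninterruptible sleep, which usually means waiting on I/O). Below is a minimal Python sketch for pulling those two signals out of a pasted top snapshot. The function names are hypothetical, the sample text is copied from the output above, and the parser assumes the default top column layout shown there.

```python
import re

# Sample copied from the top output in this post.
TOP_OUTPUT = """\
Cpu(s): 0.0%us, 2.6%sy, 0.0%ni, 54.5%id, 39.9%wa, 0.0%hi, 3.1%si, 0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2874 root      15   0     0    0    0 D    2  0.0  40:26.07 nfsd
 2871 root      15   0     0    0    0 S    1  0.0  40:07.30 nfsd
 2873 root      15   0     0    0    0 S    1  0.0  40:22.56 nfsd
 1697 root      10  -5     0    0    0 D    1  0.0  31:54.46 kjournald
 2872 root      15   0     0    0    0 D    1  0.0  40:50.30 nfsd
 2837 root      15   0 12712 1080  788 R    0  0.1   0:00.31 top
    1 root      15   0 10312   80   48 S    0  0.0   0:39.35 init
"""


def iowait_percent(text):
    """Return the %wa value from top's Cpu(s) line, or None if absent."""
    match = re.search(r"([\d.]+)%wa", text)
    return float(match.group(1)) if match else None


def blocked_processes(text):
    """Return (pid, command) for every process row whose state column is D."""
    blocked = []
    for line in text.splitlines():
        fields = line.split()
        # Process rows start with a numeric PID; the state is the 8th column
        # and the command is the 12th in the default layout.
        if len(fields) >= 12 and fields[0].isdigit() and fields[7] == "D":
            blocked.append((int(fields[0]), fields[11]))
    return blocked


if __name__ == "__main__":
    print("iowait:", iowait_percent(TOP_OUTPUT))
    print("blocked:", blocked_processes(TOP_OUTPUT))
```

On this snapshot the blocked processes are three kernel threads (two nfsd and kjournald), which points toward disk or NFS activity rather than a user job on the head node.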

qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
all.q                             0.78      0    160    160      0      0
chemistry                         0.98      0    128    128      0      0
group1                            1.01     64      0     64      0      0
group3                            0.00      0     32     32      0      0
group3-24hr                       0.00      0     32     32      0      0
group3-2hr                        0.00      0     32     32      0      0
mem16.q                           0.94      0     16     16      0      0
mem4.q                            0.91      0      8     32     24      0
mem8.q                            1.01      0     64     80     16      0
group2                            0.94     61      3     64      0      0

 finger
Login     Name       Tty      Idle  Login Time   Office     Office Phone
acct2                pts/5   18:37  Mar 26 14:05
acct1                pts/2   12:57  Mar 25 12:14
acct4                pts/4          Mar 23 09:36
acct5                pts/8     10d  Mar  9 13:28
acct3                pts/1   19:03  Mar 26 13:41

 ps -ef | grep sftp-server
root      2880  9542  0 08:47 pts/4    00:00:00 grep sftp-server
acct1    20700 20699  0 Mar25 ?        00:00:00 csh -c /usr/libexec/openssh/sftp-server
acct1    20829 20700  0 Mar25 ?        00:00:02 /usr/libexec/openssh/sftp-server
acct2    26707 26706  0 Mar26 ?        00:00:00 /usr/libexec/openssh/sftp-server
acct3    31337 31336  0 Mar26 ?        00:00:00 csh -c /usr/libexec/openssh/sftp-server
acct3    31466 31337  0 Mar26 ?        00:00:00 /usr/libexec/openssh/sftp-server

iostat
Linux 2.6.18-53.1.14.el5 (system.edu)        03/27/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.83    0.00    0.41    2.34    0.00   92.42

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              37.34       309.74       582.06 4681624648 8797649022


mpstat -P ALL
Linux 2.6.18-53.1.14.el5 (system.edu)        03/27/2009

08:48:37 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:48:37 AM  all    4.83    0.00    0.31    2.34    0.02    0.09    0.00   92.42    267.91
08:48:37 AM    0    4.26    0.00    0.27    1.15    0.00    0.02    0.00   94.30    150.40
08:48:37 AM    1    5.39    0.00    0.34    3.54    0.03    0.16    0.00   90.54    117.51
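
The iostat and mpstat runs above were made without an interval argument, so their first (and only) report is an average since boot. That probably explains why they show 2.34% iowait while top shows 39.9%wa. Below is a minimal Python sketch for measuring iowait over a short window by comparing two readings of the aggregate `cpu` line in /proc/stat. The function names and the 5-second default are assumptions for illustration.

```python
import time

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of jiffy counters."""
    values = line.split()[1:1 + len(FIELDS)]
    return dict(zip(FIELDS, (int(v) for v in values)))


def iowait_between(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' lines."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: b[name] - a[name] for name in a}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate CPU line (the first line of /proc/stat)."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_iowait(interval=5):
    """Take two readings `interval` seconds apart and return the iowait %."""
    before = read_cpu_line()
    time.sleep(interval)
    return iowait_between(before, read_cpu_line())
```

Calling `sample_iowait()` on the head node gives a figure for the current window. Running `iostat` or `mpstat` with an interval and count (for example `mpstat -P ALL 5 3`) gives the same kind of figure, since the reports after the first cover only the interval.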

 free
             total       used       free     shared    buffers     cached
Mem:       2054132    1274000     780132          0     215796     802872
-/+ buffers/cache:     255332    1798800
Swap:      1020116     799268     220848
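
As a check on the memory figures, here is a small Python sketch that reproduces the `-/+ buffers/cache` row from the `Mem:` row and works out how much of swap is in use. The numbers are copied from the free output above; the function names are hypothetical.

```python
# Values in kB, copied from the free output in this post.
MEM_TOTAL, MEM_USED, MEM_FREE = 2054132, 1274000, 780132
BUFFERS, CACHED = 215796, 802872
SWAP_TOTAL, SWAP_USED = 1020116, 799268


def used_by_applications(used, buffers, cached):
    """Memory in use once buffers and page cache are excluded."""
    return used - buffers - cached


def free_for_applications(free, buffers, cached):
    """Memory that is free or reclaimable from buffers and page cache."""
    return free + buffers + cached


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print("used by applications (kB):",
          used_by_applications(MEM_USED, BUFFERS, CACHED))
    print("free for applications (kB):",
          free_for_applications(MEM_FREE, BUFFERS, CACHED))
    print("swap in use (%%): %.1f" % percent(SWAP_USED, SWAP_TOTAL))
```

The two computed values match the 255332 and 1798800 in the `-/+ buffers/cache` row. So applications hold only about 12% of RAM, while roughly 78% of swap is in use.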



--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
