For the last week and a half we have been experiencing a slowdown on
our cluster's head node.
We are running Rocks 3 on a Red Hat OS, kernel 2.6.18-53.1.14.el5. The
last downtime was 174 days ago.
We have run successfully with 160 out of 176 compute nodes running
queued jobs; 16 of the cores are reserved for interactive jobs.
No job appears to be running on the head node. Over half the memory is
in use, but there is still plenty of memory left. I know that large
sftp transfers can slow the system down, but my users say that their
transfers are finished.
Where else should I look for the problem? There are no pending queue
jobs currently, but there were pending jobs during this last week and
a half.
top
Tasks: 142 total, 1 running, 141 sleeping, 0 stopped, 0 zombie
Cpu(s):  0.0%us,  2.6%sy,  0.0%ni, 54.5%id, 39.9%wa,  0.0%hi,  3.1%si,  0.0%st
Mem: 2054132k total, 1278532k used, 775600k free, 214924k buffers
Swap: 1020116k total, 799336k used, 220780k free, 801988k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2874 root      15   0     0    0    0 D    2  0.0  40:26.07 nfsd
 2871 root      15   0     0    0    0 S    1  0.0  40:07.30 nfsd
 2873 root      15   0     0    0    0 S    1  0.0  40:22.56 nfsd
 1697 root      10  -5     0    0    0 D    1  0.0  31:54.46 kjournald
 2872 root      15   0     0    0    0 D    1  0.0  40:50.30 nfsd
 2837 root      15   0 12712 1080  788 R    0  0.1   0:00.31 top
    1 root      15   0 10312   80   48 S    0  0.0   0:39.35 init
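For what it is worth, here is a small Python sketch that picks out the processes in state D (uninterruptible sleep, usually waiting on I/O) from the process rows above, alongside the 39.9%wa figure. The function name `uninterruptible` and the `TOP_ROWS` constant are my own, purely for illustration; the rows are copied from the top output in this post.

```python
# Process rows copied from the top output in this post.
TOP_ROWS = """\
2874 root 15 0 0 0 0 D 2 0.0 40:26.07 nfsd
2871 root 15 0 0 0 0 S 1 0.0 40:07.30 nfsd
2873 root 15 0 0 0 0 S 1 0.0 40:22.56 nfsd
1697 root 10 -5 0 0 0 D 1 0.0 31:54.46 kjournald
2872 root 15 0 0 0 0 D 1 0.0 40:50.30 nfsd
2837 root 15 0 12712 1080 788 R 0 0.1 0:00.31 top
1 root 15 0 10312 80 48 S 0 0.0 0:39.35 init
"""


def uninterruptible(rows):
    """Return (pid, command) pairs for rows whose state column is 'D'."""
    result = []
    for line in rows.splitlines():
        fields = line.split()
        # Column order: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) >= 12 and fields[7] == "D":
            result.append((int(fields[0]), fields[11]))
    return result


blocked = uninterruptible(TOP_ROWS)
print(blocked)
```

In this snapshot that selects two of the nfsd threads and kjournald. It is a single sample, so it may not be representative.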
qstat -g c
CLUSTER QUEUE    CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
----------------------------------------------------------
all.q              0.78      0    160    160      0      0
chemistry          0.98      0    128    128      0      0
group1             1.01     64      0     64      0      0
group3             0.00      0     32     32      0      0
group3-24hr        0.00      0     32     32      0      0
group3-2hr         0.00      0     32     32      0      0
mem16.q            0.94      0     16     16      0      0
mem4.q             0.91      0      8     32     24      0
mem8.q             1.01      0     64     80     16      0
group2             0.94     61      3     64      0      0
finger
Login    Name    Tty     Idle   Login Time     Office    Office Phone
acct2            pts/5   18:37  Mar 26 14:05
acct1            pts/2   12:57  Mar 25 12:14
acct4            pts/4          Mar 23 09:36
acct5            pts/8     10d  Mar  9 13:28
acct3            pts/1   19:03  Mar 26 13:41
ps -ef | grep sftp-server
root      2880  9542  0 08:47 pts/4    00:00:00 grep sftp-server
acct1    20700 20699  0 Mar25 ?        00:00:00 csh -c /usr/libexec/openssh/sftp-server
acct1    20829 20700  0 Mar25 ?        00:00:02 /usr/libexec/openssh/sftp-server
acct2    26707 26706  0 Mar26 ?        00:00:00 /usr/libexec/openssh/sftp-server
acct3    31337 31336  0 Mar26 ?        00:00:00 csh -c /usr/libexec/openssh/sftp-server
acct3    31466 31337  0 Mar26 ?        00:00:00 /usr/libexec/openssh/sftp-server
iostat
Linux 2.6.18-53.1.14.el5 (system.edu)   03/27/2009
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.83    0.00    0.41    2.34    0.00   92.42
Device:            tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
sda              37.34       309.74       582.06   4681624648   8797649022
mpstat -P ALL
Linux 2.6.18-53.1.14.el5 (system.edu) 03/27/2009
08:48:37 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:48:37 AM  all    4.83    0.00    0.31    2.34    0.02    0.09    0.00   92.42    267.91
08:48:37 AM    0    4.26    0.00    0.27    1.15    0.00    0.02    0.00   94.30    150.40
08:48:37 AM    1    5.39    0.00    0.34    3.54    0.03    0.16    0.00   90.54    117.51
free
             total       used       free     shared    buffers     cached
Mem:       2054132    1274000     780132          0     215796     802872
-/+ buffers/cache:     255332    1798800
Swap:      1020116     799268     220848
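To put numbers on the memory picture, here is a small Python sketch that works through the `free` figures above. The function name `memory_summary` and the keys it returns are my own, purely for illustration; the values are copied from the output in this post and are in KiB.

```python
def memory_summary(total, free, buffers, cached, swap_total, swap_used):
    """Summarise `free` output (all values in KiB) as percentages."""
    # Memory held by applications: total minus free, buffers and page cache.
    app_used = total - free - buffers - cached
    return {
        "app_used_kib": app_used,
        "app_used_pct": round(100.0 * app_used / total, 1),
        "swap_used_pct": round(100.0 * swap_used / swap_total, 1),
    }


# Values taken from the `free` output in this post.
summary = memory_summary(
    total=2054132, free=780132, buffers=215796, cached=802872,
    swap_total=1020116, swap_used=799268,
)
print(summary)
```

With these figures, applications hold roughly 12% of RAM (the same 255332 KiB shown on the "-/+ buffers/cache" line), while roughly 78% of swap is in use.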