RE: Looking for job which is causing a large work load

"Stainforth, Matthew (SD/DS)" <Matthew.Stainforth@xxxxxx> · Tue, 16 Feb 2010 12:18:28 -0400

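Memory doesn't appear to be a problem.  Most of what top counts as "used" is buffers and page cache, which the kernel gives back when applications need it.  Run "free" and look at the amount of free memory on the "-/+ buffers/cache" line.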
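
For what it's worth, here is a rough sketch of the arithmetic behind that line, using the Mem: and Swap: figures from the top output quoted below. The function names mem_available_kb and mem_really_used_kb are made up for illustration and are not part of any tool.

```shell
# Hypothetical helper: memory that is free or reclaimable (all values in kB).
# Arguments: free, buffers, cached.
mem_available_kb() {
    echo $(( $1 + $2 + $3 ))
}

# Hypothetical helper: memory actually held by processes
# (used minus buffers and cache). Arguments: used, buffers, cached.
mem_really_used_kb() {
    echo $(( $1 - $2 - $3 ))
}

# Figures taken from the quoted top output.
mem_available_kb 35648 487200 12683800        # free + buffers + cached
mem_really_used_kb 16063880 487200 12683800   # used - buffers - cached
```

That comes to 13206648k available and 2892880k really in use, which together add up to the 16099528k total.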

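Top is reporting 3419 processes total, and a load average of 600+ suggests that roughly that many are runnable or stuck in uninterruptible sleep (state D), even though top lists only 1 as running.  What does "ps auwwx" tell you?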
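
If the full listing is too long to read with 3419 processes, a summary by state may help. The sketch below is only one way to slice it: count_states and blocked_only are made-up helper names, and they read ps-style output on stdin so they can be tried on a saved listing first. On Linux the load average counts processes in uninterruptible sleep (state D) as well as runnable ones (state R), so those are the two states picked out here.

```shell
# Hypothetical helper: tally processes by the first letter of the STAT column.
# Expects one STAT value per line on stdin.
count_states() {
    awk '{ n[substr($1, 1, 1)]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Hypothetical helper: keep only processes in state D or R.
# Expects "PID STAT USER WCHAN COMMAND..." lines on stdin;
# a header line is dropped because "STAT" starts with S.
blocked_only() {
    awk '$2 ~ /^[DR]/'
}

# Usage on a live system (not run here):
#   ps -eo stat= | count_states
#   ps -eo pid,stat,user,wchan:32,args | blocked_only
```

Processes stuck in D usually show what they are waiting on in the WCHAN column. The single zombie is unlikely to be the cause: a zombie holds no memory and does not count toward the load average.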

-----Original Message-----
From: redhat-list-bounces@xxxxxxxxxx [mailto:redhat-list-bounces@xxxxxxxxxx] On Behalf Of Margaret Doll
Sent: Tuesday, February 16, 2010 11:54 AM
To: General Red Hat Linux discussion list
Subject: Looking for job which is causing a large work load

We have an eight-processor system running Red Hat, kernel
2.6.18-128.1.6.el5xen.

We noticed the other day that sendmail was just queuing jobs and not  
sending them.
mqueue, however, is empty.

That led us to look at the load average as a possible reason for the
failure of sendmail.
The QueueLA on sendmail is set to "8" as it should be.
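
As a rough illustration of why that setting matters here: once the load average goes above QueueLA, sendmail stops attempting immediate delivery and only queues. The sketch below shows just that comparison with the 1-minute figure from the top output. It is a simplification (sendmail's real decision also weighs message priority), and over_queue_la is a hypothetical helper name, not a sendmail command.

```shell
# Hypothetical helper: succeed (exit 0) if the load average exceeds the threshold.
# Arguments: 1-minute load average, QueueLA value.
over_queue_la() {
    awk -v la="$1" -v limit="$2" 'BEGIN { exit !(la + 0 > limit + 0) }'
}

# 619.06 is the 1-minute load average from top; 8 is our QueueLA.
if over_queue_la 619.06 8; then
    echo "load is above QueueLA: sendmail will only queue"
else
    echo "load is below QueueLA: sendmail will attempt delivery"
fi
```

With a load average of 619 against a limit of 8, this takes the first branch.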

w and top show that we have a high load average and most of the memory
on the system is being used.  However, no job shows up in top using a
lot of memory.

top - 10:50:52 up 232 days, 15:18, 20 users,  load average: 619.06, 619.04, 618.98
Tasks: 3419 total,   1 running, 3417 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.3%us,  0.9%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16099528k total, 16063880k used,    35648k free,   487200k buffers
Swap:  6127608k total,   105920k used,  6021688k free, 12683800k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11917 user1     16   0 13424 3624  784 S  3.8  0.0   0:04.16 top
11922 root      16   0 13360 3624  776 R  3.8  0.0   0:00.39 top
 8187 user1     16   0 13356 3620  780 S  3.5  0.0  44:48.71 top
11895 user1     16   0 13452 3648  780 R  3.5  0.0   0:11.35 top
    1 root      15   0 10348  632  540 S  0.0  0.0   0:01.75 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:07.51 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:24.56 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:03.77 migration/1
    6 root      34  19     0    0    0 S  0.0  0.0   0:04.96 ksoftirqd/1

This machine is running long jobs from time to time and is hosting  
large databases, so we don't want to reboot it.

How can we find the "job" that is using all the memory and bringing  
the work load up to such a high level?  Is it the zombie that is  
reported in top?

Thanks

w
 10:57:27 up 232 days, 15:25, 18 users,  load average: 619.19, 619.28, 619.13
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
user1    pts/2    lfps             15Jan10  4days  0.10s  0.10s -tcsh
user1    pts/3    lfps             Thu16   17:45m 44:55  44:54  top
user1    pts/4    lfps             15Jan10 25days  0.10s  0.10s -tcsh
user2    pts/5    gc166-mm.geo.bro Thu16    4days  0.02s  0.01s sshd: user2 [priv]
crism    pts/8    molybdenum       Fri13    3days  1:27   1:27  /usr/local/itt/idl70/bin/bin.linux.x8
root     pts/9    :0.0             23Oct09 116days  0.00s  0.00s ssh -l user1 moly
wjuser1  pts/10   porter2.geo.brow Mon10    6:01   0.11s  0.11s -tcsh
user2    pts/12   gc166-mm.geo.bro Fri14    0.00s  0.07s  0.00s sshd: user2 [priv]
root     :0       -                23Oct09 ?xdm?   2:24m  0.03s /usr/bin/gnome-session
user1    pts/16   lfps             Mon14    3:47  10.30s 10.24s top
user1    pts/14   quahog2.geo.brow Mon15    8:22  17.54s 17.48s top
root     pts/15   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
user1    pts/17   quahog2.geo.brow Mon14   18:19m  0.11s  0.11s -tcsh
root     pts/23   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
root     pts/24   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
user1    pts/28   lfps             15Jan10  4:08   0.12s  0.12s -tcsh
user1    pts/30   lfps             15Jan10  6:01   0.39s  0.00s sshd: user1 [priv]
root     pts/7    :0.0             23Oct09 116days  5.78s  0.00s -bin/tcsh

-- 
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list
