Re: Looking for job which is causing a large work load

Margaret Doll <Margaret_Doll@xxxxxxxxx> · Tue, 16 Feb 2010 13:56:49 -0500

Thanks.  I will look at atop for my systems.

On Feb 16, 2010, at 1:03 PM, Alan A wrote:

Install atop. It is the best tool for tracking runaway processes, user
abuse, network utilization, etc.

On Tue, Feb 16, 2010 at 10:18 AM, Stainforth, Matthew (SD/DS) <Matthew.Stainforth@xxxxxx> wrote:

Memory doesn't appear to be a problem.  Run "free" and look at the amount
of free memory on the "-/+ buffers/cache" line.
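
A rough sketch of the arithmetic behind that line, using the Mem and Swap figures from the top output quoted further down in this thread. The function name is made up for illustration, and the formula is the one older procps versions of "free" use for that line; treat it as an approximation rather than an exact accounting.

```python
# Figures in kB, copied from the "Mem:" and "Swap:" lines of the quoted top output.
total_kb = 16099528
used_kb = 16063880
free_kb = 35648
buffers_kb = 487200
cached_kb = 12683800


def buffers_cache_line(used, free, buffers, cached):
    """Return (used, free) as shown on the '-/+ buffers/cache' line.

    Buffers and page cache are reclaimable, so they are subtracted from
    'used' and added to 'free'. (Hypothetical helper, not part of procps.)
    """
    return used - buffers - cached, free + buffers + cached


app_used_kb, app_free_kb = buffers_cache_line(used_kb, free_kb, buffers_kb, cached_kb)
print(f"used by applications:      {app_used_kb} kB ({app_used_kb / total_kb:.0%} of RAM)")
print(f"available to applications: {app_free_kb} kB")
```

With these numbers roughly 2.9 GB is really in use and about 13.2 GB is cache that the kernel can give back, which is why memory is unlikely to be the cause.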

Top is reporting 3419 processes total, and a load average of 600+ means
roughly that many are runnable or blocked in uninterruptible sleep.  What
does "ps auwwx" tell you?
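
One way to read that ps output: on Linux the load average counts processes in uninterruptible sleep (state D) as well as runnable ones (state R), so tallying the STAT column shows what is driving the number. The sketch below assumes the usual "ps aux" column layout. The sample text is invented for illustration (PIDs 20001 and 20002 and their commands are not from this system), and the function name is hypothetical.

```python
from collections import Counter

# Made-up sample in `ps aux` layout; the two D-state rows are invented.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  10348   632 ?        Ss    2009   0:01 init [5]
user1     8187  3.5  0.0  13356  3620 pts/3    S+   Feb11  44:48 top
user1    20001  0.0  0.0   5000   400 ?        D    Feb12   0:00 ls /mnt/data
user1    20002  0.0  0.0   5000   400 ?        D    Feb12   0:00 df
root     11922  3.8  0.0  13360  3624 pts/9    R+   10:50   0:00 top
"""


def summarize_states(ps_text):
    """Count processes per state and list those in R or D state."""
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    user_col = header.index("USER")
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    counts = Counter()
    load_contributors = []
    for line in lines[1:]:
        # COMMAND is the last column and may contain spaces, so cap the split.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # skip malformed or truncated rows
        state = fields[stat_col][0]  # first letter is the process state
        counts[state] += 1
        if state in ("R", "D"):
            load_contributors.append(
                (fields[pid_col], fields[user_col], state, fields[-1])
            )
    return counts, load_contributors


counts, contributors = summarize_states(SAMPLE)
print(dict(counts))
for pid, user, state, command in contributors:
    print(pid, user, state, command)
```

To try it on the real machine, save the output of "ps auwwx" to a file and pass the file's contents to summarize_states. A long list of D-state processes with the same command or owner would be the place to start looking.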

-----Original Message-----
From: redhat-list-bounces@xxxxxxxxxx [mailto:redhat-list-bounces@xxxxxxxxxx] On Behalf Of Margaret Doll
Sent: Tuesday, February 16, 2010 11:54 AM
To: General Red Hat Linux discussion list
Subject: Looking for job which is causing a large work load

We have an eight-processor system running Red Hat with kernel
2.6.18-128.1.6.el5xen.

We noticed the other day that sendmail was just queuing jobs and not
sending them.
mqueue, however, is empty.

That led us to look at the load average as a possible reason for the
failure of sendmail.
The QueueLA on sendmail is set to "8" as it should be.
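
To make the connection concrete: sendmail stops attempting immediate delivery and only queues once the load average reaches QueueLA. The sketch below is a deliberate simplification. It shows only the threshold comparison, while sendmail's actual decision also weighs message priority. The function name is made up for illustration, and the load figure is the 1-minute value from the top output in this message.

```python
QUEUE_LA = 8  # QueueLA value quoted in this message


def over_queue_threshold(load_avg_1min, queue_la=QUEUE_LA):
    """True when the 1-minute load average has reached the QueueLA threshold.

    Simplified: real sendmail also takes message priority into account.
    """
    return load_avg_1min >= queue_la


reported_load = 619.06  # 1-minute load average from the top output
print(over_queue_threshold(reported_load))  # True
print(f"load is about {reported_load / QUEUE_LA:.0f}x the threshold")
```

So with the load stuck above 600, a QueueLA of 8 would keep sendmail in queue-only behaviour until the load average comes back down.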

w and top show that we have a high load average and most of the memory
on the system is being used.  However, no job shows up in top using a
lot of memory.

top - 10:50:52 up 232 days, 15:18, 20 users,  load average: 619.06, 619.04, 618.98
Tasks: 3419 total,   1 running, 3417 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.3%us,  0.9%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16099528k total, 16063880k used,    35648k free,   487200k buffers
Swap:  6127608k total,   105920k used,  6021688k free, 12683800k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11917 user1     16   0 13424 3624  784 S  3.8  0.0   0:04.16 top
11922 root      16   0 13360 3624  776 R  3.8  0.0   0:00.39 top
 8187 user1     16   0 13356 3620  780 S  3.5  0.0  44:48.71 top
11895 user1     16   0 13452 3648  780 R  3.5  0.0   0:11.35 top
    1 root      15   0 10348  632  540 S  0.0  0.0   0:01.75 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:07.51 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:24.56 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:03.77 migration/1
    6 root      34  19     0    0    0 S  0.0  0.0   0:04.96 ksoftirqd/1

This machine is running long jobs from time to time and is hosting
large databases, so we don't want to reboot it.

How can we find the "job" that is using all the memory and bringing
the work load up to such a high level?  Is it the zombie that is
reported in top?

Thanks

w
 10:57:27 up 232 days, 15:25, 18 users,  load average: 619.19, 619.28, 619.13
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
user1    pts/2    lfps             15Jan10  4days  0.10s  0.10s -tcsh
user1    pts/3    lfps             Thu16   17:45m 44:55  44:54  top
user1    pts/4    lfps             15Jan10 25days  0.10s  0.10s -tcsh
user2    pts/5    gc166-mm.geo.bro Thu16    4days  0.02s  0.01s sshd: user2 [priv]
crism    pts/8    molybdenum       Fri13    3days  1:27   1:27  /usr/local/itt/idl70/bin/bin.linux.x8
root     pts/9    :0.0             23Oct09 116days  0.00s  0.00s ssh -l user1 moly
wjuser1  pts/10   porter2.geo.brow Mon10    6:01   0.11s  0.11s -tcsh
user2    pts/12   gc166-mm.geo.bro Fri14    0.00s  0.07s  0.00s sshd: user2 [priv]
root     :0       -                23Oct09 ?xdm?   2:24m  0.03s /usr/bin/gnome-session
user1    pts/16   lfps             Mon14    3:47  10.30s 10.24s top
user1    pts/14   quahog2.geo.brow Mon15    8:22  17.54s 17.48s top
root     pts/15   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
user1    pts/17   quahog2.geo.brow Mon14   18:19m  0.11s  0.11s -tcsh
root     pts/23   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
root     pts/24   :0.0             23Oct09 116days  0.01s  0.01s -bin/tcsh
user1    pts/28   lfps             15Jan10  4:08   0.12s  0.12s -tcsh
user1    pts/30   lfps             15Jan10  6:01   0.39s  0.00s sshd: user1 [priv]
root     pts/7    :0.0             23Oct09 116days  5.78s  0.00s -bin/tcsh

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list

--
Alan A.