Re: Many instances of CROND running, load raised (RH9)


 




On Friday, October 24, 2003, at 01:12 PM, Kenneth Goodwin wrote: ...

They all have PPIDs of 1, oddly enough. Good point.

Well, that is not right. That probably means either your system has a major issue here, or more likely they have died already, are orphaned from their parent CROND, and are now inherited by init (normal for this situation), and init has not reaped the dead children yet, probably because VMware is sucking up all the CPU cycles.

I restarted the crond, so this may be why they all have PPIDs of 1.


Though vmware isn't sucking up _all_ of the CPU cycles on the machine, top reports that one CPU is hardly being used:

CPU0 states: 0.0% user 100.0% system 0.0% nice 0.0% iowait 0.0% idle
CPU1 states: 0.3% user 4.4% system 0.0% nice 0.0% iowait 94.2% idle



seq:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total

Is vmware running full throttle while you are doing this?

Vmware is running.


I think it is the core part of your issue.
What happens to the sa1 times if you shut down VMware and then run it?

I can't shut down vmware right now, as it's in production.


What's the difference in elapsed real-time clock run times between sa1 on the affected systems and sa1 on the baseline system?

The difference between ack and seq is .012 - .007 = .005, so not much.


Another system, not running vmware at all, comes out exactly the same as machine ack:
# time /usr/lib/sa/sa1
/usr/lib/sa/sa1 0.00s user 0.00s system 0% cpu 0.007 total



It just forks another crond. I'm guessing it's forking so it can then call exec(2), but then exec(2) is hanging for some reason. I don't know why initd is calling crond though.

Init starts up all the subprocesses in the system via the rc scripts. It is the master parent process and should be the PPID for the initially launched CROND daemon only. It also becomes the PPID of children that are orphaned by their parents in a weird way. INIT->CROND->CRONDSUB->YOURCRONTASK

If the CRONDSUB process does a fork/exec/wait cycle on YOURCRONTASK instead of a straight exec call to replace itself with your process (which is probably the case, since CRONDSUB probably handles the logging in the cron log file as well), then if the CRONDSUB process dies before YOURCRONTASK dies, the new PPID of YOURCRONTASK will become init. I believe the master CROND is forking its children CRONDSUBs as separate process groups to protect itself against stray signalling to the process group that might kill it off.

The crond forking crond is weird. I suspect the second one (2416), off of the first (2415), is the actual entry from your crontab being launched, and based on your statements it should actually be showing up as sa1 at this point.


That's what I would expect too.


You could shut down the calls to sa1 and the rest of the accounting package in the cron files, resubmit the root cron table, and see if the problem follows sa1 and goes away. If it does, you will need to tear apart the SA package to see what is wrong: maybe a corrupted data or control file, or SA can't handle accounting records for TCP/IP with super large frame sizes (code oversights), if it is monitoring that subsystem as well as the disks.


This seems reasonable, I'll look through the code a bit.



My feeling is that the reported CPU usage for VMware is being skewed by the blocking CRONDs, as well as the load ave.

No, I think VMware is the issue. The CRONDs seem to be dead children; I don't know why they are spawning like this. Your list does not show start times; it could be that they are the normally launched 5-minute-interval SAs that are all hanging asleep in some deadlock state on a resource VMware has locked.

I would focus on why VMware is misbehaving here, it's eating
the CPU alive.


seq:~# ps aux|grep CRON|wc -l
51
seq:~# uptime
14:41:23 up 14 days, 3:54, 1 user, load average: 51.21, 51.16, 51.10


It seems suspicious to me that the load is 51.x on the system and there are exactly 51 CRONDs stuck.

BTW, all of these processes were started on Oct 21:

seq:~# ps aux|grep CROND
root 2415 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2426 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2442 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2453 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2455 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2466 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2480 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2491 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2502 0.0 0.0 4292 4 ? D Oct21 0:00 CROND




Processes going in and out of run state, basically. Could be VMware in a high-speed loop, blocking on something for a microsecond and then waking up on an alarm timeout (one of the functions of select(), as I recall, is to control fine-granularity microtimers and alarms as well as to await events on TCP/IP sockets).


(NB: VMware is a virtual machine emulator. It runs entirely in user land in our case, no LKMs etc.)


Note the timeouts: one is 10 ms ({0, 10000}) and the other 20 ms ({0, 20000}). Two loops, I'd imagine.

seq:~# strace -p 2003
gettimeofday({1067028221, 648861}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 656734}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1067028221, 675903}, NULL) = 0


select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
ioctl(12, 0xdf, 0) = 0
^^^


Here fd 12 came ready and thus ioctl(2) was called on it. I'm not sure what 0xdf means here though. I couldn't find anything obvious in the man pages/ioctl header files. Though a brute force grep of all the header files revealed these defines:

/usr/include/linux/hdreg.h:#define WIN_DOORUNLOCK 0xDF /* unlock door on removable drives */
/usr/include/linux/pmu.h:#define PMU_SYSTEM_READY 0xdf /* tell PMU we are awake */
/usr/include/linux/ps2esdi.h:#define INTG_ESDI_ID 0xdf9f
/usr/include/linux/serial167.h:#define CyMSVR2 (0xdf)


Continued...

gettimeofday({1067028221, 676547}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 686045}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
select(813, [812], NULL, NULL, {0, 100000}) = 1 (in [812], left {0, 100000})
read(812, "\0", 1) = 1
gettimeofday({1067028221, 697514}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 698545}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 706173}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 716348}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 726059}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 736216}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1067028221, 755856}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 2 (in [12 14], left {0, 10000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 756478}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 766361}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 2 (in [12 14], left {0, 10000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 767326}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 776048}, NULL) = 0
kill(2004, SIG_0) = 0
kill(2003, SIG_0) = 0
gettimeofday({1067028221, 776713}, NULL) = 0
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
open("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f8", O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 35
unlink("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f8") = 0


This open(2) then unlink(2) is just a security "trick" to prevent the file from being seen on the file system...

write(35, "X", 1)                       = 1
close(35)                               = 0
getpid()                                = 2003
ioctl(12, 0xdb, 0xbfffecc0)             = 0
ioctl(12, 0xda, 0xbfffece0)             = 0

Again, I don't know what 0xdb and 0xda are.

getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
write(813, "\0", 1) = 1
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [17], left {0, 20000})
close(17) = 0
mknod("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f9", S_IFIFO|0600) = 0
open("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f9", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 17
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
write(16, "6 TUPLES1a /vm/#14a92c078188eeaf"..., 141) = 141
getpid() = 2003
gettimeofday({1067028221, 786217}, NULL) = 0
write(813, "\277", 1) = 1


Doesn't appear to be hanging on anything obvious...


This is either a spin loop on a TCP/IP resource, or VMware is trying to get access to a resource and is using the select() microtimer interface to abort the call if it blocks for more than N microseconds or so.


See above trace. Also, this trace looks identical to the other machine traces of vmware.


Do VMware and SA have any system resource in common? I am

Not directly that I can tell. But SA would naturally be interested in all resources on the system.


SA isn't hung, as the machine's stats are current also...

not familiar
with how VMware operates or what it does? seems like

Virtual machine software.


something may be out of whack here that both SA and VMware need access to, and VMware is spinning its wheels and sucking CPU trying to gain access to that resource while SA is blocking on it, not dying in the approved timeframes, and is being joined in a few minutes by the next instance of SA launched by CROND, so you eventually collect a sleeping crowd of SAs in addition to the VMware issue.


Doesn't look like it...


Maybe an on-disk lock file left behind that an rc.d script should be removing?

Don't see any lock files...


I can't guess; I think you need to find out the actual resources that VMware and SA are blocking on. The WCHAN column in ps, which is a kernel virtual memory address of the resource being waited on, compared against the kernel symbol-table name-to-address map, may provide the needed clue to what resource is hanging these two.


seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep cron
 2415 ?        root     crond        - 1 do_fork
 2416 ?        root     crond        - 5 schedule_timeout
 2426 ?        root     crond        - 1 do_fork
 2427 ?        root     crond        - 5 schedule_timeout
 2442 ?        root     crond        - 1 do_fork
 2443 ?        root     crond        - 5 schedule_timeout
 2453 ?        root     crond        - 1 do_fork
 2454 ?        root     crond        - 5 schedule_timeout
 2455 ?        root     crond        - 1 do_fork
 2456 ?        root     crond        - 5 schedule_timeout
 2466 ?        root     crond        - 1 do_fork
 2467 ?        root     crond        - 5 schedule_timeout
 2480 ?        root     crond        - 1 do_fork
 2481 ?        root     crond        - 5 schedule_timeout
 2491 ?        root     crond        - 1 do_fork
...

seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep vmware
 1920 ?        vmware   start-se     - 4 wait4
 1942 ?        vmware   vmware       - 0 schedule_timeout
 1950 ?        vmware   vmware-v     - 4 schedule_timeout
 1951 ?        vmware   vmware-m     - 4 schedule_timeout
 1958 ?        vmware   vmware-v     - 5 schedule_timeout
 1959 ?        vmware   vmware-v     - 5 schedule_timeout
 1960 ?        vmware   vmware-v     - 5 schedule_timeout
 1961 ?        vmware   vmware-v     - 5 schedule_timeout
 2004 ?        vmware   vmware-m     - 4 schedule_timeout
 2005 ?        vmware   vmware-v     - 5 schedule_timeout
 2006 ?        vmware   vmware-v     - 5 schedule_timeout
 2007 ?        vmware   vmware-v     - 5 schedule_timeout
 2008 ?        vmware   vmware-v     - 5 schedule_timeout
 2034 ?        vmware   vmware-v     - 4 schedule_timeout
 2035 ?        vmware   vmware-m     - 4 schedule_timeout
 2036 ?        vmware   vmware-v     - 5 schedule_timeout
 2037 ?        vmware   vmware-v     - 5 schedule_timeout
 2038 ?        vmware   vmware-v     - 5 schedule_timeout
 2003 ?        vmware   vmware-v     - 4 schedule_timeout


I think it is the nm command that dumps symbol tables, but I am not sure if this is correct for Linux. I think there is also a file (System.map) under /boot that contains a map of addresses to loader symbol names, for use in dynamically loading kernel library objects.


ps does the lookup, as it turns out...


man or info select(2) to verify the microtimer feature.


It's using select to detect when fds come ready; it's not using it just as a timer.



Cove



