RE: Many instances of CROND running, load raised (RH9)

>  I restarted the crond, so this may be why they all have PPIDs of 1.

Possibly. You would have orphaned anything that ignored or blocked the
termination signal sent to the process group. (Note SIGKILL itself can
never be caught or blocked; the init script's killproc normally tries
SIGTERM first, which a child can ignore.)
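
A minimal sketch of that failure mode, assuming the restart delivered a
catchable signal such as SIGTERM (SIGKILL cannot be ignored):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child = fork();
        if (child == 0) {
            signal(SIGTERM, SIG_IGN);   /* child refuses to die */
            sleep(5);                   /* outlive the parent */
            printf("orphan: ppid is now %d\n", (int)getppid());
            return 0;
        }
        sleep(1);                       /* let the child install SIG_IGN */
        kill(child, SIGTERM);           /* the "restart" signals the child */
        return 0;                       /* parent exits; child reparents to init */
    }

Run it and the survivor prints "ppid is now 1", which is exactly what
you are seeing in ps.
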
>
>  Though vmware isn't sucking up _all_ of the CPU cycles on the
>  machine, top reports that one CPU is hardly being used:

One process can only suck the life out of a single CPU. However, VMware
may be like Oracle processes - it runs all the time and eats CPUs, so
this might be its normal, expected behaviour. I don't know personally,
as I don't know this software.

>
>  I can't shut down vmware right now, as it's in production.

Okay. You can't take it out of the picture, so you will have to rely on
intuitive logical analysis (aka, guess what, if anything, could be in
conflict between the various processes).

>  >>
>  >>  It just forks another crond, I'm guessing it's forking so it can
>  >>  then call exec(2), but then exec(2) is hanging for some reason.
>  >>  I don't know why initd is calling crond though.

INITD is its real parent. INITD should be the parent of all startup
processes, as it starts them all through the RC scripts. The only
exception to this rule is master processes launched by INITD that then
do a FORK() and CHANGE PGRP() sequence to become their own master
process group. However, if such a master process dies leaving orphans,
its "unreaped" children will still become wards of the state, aka
INITD.

EXEC() won't hang; the program will die on an exec failure. You are
either in a loop waiting on a resource to come free, or sleep-blocked
waiting on that resource. But this does not make sense in a normal SA
environment unless VMWARE does disk drive locks as a semaphore, and SA
is obeying a locking protocol for disk access and hanging on a VMWARE
disk it should not even be looking at. (My Sun HA cluster uses an
entire 9GB hard drive as an inter-system exclusion lock - what a waste
of good disk.)
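
A hedged sketch of that fork()/change-pgrp() daemonization plus the
normal fork/exec pattern (illustrative code, not crond's actual source;
setsid() stands in for the CHANGE PGRP() step):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static void become_daemon(void)
    {
        if (fork() != 0)
            exit(0);              /* launcher returns to the RC script */
        setsid();                 /* become our own session/group leader */
    }

    static void run_job(char *const argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {
            execv(argv[0], argv); /* replaces the image on success */
            perror("execv");      /* reached only if exec failed... */
            _exit(127);           /* ...and then the child dies; it never hangs */
        }
        waitpid(pid, NULL, 0);    /* reap the child so nothing is left unreaped */
    }

    int main(void)
    {
        char *const job[] = { "/bin/true", NULL };
        become_daemon();
        run_job(job);
        return 0;
    }
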

>  > the SA package to see what is wrong, maybe a corrupted data or
>  > control file, or SA can't handle accounting records for TCP/IP
>  > with super large frame sizes (code oversights) if it is
>  > monitoring that subsystem as well as the disks
>  >
>
>  This seems reasonable, I'll look through the code a bit.
>
>  >>

>  seq:~# ps aux|grep CRON|wc -l
>        51
>  seq:~# uptime
>    14:41:23  up 14 days,  3:54,  1 user,  load average: 51.21, 51.16, 51.10
>
>  It seems suspicious to me that the load is 51.x on the system and
>  there are exactly 51 CRONDs stuck.

That would mean that the CRONDs/SAs are actually running in small
increments of time and should show up in TOP as a running process at
some short interval. (Note, though, that the Linux load average also
counts processes in uninterruptible sleep - state D in ps - so 51
crond processes blocked in the kernel would pin the load at 51 without
ever showing CPU time in top.)

>
>
>  (NB: VMware is a virtual machine emulator. It runs entirely in
>  user land in our case, no LKMs etc.)

That probably explains its high CPU usage: it never blocks on an
event, constantly polls for new work that is never there, and eats CPU
cycles on idle machines doing nothing. Sounds familiar - SUN WABI and
Oracle are major examples of such bad coding.

>
>  Note timeouts, one is at .1 and the other at .2. Two loops I'd
>  imagine.
>
>  seq:~# strace -p 2003
>  gettimeofday({1067028221, 648861}, NULL) = 0
>  select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
>  gettimeofday({1067028221, 656734}, NULL) = 0
>  select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
>  gettimeofday({1067028221, 675903}, NULL) = 0
>
>  select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
>  ioctl(12, 0xdf, 0)                      = 0

I don't remember your PID-to-process-name table, but this would appear
to be a micro-timer sleep loop, with someone checking the time each
time the timer goes off. It might be the CROND master process you are
ptracing; the sleep system call might be implemented in Linux with the
select microtimer facility. Correction - it appears to be vmware, from
your stuff below.
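
For reference, the timeout-only form of that select(2) micro-timer is
the classic sub-second sleep; a minimal sketch (the traced process also
watches descriptors, but the {0, 20000} timeout is the same mechanism):

    #include <stdio.h>
    #include <sys/select.h>
    #include <sys/time.h>

    static void microsleep(long usec)
    {
        struct timeval tv;
        tv.tv_sec  = usec / 1000000;
        tv.tv_usec = usec % 1000000;
        select(0, NULL, NULL, NULL, &tv);  /* no fds: just wait out the timeout */
    }

    int main(void)
    {
        printf("sleeping 20 ms, like the {0, 20000} timeout above\n");
        microsleep(20000);
        return 0;
    }
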

>
>  This open(2) then unlink(2) is just a security "trick" to prevent
>  the file from being seen on the file system...
>
>  Doesn't appear to be hanging on anything obvious...

If this is vmware, that's correct: vmware is not hanging, its master
service process is in an infinite loop. Probably normal behaviour.
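
On the open(2)-then-unlink(2) trick you mention: once unlinked the file
has no directory entry (invisible to ls), but the open descriptor keeps
the inode alive until close. A minimal illustration, with a made-up
path:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        char name[] = "/tmp/hidden.XXXXXX";  /* hypothetical, not vmware's */
        int fd = mkstemp(name);              /* create a unique temp file */
        if (fd == -1) { perror("mkstemp"); exit(1); }

        unlink(name);                        /* gone from the namespace... */
        write(fd, "still usable\n", 13);     /* ...but the fd still works */
        close(fd);                           /* inode freed only here */
        return 0;
    }
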

>  See above trace. Also, this trace looks identical to the other
>  machine traces of vmware.

Then VMWARE is behaving normally for itself and should not be related
to your SA issue. Back to square one.

>
>  >
>
>  seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep cron
>    2415 ?        root     crond        - 1 do_fork
>    2416 ?        root     crond        - 5 schedule_timeout
>    2426 ?        root     crond        - 1 do_fork
>    2427 ?        root     crond        - 5 schedule_timeout
>    2442 ?        root     crond        - 1 do_fork
>    2443 ?        root     crond        - 5 schedule_timeout
>    2453 ?        root     crond        - 1 do_fork
>    2454 ?        root     crond        - 5 schedule_timeout
>    2455 ?        root     crond        - 1 do_fork
>    2456 ?        root     crond        - 5 schedule_timeout
>    2466 ?        root     crond        - 1 do_fork
>    2467 ?        root     crond        - 5 schedule_timeout
>    2480 ?        root     crond        - 1 do_fork
>    2481 ?        root     crond        - 5 schedule_timeout
>    2491 ?        root     crond        - 1 do_fork

This sort of looks like you have multiple master or sub-master CRONDs
running. The children never do sleep timeouts normally; the normal
sequence is FORK -> EXEC -> new process image. So either you have a
bug in the CROND launch script, or some glitch forked off multiple
CROND parents in the RC script startup,

 OR

you have a watchdog script that is supposed to check for crond and
restart it if it dies, and that watchdog script is broken and keeps
starting up new CROND masters,

OR

the sub-child crond is in an infinite fork-exec loop, trying to exec()
a process out of the crontab file and failing (see the sketch below).
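
Purely illustrative (this is not vixie-cron source): if the code around
the fork retries instead of letting the child die after a failed
exec(), every pass leaves another process behind, like so:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        char *const job[] = { "/nonexistent/command", NULL };
        int tries, status;
        pid_t pid;

        for (tries = 0; tries < 3; tries++) {  /* imagine no bound: forks forever */
            pid = fork();
            if (pid == 0) {
                execv(job[0], job);            /* fails with ENOENT */
                perror("execv");
                _exit(127);                    /* correct: die, never retry exec */
            }
            waitpid(pid, &status, 0);
            fprintf(stderr, "job exited %d, forking again\n", WEXITSTATUS(status));
        }
        return 0;
    }
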

>
>  seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep vmware
>    1920 ?        vmware   start-se     - 4 wait4
>    1942 ?        vmware   vmware       - 0 schedule_timeout
>    1950 ?        vmware   vmware-v     - 4 schedule_timeout
>    1951 ?        vmware   vmware-m     - 4 schedule_timeout
>    1958 ?        vmware   vmware-v     - 5 schedule_timeout
>    1959 ?        vmware   vmware-v     - 5 schedule_timeout
>    1960 ?        vmware   vmware-v     - 5 schedule_timeout
>    1961 ?        vmware   vmware-v     - 5 schedule_timeout
>    2004 ?        vmware   vmware-m     - 4 schedule_timeout
>    2005 ?        vmware   vmware-v     - 5 schedule_timeout
>    2006 ?        vmware   vmware-v     - 5 schedule_timeout
>    2007 ?        vmware   vmware-v     - 5 schedule_timeout
>    2008 ?        vmware   vmware-v     - 5 schedule_timeout
>    2034 ?        vmware   vmware-v     - 4 schedule_timeout
>    2035 ?        vmware   vmware-m     - 4 schedule_timeout
>    2036 ?        vmware   vmware-v     - 5 schedule_timeout
>    2037 ?        vmware   vmware-v     - 5 schedule_timeout
>    2038 ?        vmware   vmware-v     - 5 schedule_timeout
>    2003 ?        vmware   vmware-v     - 4 schedule_timeout

I would guess that VMWARE has a master control thread and then service
threads/subprocesses that actually do the processing. This is how
Informix and Oracle work, and it is a similar kind of program.
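
That layout (my guess only) is the classic one-master, many-services
pattern; a toy sketch:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NWORKERS 4

    int main(void)
    {
        int i;
        for (i = 0; i < NWORKERS; i++) {
            if (fork() == 0) {        /* service subprocess */
                sleep(1);             /* a real worker would block on requests */
                _exit(0);
            }
        }
        while (wait(NULL) > 0)        /* master: supervise and reap workers */
            ;
        return 0;
    }
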


