Re: Many instances of CROND running, load raised (RH9)


 




On Friday, October 24, 2003, at 01:12 PM, Kenneth Goodwin wrote: ...

They all have PPIDs of 1, oddly enough. Good point.

Well, that is not right. That probably means either your system has a major issue here, or more likely they have died already, are orphaned from their parent CROND, and are now inherited by init (normal for this situation), and init has not reaped the dead children yet, probably because VMware is sucking up all the CPU cycles.

I restarted the crond, so this may be why they all have PPIDs of 1.


Though vmware isn't sucking up _all_ of the CPU cycles on the machine, top reports that one CPU is hardly being used:

CPU0 states: 0.0% user 100.0% system 0.0% nice 0.0% iowait 0.0% idle
CPU1 states: 0.3% user 4.4% system 0.0% nice 0.0% iowait 94.2% idle



seq:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total

Is vmware running full throttle while you are doing this?

Vmware is running.


I think it is the core part of your issue.
What happens to the sa1 times if you shut down VMware and then run it?

I can't shut down vmware right now, as it's in production.


What's the difference in elapsed real-time clock run times between sa1 on the affected systems and sa1 on the baseline system?

The difference between ack and seq is .012 - .007 = .005, so not much.


Another system, not running vmware at all, comes out exactly the same as machine ack:
# time /usr/lib/sa/sa1
/usr/lib/sa/sa1 0.00s user 0.00s system 0% cpu 0.007 total



It just forks another crond. I'm guessing it's forking so it can then call exec(2), but then exec(2) is hanging for some reason. I don't know why initd is calling crond though.

Init starts up all the subprocesses in the system via the rc scripts. It is the master parent process and should be the PPID for the initially launched CROND daemon only. It also becomes the PPID of children that are orphaned by their parents in a weird way. INIT->CROND->CRONDSUB->YOURCRONTASK

If the CRONDSUB process does a fork/exec/wait cycle on YOURCRONTASK instead of a straight exec call to replace itself with your process (which is probably the case, since CRONDSUB probably handles the logging in the cron log file as well), then if the CRONDSUB process dies before YOURCRONTASK dies, the new PPID of YOURCRONTASK will become init. I believe the master CROND is forking its children CRONDSUBs as separate process groups to protect itself against stray signalling to the process group that might kill it off.

The crond forking crond is weird. I suspect the second one (2416), off of the first (2415), is the actual entry from your crontab being launched, and based on your statements it should actually be showing up as sa1 at this point.


That's what I would expect too.


You could shut down the calls to sa1 and the rest of the accounting package in the cron files, resubmit the root cron table, and see if the problem follows sa1 and goes away. If it does, you will need to tear apart the SA package to see what is wrong: maybe a corrupted data or control file, or SA can't handle accounting records for TCP/IP with super large frame sizes (code oversights), if it is monitoring that subsystem as well as the disks.


This seems reasonable, I'll look through the code a bit.



My feeling is that the reported CPU usage for VMware is being skewed by the blocking CRONDs, as well as the load ave.

No, I think VMware is the issue. The CRONDs seem to be dead children; I don't know why they are spawning like this. Your list does not show start times; it could be that they are the normally launched 5-minute-interval SAs that are all hanging asleep in some deadlock state on a resource VMware has locked.

I would focus on why VMware is misbehaving here, it's eating
the CPU alive.


seq:~# ps aux|grep CRON|wc -l
51
seq:~# uptime
14:41:23 up 14 days, 3:54, 1 user, load average: 51.21, 51.16, 51.10


It seems suspicious to me that the load is 51.x on the system and there are exactly 51 CRONDs stuck.

BTW, all of these processes were started on Oct 21:

seq:~# ps aux|grep CROND
root 2415 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2426 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2442 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2453 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2455 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2466 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2480 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2491 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2502 0.0 0.0 4292 4 ? D Oct21 0:00 CROND




Processes going in and out of run state, basically. Could be VMware in a high-speed loop, blocking on something for a microsecond and then waking up on an alarm timeout (one of the functions of select(), as I recall, is to control fine-granularity microtimers and alarms as well as to await events on TCP/IP sockets).


(NB: VMware is a virtual machine emulator. It runs entirely in user land in our case, no LKMs etc.)


Note the timeouts: one is 10 ms ({0, 10000}) and the other 20 ms ({0, 20000}). Two loops, I'd imagine.

seq:~# strace -p 2003
gettimeofday({1067028221, 648861}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 656734}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1067028221, 675903}, NULL) = 0


select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
ioctl(12, 0xdf, 0) = 0
^^^


Here fd 12 came ready and thus ioctl(2) was called on it. I'm not sure what 0xdf means here though. I couldn't find anything obvious in the man pages/ioctl header files. Though a brute force grep of all the header files revealed these defines:

/usr/include/linux/hdreg.h:#define WIN_DOORUNLOCK 0xDF /* unlock door on removable drives */
/usr/include/linux/pmu.h:#define PMU_SYSTEM_READY 0xdf /* tell PMU we are awake */
/usr/include/linux/ps2esdi.h:#define INTG_ESDI_ID 0xdf9f
/usr/include/linux/serial167.h:#define CyMSVR2 (0xdf)


Continued...

gettimeofday({1067028221, 676547}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 686045}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
select(813, [812], NULL, NULL, {0, 100000}) = 1 (in [812], left {0, 100000})
read(812, "\0", 1) = 1
gettimeofday({1067028221, 697514}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 698545}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 706173}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 716348}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 726059}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 736216}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1067028221, 755856}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 2 (in [12 14], left {0, 10000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 756478}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 766361}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 2 (in [12 14], left {0, 10000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 767326}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 776048}, NULL) = 0
kill(2004, SIG_0) = 0
kill(2003, SIG_0) = 0
gettimeofday({1067028221, 776713}, NULL) = 0
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
open("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f8", O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 35
unlink("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f8") = 0


This open(2) then unlink(2) is just a security "trick" to prevent the file from being seen on the file system...

write(35, "X", 1)                       = 1
close(35)                               = 0
getpid()                                = 2003
ioctl(12, 0xdb, 0xbfffecc0)             = 0
ioctl(12, 0xda, 0xbfffece0)             = 0

Again, I don't know what 0xdb and 0xda are.

getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
write(813, "\0", 1) = 1
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [17], left {0, 20000})
close(17) = 0
mknod("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f9", S_IFIFO|0600) = 0
open("/var/run/vmware/vmware/vmx_2003/ctxWQ2003.0x4027c6ac.000000000014e7f9", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 17
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
write(16, "6 TUPLES1a /vm/#14a92c078188eeaf"..., 141) = 141
getpid() = 2003
gettimeofday({1067028221, 786217}, NULL) = 0
write(813, "\277", 1) = 1


Doesn't appear to be hanging on anything obvious...


This is either a spin loop on a TCP/IP resource, or VMware is trying to get access to a resource and is using the select() microtimer interface to abort the call if it blocks for more than N microseconds or so.


See above trace. Also, this trace looks identical to the other machine traces of vmware.


Do VMware and SA have any system resource in common? I am

Not directly that I can tell. But SA would naturally be interested in all resources on the system.


SA isn't hung, as the machine's stats are current also...

not familiar
with how VMware operates or what it does? seems like

Virtual machine software.


something may be out of whack here that both SA and VMware need access to, and VMware is spinning its wheels and sucking CPU trying to gain access to that resource while SA is blocking on it, not dying in the approved timeframes, and is being joined in a few minutes by the next instance of SA launched by CROND, so you eventually collect a sleeping crowd of SAs in addition to the VMware issue.


Doesn't look like it...


Maybe an on-disk lock file left behind that an rc.d script should be removing?

Don't see any lock files...


I can't guess; I think you need to find out the actual resources that VMware and SA are blocking on. The WCHAN column in ps, which is a kernel virtual memory address of the resource being waited on, compared against the kernel symbol-table name-to-address map, may provide the needed clue to what resource is hanging these two.


seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep cron
 2415 ?        root     crond        - 1 do_fork
 2416 ?        root     crond        - 5 schedule_timeout
 2426 ?        root     crond        - 1 do_fork
 2427 ?        root     crond        - 5 schedule_timeout
 2442 ?        root     crond        - 1 do_fork
 2443 ?        root     crond        - 5 schedule_timeout
 2453 ?        root     crond        - 1 do_fork
 2454 ?        root     crond        - 5 schedule_timeout
 2455 ?        root     crond        - 1 do_fork
 2456 ?        root     crond        - 5 schedule_timeout
 2466 ?        root     crond        - 1 do_fork
 2467 ?        root     crond        - 5 schedule_timeout
 2480 ?        root     crond        - 1 do_fork
 2481 ?        root     crond        - 5 schedule_timeout
 2491 ?        root     crond        - 1 do_fork
...

seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep vmware
 1920 ?        vmware   start-se     - 4 wait4
 1942 ?        vmware   vmware       - 0 schedule_timeout
 1950 ?        vmware   vmware-v     - 4 schedule_timeout
 1951 ?        vmware   vmware-m     - 4 schedule_timeout
 1958 ?        vmware   vmware-v     - 5 schedule_timeout
 1959 ?        vmware   vmware-v     - 5 schedule_timeout
 1960 ?        vmware   vmware-v     - 5 schedule_timeout
 1961 ?        vmware   vmware-v     - 5 schedule_timeout
 2004 ?        vmware   vmware-m     - 4 schedule_timeout
 2005 ?        vmware   vmware-v     - 5 schedule_timeout
 2006 ?        vmware   vmware-v     - 5 schedule_timeout
 2007 ?        vmware   vmware-v     - 5 schedule_timeout
 2008 ?        vmware   vmware-v     - 5 schedule_timeout
 2034 ?        vmware   vmware-v     - 4 schedule_timeout
 2035 ?        vmware   vmware-m     - 4 schedule_timeout
 2036 ?        vmware   vmware-v     - 5 schedule_timeout
 2037 ?        vmware   vmware-v     - 5 schedule_timeout
 2038 ?        vmware   vmware-v     - 5 schedule_timeout
 2003 ?        vmware   vmware-v     - 4 schedule_timeout


I think it is the nm command that dumps symbol tables, but I am not sure if this is correct for Linux. I think there is also a file (System.map) under /boot that contains a map of addresses to loader symbol names, for use in dynamically loading kernel library objects.


ps does the lookup, as it turns out...


man or info select(2) to verify the microtimer feature.


It's using select to detect when fds come ready; it's not using it just as a timer.



Cove



