On Friday, October 24, 2003, at 01:12 PM, Kenneth Goodwin wrote: ...
They all have PPIDs of 1, oddly enough. Good point.
Well, that is not right. That probably means your system either has a major issue here or, more likely, they died already, are orphaned from their parent CROND, and are now inherited by init (normal for this situation), and init has not reaped the dead children yet, probably because VMware is sucking up all the CPU cycles.
I restarted the crond, so this may be why they all have PPIDs of 1.
VMware isn't sucking up _all_ of the CPU cycles on the machine, though; top reports that one CPU is hardly being used:
CPU0 states: 0.0% user 100.0% system 0.0% nice 0.0% iowait 0.0% idle
CPU1 states: 0.3% user 4.4% system 0.0% nice 0.0% iowait 94.2% idle
seq:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total
Is VMware running full throttle while you are doing this?
VMware is running.
I think it is the core part of your issue. What happens to the sa1 times if you shut down VMware and then run it?
I can't shut down VMware right now, as it's in production.
What's the difference in elapsed real (wall clock) run times between sa1 on the affected systems and the sa1 on the baseline system?
The difference between ack and seq is .012 - .007 = .005, so not much.
Another system, not running VMware at all, comes out exactly the same as machine ack:
# time /usr/lib/sa/sa1
/usr/lib/sa/sa1 0.00s user 0.00s system 0% cpu 0.007 total
It just forks another crond. I'm guessing it's forking so it can then call exec(2), but then exec(2) is hanging for some reason. I don't know why init is calling crond though.
Init starts up all the subprocesses in the system via the RC scripts. It is the master parent process and should be the PPID for the initially launched CROND daemon only. It becomes the PPID of children that are orphaned by their parents in a weird way. init -> CROND -> CRONDSUB -> YOURCRONTASK
If the CRONDSUB process does a fork/exec/wait cycle on YOURCRONTASK instead of a straight exec call to replace itself with your process (which is probably the case, since CRONDSUB probably handles the logging to the cron log file as well), and the CRONDSUB process dies before YOURCRONTASK does, then the new PPID of YOURCRONTASK becomes init. I believe the master CROND forks its CRONDSUB children as separate process groups to protect itself against stray signalling to the process group that might kill it off.
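A minimal sketch of that pattern (hypothetical code, not cron's actual source): an intermediate "CRONDSUB" forks the task and dies first, and the orphaned task gets reparented to init, so its PPID becomes 1.

    /* fork/exec/wait sketch: shows PPID becoming 1 when the middle
       process dies before its child. Compile: cc -o orphan orphan.c */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t sub = fork();                 /* plays the role of CRONDSUB */
        if (sub == 0) {
            pid_t task = fork();            /* plays the role of YOURCRONTASK */
            if (task == 0) {
                sleep(2);                   /* outlive CRONDSUB below */
                printf("task: ppid is now %d\n", (int)getppid()); /* 1 = init */
                execl("/bin/true", "true", (char *)NULL); /* straight exec */
                _exit(127);                 /* only reached if exec fails */
            }
            _exit(0);                       /* CRONDSUB dies before the task */
        }
        waitpid(sub, NULL, 0);              /* master CROND reaps CRONDSUB */
        sleep(3);                           /* let the orphan print first */
        return 0;
    }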
The crond forking crond is weird. I suspect the second one (2416), forked off the first (2415), is the actual entry from your crontab being launched, and based on your statements it should actually be showing up as sa1 at this point.
That's what I would expect too.
You could shut down the calls to sa1 and the rest of the accounting package in the cron files, resubmit the root cron table, and see if the problem follows sa1 and goes away. If it does, you will need to tear apart the SA package to see what is wrong: maybe a corrupted data or control file, or SA can't handle accounting records for TCP/IP with super-large frame sizes (a code oversight), if it is monitoring that subsystem as well as the disks.
This seems reasonable, I'll look through the code a bit.
My feeling is that the reported CPU usage for VMware is being skewed by the blocking CRONDs, as is the load average.
No, I think VMware is the issue. The CRONDs seem to be dead children; I don't know why they are spawning like this. Your list does not show start times; it could be that they are the normally launched five-minute-interval sa1 jobs that are all hanging asleep in some deadlock state on a resource VMware has locked.
I would focus on why VMware is misbehaving here, it's eating the CPU alive.
seq:~# ps aux|grep CRON|wc -l
51
seq:~# uptime
14:41:23 up 14 days, 3:54, 1 user, load average: 51.21, 51.16, 51.10
It seems suspicious to me that the load is 51.x on the system and there are exactly 51 CRONDs stuck. (On Linux, processes in uninterruptible D state count toward the load average, which would explain the match.)
BTW, all of these processes were started on Oct 21:
seq:~# ps aux|grep CROND
root 2415 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2426 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2442 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2453 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2455 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2466 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2480 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2491 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
root 2502 0.0 0.0 4292 4 ? D Oct21 0:00 CROND
Processes going in and out of run state, basically. It could be VMware in a high-speed loop, blocking on something for a microsecond and then waking up on an alarm timeout. (One of the functions of select(), as I recall, is to also provide fine-granularity microsecond timers and alarms, as well as to await events on TCP/IP sockets.)
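For what it's worth, a minimal sketch of that pure-timer use of select(2): pass empty fd sets and only a timeout, and the call just sleeps with microsecond granularity (the same {0, 10000}/{0, 20000} timevals show up in the trace below).

    /* usleep-style delay built on select(2)'s microsecond timeout */
    #include <stdio.h>
    #include <sys/select.h>

    static void usleep_via_select(long usec) {
        struct timeval tv;
        tv.tv_sec  = usec / 1000000;
        tv.tv_usec = usec % 1000000;
        select(0, NULL, NULL, NULL, &tv);   /* no fds: just wait it out */
    }

    int main(void) {
        usleep_via_select(20000);           /* ~20 ms, like {0, 20000} */
        puts("timer fired");
        return 0;
    }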
(NB: VMware is a virtual machine emulator. It runs entirely in user land in our case, no LKMs etc.)
Note the timeouts: one is at 10ms ({0, 10000}) and the other at 20ms ({0, 20000}). Two loops, I'd imagine.
seq:~# strace -p 2003
gettimeofday({1067028221, 648861}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 656734}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1067028221, 675903}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
ioctl(12, 0xdf, 0) = 0
^^^
Here fd 12 came ready and thus ioctl(2) was called on it. I'm not sure what 0xdf means here, though; I couldn't find anything obvious in the man pages or ioctl header files. A brute-force grep of all the header files did reveal these defines:
/usr/include/linux/hdreg.h:#define WIN_DOORUNLOCK 0xDF /* unlock door on removable drives */
/usr/include/linux/pmu.h:#define PMU_SYSTEM_READY 0xdf /* tell PMU we are awake */
/usr/include/linux/ps2esdi.h:#define INTG_ESDI_ID 0xdf9f
/usr/include/linux/serial167.h:#define CyMSVR2 (0xdf)
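For what it's worth, a quick way to sanity-check a request number is to run it through the _IOC_* decode macros from <linux/ioctl.h> (a hypothetical helper, not something from this thread). 0xdf decodes with a zero type byte and zero size, which looks more like a legacy or private driver command than any of the defines above.

    /* decode an ioctl request number into its dir/type/nr/size fields */
    #include <stdio.h>
    #include <linux/ioctl.h>   /* _IOC_DIR, _IOC_TYPE, _IOC_NR, _IOC_SIZE */

    int main(void) {
        unsigned int req = 0xdf;            /* the request seen in the trace */
        printf("dir=%u type=0x%02x nr=0x%02x size=%u\n",
               _IOC_DIR(req), _IOC_TYPE(req), _IOC_NR(req), _IOC_SIZE(req));
        return 0;
    }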
Continued...
gettimeofday({1067028221, 676547}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 686045}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
select(813, [812], NULL, NULL, {0, 100000}) = 1 (in [812], left {0, 100000})
read(812, "\0", 1) = 1
gettimeofday({1067028221, 697514}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 698545}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 706173}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 716348}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 726059}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 736216}, NULL) = 0
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1067028221, 755856}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 2 (in [12 14], left {0, 10000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 756478}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 766361}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 2 (in [12 14], left {0, 10000})
ioctl(12, 0xdf, 0) = 0
gettimeofday({1067028221, 767326}, NULL) = 0
select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 10000}) = 1 (in [14], left {0, 0})
gettimeofday({1067028221, 776048}, NULL) = 0
kill(2004, SIG_0) = 0
kill(2003, SIG_0) = 0
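^^^
A signal number of 0 delivers nothing; kill(2) just reports whether the target PID exists (and is signalable), so these look like liveness probes of the peer VMware process and of itself.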
gettimeofday({1067028221, 776713}, NULL) = 0
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
open("/var/run/vmware/vmware/vmx_2003/ ctxWQ2003.0x4027c6ac.000000000014e7f8", O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 35
unlink("/var/run/vmware/vmware/vmx_2003/ ctxWQ2003.0x4027c6ac.000000000014e7f8") = 0
This open(2) then unlink(2) is just a security "trick" to prevent the file from being seen on the file system...
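The idiom is easy to reproduce (a minimal sketch with a hypothetical /tmp path; the trace above does the same dance with FIFOs): once the name is unlinked, the inode survives only as long as an open descriptor does.

    /* open-then-unlink: file is usable but invisible in the filesystem */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void) {
        const char *path = "/tmp/hidden.demo";      /* hypothetical path */
        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }
        unlink(path);          /* directory entry gone; ls won't show it */
        write(fd, "X", 1);     /* ...but the fd still works */
        close(fd);             /* last reference dropped: inode freed */
        return 0;
    }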
write(35, "X", 1) = 1 close(35) = 0 getpid() = 2003 ioctl(12, 0xdb, 0xbfffecc0) = 0 ioctl(12, 0xda, 0xbfffece0) = 0
Again, I don't know what 0xdb and 0xda are.
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
write(813, "\0", 1) = 1
select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [], NULL, {0, 20000}) = 1 (in [17], left {0, 20000})
close(17) = 0
mknod("/var/run/vmware/vmware/vmx_2003/ ctxWQ2003.0x4027c6ac.000000000014e7f9", S_IFIFO|0600) = 0
open("/var/run/vmware/vmware/vmx_2003/ ctxWQ2003.0x4027c6ac.000000000014e7f9", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 17
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
getpid() = 2003
write(16, "6 TUPLES1a /vm/#14a92c078188eeaf"..., 141) = 141
getpid() = 2003
gettimeofday({1067028221, 786217}, NULL) = 0
write(813, "\277", 1) = 1
Doesn't appear to be hanging on anything obvious...
This is either a spin loop on a TCP/IP resource, or VMware is trying to get access to a resource and is using the select() microtimer interface to abort the call if it blocks for more than N microseconds or so.
See above trace. Also, this trace looks identical to the other machine traces of vmware.
Do VMware and SA have any system resource in common? I am
Not directly that I can tell. But SA would naturally be interested in all resources on the system.
SA isn't hung, as the machine's stats are current too...
not familiar with how VMware operates or what it does. Seems like
Virtual machine software.
something may be out of whack here that both SA and VMware need access to; VMware is spinning its wheels and sucking CPU trying to gain access to that resource, while SA is blocking on it, not dying in the approved timeframe, and is being joined a few minutes later by the next instance of SA launched by CROND, so you eventually collect a sleeping crowd of SAs in addition to the VMware issue.
Doesn't look like it...
Maybe an on-disk lock file left behind that an RC.d script should be removing?
Don't see any lock files...
I can't guess; I think you need to find out the actual resources that VMware and SA are blocking on. The WCHAN column in ps, which is a kernel virtual memory address of the resource being waited on, compared against the kernel symbol-table name-to-address map, may provide the needed clue to what resource is hanging these two.
seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep cron
2415 ?        root     crond        -    1 do_fork
2416 ?        root     crond        -    5 schedule_timeout
2426 ?        root     crond        -    1 do_fork
2427 ?        root     crond        -    5 schedule_timeout
2442 ?        root     crond        -    1 do_fork
2443 ?        root     crond        -    5 schedule_timeout
2453 ?        root     crond        -    1 do_fork
2454 ?        root     crond        -    5 schedule_timeout
2455 ?        root     crond        -    1 do_fork
2456 ?        root     crond        -    5 schedule_timeout
2466 ?        root     crond        -    1 do_fork
2467 ?        root     crond        -    5 schedule_timeout
2480 ?        root     crond        -    1 do_fork
2481 ?        root     crond        -    5 schedule_timeout
2491 ?        root     crond        -    1 do_fork
...
seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep vmware
1920 ?        vmware   start-se     -    4 wait4
1942 ?        vmware   vmware       -    0 schedule_timeout
1950 ?        vmware   vmware-v     -    4 schedule_timeout
1951 ?        vmware   vmware-m     -    4 schedule_timeout
1958 ?        vmware   vmware-v     -    5 schedule_timeout
1959 ?        vmware   vmware-v     -    5 schedule_timeout
1960 ?        vmware   vmware-v     -    5 schedule_timeout
1961 ?        vmware   vmware-v     -    5 schedule_timeout
2004 ?        vmware   vmware-m     -    4 schedule_timeout
2005 ?        vmware   vmware-v     -    5 schedule_timeout
2006 ?        vmware   vmware-v     -    5 schedule_timeout
2007 ?        vmware   vmware-v     -    5 schedule_timeout
2008 ?        vmware   vmware-v     -    5 schedule_timeout
2034 ?        vmware   vmware-v     -    4 schedule_timeout
2035 ?        vmware   vmware-m     -    4 schedule_timeout
2036 ?        vmware   vmware-v     -    5 schedule_timeout
2037 ?        vmware   vmware-v     -    5 schedule_timeout
2038 ?        vmware   vmware-v     -    5 schedule_timeout
2003 ?        vmware   vmware-v     -    4 schedule_timeout
I think it is the nm command that dumps symbol tables, but I am not sure if that is correct for Linux. I think there is also a map file under /boot (System.map) that contains a map of addresses to kernel symbol names for use in dynamically loading kernel objects.
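If you ever had to do that lookup by hand, a rough sketch (assuming the map lives at /boot/System.map; the exact path varies by distro and kernel) is to scan for the greatest symbol address at or below the WCHAN value:

    /* resolve a raw kernel address to the nearest preceding System.map
       symbol. Usage: ./wchan c0123456 */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "usage: %s <hexaddr>\n", argv[0]); return 1; }
        unsigned long target = strtoul(argv[1], NULL, 16);
        FILE *f = fopen("/boot/System.map", "r");   /* path is an assumption */
        if (!f) { perror("System.map"); return 1; }
        char line[256], best_name[128] = "?";
        unsigned long best = 0;
        while (fgets(line, sizeof line, f)) {
            unsigned long addr; char type, name[128];
            if (sscanf(line, "%lx %c %127s", &addr, &type, name) == 3
                && addr <= target && addr >= best) {
                best = addr;                 /* nearest symbol so far */
                strcpy(best_name, name);
            }
        }
        fclose(f);
        printf("0x%lx -> %s+0x%lx\n", target, best_name, target - best);
        return 0;
    }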
ps does the lookup, as it turns out...
man or info select(2) to verify the microtimer feature.
It's using select() to detect when fds come ready; it's not using it just as a timer.
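For reference, a minimal sketch of that readiness-plus-timeout pattern, mirroring the "= 1 (in [...])" versus "= 0 (Timeout)" returns in the trace above:

    /* wait up to 20 ms for stdin to become readable */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>

    int main(void) {
        fd_set rfds;
        struct timeval tv = {0, 20000};     /* 20 ms, like {0, 20000} */
        FD_ZERO(&rfds);
        FD_SET(STDIN_FILENO, &rfds);
        int n = select(STDIN_FILENO + 1, &rfds, NULL, NULL, &tv);
        if (n == 0)
            puts("timeout, nothing ready");     /* like "= 0 (Timeout)" */
        else if (n > 0 && FD_ISSET(STDIN_FILENO, &rfds))
            puts("fd ready for read");          /* like "= 1 (in [0])" */
        return 0;
    }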
Cove