> I restarted the crond, so this may be why they all have PPIDs of 1.

Possibly. You would have orphaned anything that did not obey, or
blocked, the termination signal sent to the process group.

> Though vmware isn't sucking up _all_ of the CPU cycles on the machine,
> top reports that one CPU is hardly being used:

One process can only suck the life out of a single CPU. However, VMWARE
may be like Oracle processes - it runs all the time and eats CPUs - so
this might be its normal, expected behaviour. I don't know personally,
as I don't know this software.

> I can't shut down vmware right now, as it's in production.

Okay. We cannot take it out of the picture, so you will have to rely on
intuitive logical analysis (aka, guess what, if anything, could be in
conflict between the various processes).

> It just forks another crond. I'm guessing it's forking so it can then
> call exec(2), but then exec(2) is hanging for some reason. I don't
> know why initd is calling crond though.

INITD is its real parent. INITD should be the parent of all startup
processes, as it starts them all through the RC scripts. The only
exception to this rule is master processes that are launched by INITD
and then do a fork() and change-pgrp() sequence to become their own
master process group. However, if such a master process dies leaving
orphans, its "unreaped" children will still become wards of the state,
aka INITD.

exec() won't hang; the program will die on an exec failure. You are
either in a loop waiting on a resource to come free, or sleep-blocked
waiting on that resource. But this does not make sense in a normal SA
environment, unless VMWARE does disk drive locks as a semaphore and SA
is obeying a locking protocol for disk access and hanging on a VMWARE
disk it should not even be looking at. (My SUN HA cluster uses an
entire 9GB hard drive as an inter-system exclusion lock - what a waste
of good disk.)
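For reference, the fork()-then-change-pgrp sequence I mean looks
roughly like this. This is only a minimal sketch of the generic idiom
(using setsid(), the modern equivalent of the old setpgrp() dance), not
the actual code of your crond:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* How a master daemon detaches itself at startup: the parent
     * (run from the RC scripts, so a child of INITD) exits at once,
     * and the child becomes its own session/process-group leader. */
    static void daemonize(void)
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid > 0)
            exit(0);        /* parent returns control to the RC script */

        if (setsid() < 0) { /* become our own master process group */
            perror("setsid");
            exit(1);
        }
        /* ... daemon main loop follows here ... */
    }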
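And on the exec() point: in the normal fork/exec idiom there is nowhere
for exec() to hang - it either replaces the process image or returns an
error, and a sane child dies immediately on that error. A sketch of the
idiom (again generic, not crond's actual source):

    #include <sys/wait.h>
    #include <unistd.h>

    /* Classic way a cron-style master runs one job. Note execl()
     * only ever returns on failure, and the child then _exit()s -
     * it does not hang and it does not retry in a loop. */
    int run_job(const char *cmd)
    {
        int status;
        pid_t pid = fork();

        if (pid < 0)
            return -1;                 /* fork failed, no child made */
        if (pid == 0) {
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);                /* exec failed: die, don't loop */
        }
        waitpid(pid, &status, 0);      /* parent reaps, no zombie left */
        return status;
    }

If a broken crond retried that exec() in a loop instead of _exit()ing,
you would get exactly the fork/exec churn I describe further down.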
> > the SA package to see what is wrong, maybe a corrupted data or
> > control file, or SA can't handle accounting records for TCP/IP with
> > super large frame sizes (code oversights) if it is monitoring that
> > subsystem as well as the disks
>
> This seems reasonable, I'll look through the code a bit.
>
> seq:~# ps aux|grep CRON|wc -l
> 51
> seq:~# uptime
>  14:41:23 up 14 days, 3:54, 1 user, load average: 51.21, 51.16, 51.10
>
> It seems suspicious to me that the load is 51.x on the system and
> there are exactly 51 CRONDs stuck.

That would mean that the CRONDs/SAs are actually running in small
increments of time and should show up in TOP as running processes at
some short interval. (Either that, or they are stuck in uninterruptible
sleep, which also counts toward the load average on Linux.)

> (NB: VMware is a virtual machine emulator. It runs entirely in user
> land in our case, no LKMs etc.)

That probably explains its high CPU usage: it never blocks on an event,
it constantly polls for new work that is never there, and it eats CPU
cycles on idle machines doing nothing. Sounds familiar - SUN WABI and
Oracle are major examples of such bad coding.

> Note timeouts, one is at .1 and the other at .2. Two loops I'd
> imagine.
>
> seq:~# strace -p 2003
> gettimeofday({1067028221, 648861}, NULL) = 0
> select(31, [7 8 9 10 11 12 14 16 17 19 20 22 23 25 27 29 30], [],
>        NULL, {0, 10000}) = 1 (in [14], left {0, 0})
> gettimeofday({1067028221, 656734}, NULL) = 0
> select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [],
>        NULL, {0, 20000}) = 0 (Timeout)
> gettimeofday({1067028221, 675903}, NULL) = 0
> select(31, [7 8 9 10 11 12 16 17 19 20 22 23 25 27 29 30], [],
>        NULL, {0, 20000}) = 1 (in [12], left {0, 20000})
> ioctl(12, 0xdf, 0) = 0

I don't remember your PID-to-process-name table, but this would appear
to be a micro-timer sleep loop, with someone checking the time whenever
the timer goes off. It might be the CROND master process you are
ptracing - the sleep call might be implemented in Linux on top of the
select() micro-timer facility. Correction: it appears to be vmware,
from your stuff below.

> This open(2) then unlink(2) is just a security "trick" to prevent the
> file from being seen on the file system...
>
> Doesn't appear to be hanging on anything obvious...

If this is vmware, that's correct: vmware is not hanging, its master
service process is in an infinite loop, and that is probably normal
behaviour for it.
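That infinite loop is visible in the strace above: the
gettimeofday()/select() pattern is the classic select()-as-micro-timer
event loop - wait on a set of descriptors for at most 10-20ms, then
check the clock and run any periodic work. A rough sketch of the shape
of such a loop (my guess at the idiom, certainly not vmware's actual
code):

    #include <sys/select.h>
    #include <sys/time.h>

    /* Poll a descriptor set with a tiny timeout, forever. Even when
     * completely idle this wakes up ~50 times a second, which is why
     * such a process always shows some CPU use in top. */
    void service_loop(const int *fds, int nfds)
    {
        for (;;) {
            fd_set rset;
            struct timeval tv = { 0, 20000 };  /* 20ms, as in the trace */
            int i, maxfd = -1;

            FD_ZERO(&rset);
            for (i = 0; i < nfds; i++) {
                FD_SET(fds[i], &rset);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }

            int ready = select(maxfd + 1, &rset, NULL, NULL, &tv);
            struct timeval now;
            gettimeofday(&now, NULL);  /* check the clock each pass,
                                        * just as the trace shows */
            if (ready > 0) {
                /* ... service whichever descriptors are readable ... */
            }
            /* ... run any timers that have come due ... */
        }
    }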
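And the open()-then-unlink() trick quoted above is the standard
anonymous temp-file idiom: once the name is unlinked, no other process
can find or open the file, but the descriptor stays valid until it is
closed. A minimal sketch (the path name here is made up, and real code
should use mkstemp() to create the file safely):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Create a scratch file, then immediately remove its name.
     * The blocks are freed only when the last descriptor closes. */
    int anonymous_tmpfile(void)
    {
        const char *path = "/tmp/scratch.demo";  /* hypothetical name */
        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        unlink(path);   /* gone from the filesystem... */
        return fd;      /* ...but still fully usable through fd */
    }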
> See above trace. Also, this trace looks identical to the other
> machine traces of vmware.

Then VMWARE is behaving normally for itself, and should not be related
to your SA issue. Back to square one.

> seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep cron
> 2415 ?    root   crond     -  1 do_fork
> 2416 ?    root   crond     -  5 schedule_timeout
> 2426 ?    root   crond     -  1 do_fork
> 2427 ?    root   crond     -  5 schedule_timeout
> 2442 ?    root   crond     -  1 do_fork
> 2443 ?    root   crond     -  5 schedule_timeout
> 2453 ?    root   crond     -  1 do_fork
> 2454 ?    root   crond     -  5 schedule_timeout
> 2455 ?    root   crond     -  1 do_fork
> 2456 ?    root   crond     -  5 schedule_timeout
> 2466 ?    root   crond     -  1 do_fork
> 2467 ?    root   crond     -  5 schedule_timeout
> 2480 ?    root   crond     -  1 do_fork
> 2481 ?    root   crond     -  5 schedule_timeout
> 2491 ?    root   crond     -  1 do_fork

This sort of looks like you have multiple master or sub-master CRONDs
running. The children never normally sit in sleep timeouts; the normal
sequence is fork() --> exec() --> new process image. So either:

  - there is a bug in the CROND launch script, or some glitch in the RC
    script startup forked off multiple CROND parents; OR
  - there is a watchdog script that is supposed to check for crond and
    restart it if it dies, and that watchdog is broken and keeps
    starting up new CROND masters; OR
  - the subchild crond is in an infinite fork/exec loop, trying to
    exec() a process out of the crontab file and failing.

> seq:~# ps -eo pid,tt,user,fname,tmout,f,wchan|grep vmware
> 1920 ?    vmware start-se  -  4 wait4
> 1942 ?    vmware vmware    -  0 schedule_timeout
> 1950 ?    vmware vmware-v  -  4 schedule_timeout
> 1951 ?    vmware vmware-m  -  4 schedule_timeout
> 1958 ?    vmware vmware-v  -  5 schedule_timeout
> 1959 ?    vmware vmware-v  -  5 schedule_timeout
> 1960 ?    vmware vmware-v  -  5 schedule_timeout
> 1961 ?    vmware vmware-v  -  5 schedule_timeout
> 2004 ?    vmware vmware-m  -  4 schedule_timeout
> 2005 ?    vmware vmware-v  -  5 schedule_timeout
> 2006 ?    vmware vmware-v  -  5 schedule_timeout
> 2007 ?    vmware vmware-v  -  5 schedule_timeout
> 2008 ?    vmware vmware-v  -  5 schedule_timeout
> 2034 ?    vmware vmware-v  -  4 schedule_timeout
> 2035 ?    vmware vmware-m  -  4 schedule_timeout
> 2036 ?    vmware vmware-v  -  5 schedule_timeout
> 2037 ?    vmware vmware-v  -  5 schedule_timeout
> 2038 ?    vmware vmware-v  -  5 schedule_timeout
> 2003 ?    vmware vmware-v  -  4 schedule_timeout

I would guess that VMWARE has a master control thread, and then service
threads/subprocesses that actually do the processing. This is how
Informix and Oracle work, and vmware is a similar sort of program.

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list