EMBEDDED RESPONSE

> -----Original Message-----
> From: redhat-list-admin@xxxxxxxxxx
> [mailto:redhat-list-admin@xxxxxxxxxx] On Behalf Of Cove Schneider
> Sent: Friday, October 24, 2003 3:24 PM
> To: redhat-list@xxxxxxxxxx
> Subject: Re: Many instances of CROND running, load raised (RH9)
>
>
> Hi Kenneth,
>
> Thanks for the response.
>
> On Friday, October 24, 2003, at 07:08 AM, Kenneth Goodwin wrote:
>
> > The CROND's without a PPID of 1 are the children of the
> > master CROND process.
> > They are trying to launch a command for you. Determine what
> > that command is first.
> > There is no info herein that defines that.
>
> They all have PPID's of 1, oddly enough. Good point.

Well, that is not right. That probably means either your system has a
major issue here or, more likely, that they have already died, are
orphaned from their parent CROND, and have been inherited by init
(normal for this situation) -- and init has not reaped the dead children
yet, probably because VMware is sucking up all the CPU cycles.

> > 1) check the log file for CRON. See if you can determine
> > what process is being launched here.
>
> The only thing listed in the cron log is the system accounting shell
> script. When I run it manually it executes very fast, but with a high
> system time:
>
> seq:~# time /usr/lib/sa/sa1 1 1
> /usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total

Is VMware running full throttle while you are doing this? I think it is
the core of your issue. What happens to the sa1 times if you shut down
VMware and then run it? What's the difference in elapsed wall-clock run
time between sa1 on the affected system and sa1 on the baseline system?
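Something like this would give you a clean before/after comparison (I am
assuming the VMware init script is /etc/init.d/vmware here -- adjust the
name to whatever your install actually uses):

seq:~# time /usr/lib/sa/sa1 1 1      # baseline, VMware running
seq:~# /etc/init.d/vmware stop       # script path is an assumption
seq:~# time /usr/lib/sa/sa1 1 1      # repeat with VMware down

If the system time and that 173% CPU figure drop back to something sane
on the second run, VMware is your culprit.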
> compared to another machine:
>
> ack:~# time /usr/lib/sa/sa1 1 1
> /usr/lib/sa/sa1 1 1  0.00s user 0.00s system 0% cpu 0.007 total
>
> Oct 21 07:01:00 seq CROND[1533]: (root) CMD (run-parts /etc/cron.hourly)
> Oct 21 07:10:00 seq CROND[1545]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:20:00 seq CROND[1561]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:30:00 seq CROND[1573]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:40:00 seq CROND[1586]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:50:00 seq CROND[1603]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:00:00 seq CROND[1615]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:01:00 seq CROND[1618]: (root) CMD (run-parts /etc/cron.hourly)
> Oct 21 08:10:00 seq CROND[1629]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:20:00 seq CROND[1644]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:30:00 seq CROND[1656]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:40:00 seq CROND[1668]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:50:00 seq CROND[1685]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 09:00:00 seq CROND[1697]: (root) CMD (/usr/lib/sa/sa1 1 1)
>
> > 2) don't think the problem really relates to the MTU change, but a
> > 9000 MTU seems kinda high for gigabit. I could be wrong.
> > I would have to check, but I vaguely recall a recommended max MTU
> > on my Antares GE cards of around 4KB.
> > The change probably triggered an existing bug in one of the
> > programs cron is spawning for you. You might want to recheck
> > that setting for correctness. MTU's are usually in multiples of 1K
> > less the ethernet MAC packet header size, as I recall. Just make
> > sure you are not setting above the maximum limit allowed.
>
> That's what I was thinking, maybe it triggered a bug or a legacy issue
> in something.
>
> The 9000 MTU is for Jumbo Frames, which I believe are 9000 bytes and
> up. I was able to verify that it worked on the machine using ping and
> doing packet captures (tg3 driver w/ bcom cards).
>
> See:
> http://sd.wareonearth.com/~phil/jumbo.html
> http://www.uoregon.edu/~joe/jumbo-clean-gear.html
>
> > 3) a ps -ef or equivalent will show you the children and parents.
> > Do a ps -ef | grep [a crond child pid] and see if there is a
> > child process hanging off the CROND sub-child process,
> > using some sample PIDs from your ps below:
> > master CROND (PID = 2415 (example), PPID = 1) -forks->
> > child CROND service process (PID = 2426 from your list, PPID = 2415).
> > So what process, if any, is 2426 forking off here?
>
> It just forks another crond, I'm guessing it's forking so it can then
> call exec(2), but then exec(2) is hanging for some reason. I don't know
> why init is calling crond though.

Init starts up all the subprocesses in the system via the RC scripts. It
is the master parent process and should be the PPID of the initially
launched CROND daemon only. It also becomes the PPID of children that are
orphaned by their parents. The chain looks like this:

    init -> CROND -> CRONDSUB -> YOURCRONTASK

If the CRONDSUB process does a fork/exec/wait cycle on YOURCRONTASK
instead of a straight exec() call to replace itself with your process --
which is probably the case, since the CRONDSUB probably handles the
logging to the cron log file as well -- then if the CRONDSUB process dies
before YOURCRONTASK does, the new PPID of YOURCRONTASK becomes init. I
believe the master CROND forks its CRONDSUB children as separate process
groups to protect itself against stray signalling to the process group
that might kill it off.

The crond-forking-crond pairs are weird. I suspect the second one (2416)
off of the first (2415) is the actual entry from your crontab being
launched, and based on your statements it should actually be showing up
as sa1 at this point. You could shut down the calls to sa1 and the rest
of the accounting package in the cron files, resubmit the root cron
table, and see if the problem follows sa1 and goes away. If it does, you
will need to tear apart the SA package to see what is wrong -- maybe a
corrupted data or control file, or SA can't handle accounting records for
TCP/IP with super large frame sizes (a code oversight), if it is
monitoring that subsystem as well as the disks.
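For example -- assuming the sysstat package put its entries in
/etc/cron.d/sysstat, as RH9 normally does (they may be in root's crontab
on your box instead, so locate them first):

seq:~# grep -rl sa1 /etc/cron.d /etc/crontab /var/spool/cron
seq:~# vi /etc/cron.d/sysstat     # put a '#' in front of the sa1/sa2 lines
seq:~# service crond restart      # make sure crond picks up the change

If the stuck CRONDs stop piling up after that, the problem follows sa1.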
> Here's the tree:
>
> # pstree -p
> init(1)-+-aacraid(19)
>         |-atd(1872)
>         |-bdflush(11)
>         |-crond(2415)---crond(2416)
>         |-crond(2426)---crond(2427)
>         |-crond(2442)---crond(2443)
>         |-crond(2453)---crond(2454)
>         |-crond(2455)---crond(2456)
>         |-crond(2466)---crond(2467)
>         |-crond(2480)---crond(2481)
>         |-crond(2491)---crond(2492)
>         |-crond(2502)---crond(2503)
>         |-crond(2515)---crond(2516)
>         |-crond(2529)---crond(2530)
>         |-crond(2531)---crond(2532)
>         |-crond(2542)---crond(2543)
>         |-crond(2553)---crond(2554)
>         |-crond(2567)---crond(2568)
>         |-crond(2578)---crond(2579)
>         |-crond(2591)---crond(2592)
>         |-crond(2606)---crond(2607)
>         |-crond(2608)---crond(2609)
>         |-crond(2619)---crond(2620)
>         |-crond(2630)---crond(2631)
>         |-crond(2644)---crond(2645)
>         |-crond(2655)---crond(2656)
>         |-crond(2668)---crond(2669)
>         |-crond(2682)---crond(2683)
>         |-crond(2684)---crond(2685)
>         |-crond(2695)---crond(2696)
>         |-crond(2706)---crond(2707)
>         |-crond(2720)---crond(2721)
>         |-crond(2731)---crond(2732)
>         |-crond(2744)---crond(2745)
>         |-crond(2758)---crond(2759)
>         |-crond(2760)---crond(2761)
>         |-crond(2771)---crond(2772)
>         |-crond(2782)---crond(2783)
>         |-crond(2796)---crond(2797)
>         |-crond(2818)---crond(2819)
>         |-crond(2832)---crond(2833)
>         |-crond(2834)---crond(2835)
>         |-crond(2845)---crond(2846)
>         |-crond(2856)---crond(2857)
>         |-crond(2870)---crond(2871)
>         |-crond(2881)---crond(2882)
>         |-crond(2892)---crond(2893)
>         |-crond(2907)---crond(2908)
>         |-crond(2909)---crond(2910)
>         |-crond(2920)---crond(2921)
>         |-crond(2931)---crond(2932)
>         |-crond(2943)---crond(2944)
>         |-crond(2807)---crond(2808)
>         |-crond(9823)
> ...
>
> # ps axj
>  PPID   PID  PGID   SID TTY TPGID STAT   UID   TIME COMMAND
>     0     1     0     0 ?      -1 S        0   0:17 init
> ...
>     1  2415  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2426  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2442  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2453  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2455  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2466  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2480  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2491  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2502  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2515  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2529  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2531  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2542  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2553  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2567  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2578  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2591  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2606  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2608  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2619  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2630  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2644  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2655  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2668  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2682  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2684  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2695  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2706  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2720  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2731  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2744  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2758  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2760  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2771  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2782  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2796  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2818  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2832  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2834  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2845  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2856  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2870  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2881  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2892  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2907  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2909  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2920  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2931  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2943  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2807  1547  1547 ?      -1 D        0   0:00 CROND
>     1  9823  9823  9823 ?      -1 S        0   0:00 crond
>
> This last entry here is the "real" crond. I've restarted it a few
> times, that's why the PID is so high.
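Something like this would show the start time and the kernel wait channel
for each of those D-state CRONDs (output format keys per the procps ps(1)
man page -- adjust if your ps differs):

seq:~# ps axo pid,ppid,stat,start,wchan:25,comm | grep -i crond

The STARTED column should tell you whether they line up with the cron
launches in the log, and WCHAN names the kernel resource each one is
sleeping on.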
> > 4) Your real problem seems to be that VMware PID 2003 is absorbing
> > 99% of the CPU doing "something". I don't know from your info here
> > if this is a CRON-launched process or if this is a separate process
> > launched by init. But this is probably the real issue. It is probably
> > stopping the CROND from getting enough kernel time to launch. But I
> > can't tell what is really going on here from your info. Check into
> > the VMware behaviour - log files? I don't know about VMware so can't
> > help you there.
>
> My feeling is that the reported CPU usage for VMware is being skewed by
> the blocking CRONDs, as well as the load ave.

No, I think VMware is the issue. The CRONDs seem to be dead children -- I
don't know why they are spawning like this, and your list does not show
start times -- but it could be that they are the normally launched
10-minute-interval sa1's, all hanging asleep in some deadlock state on a
resource VMware has locked. I would focus on why VMware is misbehaving
here; it's eating the CPU alive.

> If you compare the machine, named seq, with the 50.0 load ave.:
>
> seq:~# vmstat 1
> procs                      memory    swap          io     system        cpu
>  r  b  w   swpd   free   buff  cache  si  so  bi  bo   in    cs  us  sy  id
>  1 50  0  49916   6652  23020 434396   0   0  13  10   11     6   2   2   8
>  0 50  0  49916   6652  23024 434392   0   0   0  76  215  1288   2  61  37
>  0 50  0  49916   6652  23024 434392   0   0   0   0  187  1190   0  60  40
>  0 50  0  49916   6652  23024 434392   0   0   0   0  189  1228   1  56  42
>  0 50  0  49916   6652  23024 434392   0   0   0   0  182  1246   1  54  45
>  0 50  0  49916   6652  23024 434392   0   0   0   0  191  1281   2  56  42
>  0 50  0  49916   6652  23024 434392   0   0   0  48  191  1264   1  52  46
>  0 50  0  49916   6652  23024 434392   0   0   0   0  191  1241   1  54  45
>  0 50  0  49916   6652  23024 434392   0   0   0   0  192  1210   0  56  44
>
> To this machine, named ack, which is doing even more work (note the
> higher cs [context switches]):

That is processes going in and out of the run state, basically. It could
be VMware in a high-speed loop, blocking on something for a microsecond
and then waking up on an alarm timeout. As I recall, one of the functions
of select() is to provide fine-granularity microsecond timers and alarms,
as well as to await events on TCP/IP sockets.
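You can watch that from the live trace. Something like this (standard
strace flags; -tt prints microsecond timestamps) will show each
individual select() call VMware makes, including the timeout value it
passes in:

seq:~# strace -tt -e trace=select -p 2003

A steady stream of select() calls with zero or tiny timeouts would
confirm the spin-loop theory; long timeouts that return immediately would
point at a socket or device that is constantly ready.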
> ack:~# vmstat 1
> procs                      memory    swap          io     system        cpu
>  r  b  w   swpd   free   buff  cache  si  so  bi  bo   in    cs  us  sy  id
>  0  0  0  43384   9580  26676 905376   0   0   5   3    2     2   1   7   7
>  0  0  0  43384   9564  26676 905380   0   4   4 204  462  2313   0   3  97
>  0  0  0  43384   9564  26676 905380   0   0   0   0  351  2311   0   2  98
>  0  0  0  43384   9568  26676 905400   0   0  12   4  376  2384   0   3  97
>  0  0  0  43384   9564  26676 905400   0   0   0 196  396  2238   0   3  97
>  0  0  0  43384   9568  26676 905400   0   0   0   0  336  2297   0   3  97
>  0  0  0  43384   9568  26676 905400   0   0   0   0  310  2267   0   1  98
>  0  0  0  43384   9568  26676 905400   0   0   0   4  342  2284   0   1  99
>  0  0  0  43384   9556  26676 905416   0   0   8   0  350  2309   0   4  96
>
> I'd expect to find that the first machine here, seq, would be doing
> more work since its CPU idle time is lower. Though it's actually doing
> about half as much as the machine with the 90% idle time.
>
> Here's what vmware is doing on seq:
>
> seq:~# strace -c -T -p 2003
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  99.57    1.878046        4278       439           select
>   0.18    0.003455           8       459           gettimeofday
>   0.11    0.002108         211        10           write
>   0.05    0.000893           8       112           ioctl
>   0.03    0.000637           4       145           getpid
>   0.01    0.000239          48         5           unlink
>   0.01    0.000209          42         5           mknod
>   0.01    0.000186          19        10           open
>   0.01    0.000175          18        10           close
>   0.01    0.000113          23         5           stat64
>   0.00    0.000052           5        10           kill
> ------ ----------- ----------- --------- --------- ------------
> 100.00    1.886113                  1210           total

This is either a spin loop on a TCP/IP resource, or VMware is trying to
get access to a resource and is using the select() microtimer interface
to abort the call if it blocks for more than N microseconds or so. Do
VMware and SA have any system resource in common? I am not familiar with
how VMware operates or what it does, but it seems like something is out
of whack here that both SA and VMware need access to: VMware is spinning
its wheels and sucking CPU trying to gain access to that resource, while
SA is blocking on it, not dying in the approved timeframe, and then being
joined a few minutes later by the next instance of SA launched by CROND
-- so you eventually collect a sleeping crowd of SA's in addition to the
VMware issue. Maybe an on-disk lock file left behind that an rc.d script
should be removing? I can't guess; I think you need to find out the
actual resources that VMware and SA are blocking on. The WCHAN column in
ps, which is a kernel virtual memory address of the resource being waited
on, compared against a kernel symbol-table name-to-address map, may
provide the needed clue to what resource is hanging these two. I think it
is the nm command that dumps symbol tables, though I am not sure that is
correct for Linux; there is also a file under /boot (System.map) that
contains a map of addresses to kernel symbol names.

> Versus, ack (approx. the same amount of capture time):
>
> ack:~# strace -T -c -p 2158
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  99.06    4.555363        7431       613           select
>   0.46    0.021101         237        89           ioctl
>   0.23    0.010421         197        53           write
>   0.19    0.008532          14       610           gettimeofday
>   0.02    0.001068           7       146           getpid
>   0.01    0.000334         334         1           statfs
>   0.01    0.000319           8        38         3 read
>   0.01    0.000306          51         6           unlink
>   0.01    0.000267          45         6           mknod
>   0.01    0.000265          22        12           close
>   0.01    0.000264          22        12           open
>   0.00    0.000144          29         5           stat64
>   0.00    0.000052           7         8           kill
> ------ ----------- ----------- --------- --------- ------------
> 100.00    4.598436                  1599         3 total

See man 2 select (or info select) to verify the microtimer feature.

> I dunno... Any ideas? I can't do a strace on the CROND either, it's
> blocking too hard to even see what call it's blocking on.
>
> Thanks,
>
> Cove
>
>
> --
> redhat-list mailing list
> unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
> https://www.redhat.com/mailman/listinfo/redhat-list
>

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list