Thanks for the response.
On Friday, October 24, 2003, at 07:08 AM, Kenneth Goodwin wrote:
> The CRONDs without a PPID of 1 are the children of the master CROND
> process. They are trying to launch a command for you. Determine what
> that command is first. There is no info herein that defines that.
They all have PPIDs of 1, oddly enough. Good point.
> 1) Check the log file for CRON. See if you can determine what process
> is being launched here.
The only thing listed in the cron log is the system accounting shell script. When I run it manually it executes very fast, but with a high system time:
seq:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total
compared to another machine:
ack:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.00s system 0% cpu 0.007 total
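For what it's worth, that system time could be broken down further with strace's summary mode; sa1 is a shell wrapper around sadc, so -f is needed to follow the fork (assuming strace is installed here):

# count syscalls and time per call across sa1 and its sadc child
seq:~# strace -c -f /usr/lib/sa/sa1 1 1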
From the cron log:

Oct 21 07:01:00 seq CROND[1533]: (root) CMD (run-parts /etc/cron.hourly)
Oct 21 07:10:00 seq CROND[1545]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:20:00 seq CROND[1561]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:30:00 seq CROND[1573]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:40:00 seq CROND[1586]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:50:00 seq CROND[1603]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:00:00 seq CROND[1615]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:01:00 seq CROND[1618]: (root) CMD (run-parts /etc/cron.hourly)
Oct 21 08:10:00 seq CROND[1629]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:20:00 seq CROND[1644]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:30:00 seq CROND[1656]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:40:00 seq CROND[1668]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:50:00 seq CROND[1685]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 09:00:00 seq CROND[1697]: (root) CMD (/usr/lib/sa/sa1 1 1)
> 2) I don't think the problem really relates to the MTU change, but a
> 9000 MTU seems kinda high for gigabit. I could be wrong. I would have
> to check, but I vaguely recall a recommended max MTU on my Antares GE
> cards of around 4KB. The change probably triggered an existing bug in
> one of the programs cron is spawning for you. You might want to
> recheck that setting for correctness. MTUs are usually in multiples
> of 1K less the Ethernet MAC packet header size, as I recall. Just
> make sure you are not setting it above the maximum limit allowed.
That's what I was thinking; maybe the change triggered a bug or a legacy issue in something.
The 9000 MTU is for Jumbo Frames, i.e. Ethernet frames larger than the standard 1500-byte payload; 9000 bytes is the common size. I was able to verify that it worked on the machine using ping and doing packet captures (tg3 driver w/ Broadcom cards).
See:
http://sd.wareonearth.com/~phil/jumbo.html
http://www.uoregon.edu/~joe/jumbo-clean-gear.html
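For reference, this is roughly how I verified it (eth0 and the iputils ping -M flag assumed; 8972 data bytes + 8 ICMP header + 20 IP header = 9000, with DF set so nothing fragments along the way):

# set the jumbo MTU on the interface
seq:~# ifconfig eth0 mtu 9000
# send a full-size frame with the don't-fragment bit set
seq:~# ping -M do -s 8972 ack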
> 3) A ps -ef or equivalent will show you the children and parents. Do
> a ps -ef | grep [a crond child pid] and see if there is a child
> process hanging off the CROND sub-child process. Using some sample
> PIDs from your ps below: master CROND (PID = 2415 (example), PPID =
> 1) -forks-> child CROND service process (PID = 2426 from your list,
> PPID = 2415). So what process, if any, is 2426 forking off here?
It just forks another crond. I'm guessing it forks so it can then call exec(2), but then exec(2) is hanging for some reason. I don't know why init ends up as the parent of all these CRONDs, though.
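One thing I haven't tried yet: attaching strace with -f to the live crond (PID 9823, the "real" one in the listings below) before the next run fires. That should follow the fork and record the child's last syscall before it wedges:

# -f follows children, -tt timestamps each call, output to a file
seq:~# strace -f -tt -o /tmp/crond.trace -p 9823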
Here's the tree:
# pstree -p
init(1)-+-aacraid(19)
        |-atd(1872)
        |-bdflush(11)
        |-crond(2415)---crond(2416)
        |-crond(2426)---crond(2427)
        |-crond(2442)---crond(2443)
        |-crond(2453)---crond(2454)
        |-crond(2455)---crond(2456)
        |-crond(2466)---crond(2467)
        |-crond(2480)---crond(2481)
        |-crond(2491)---crond(2492)
        |-crond(2502)---crond(2503)
        |-crond(2515)---crond(2516)
        |-crond(2529)---crond(2530)
        |-crond(2531)---crond(2532)
        |-crond(2542)---crond(2543)
        |-crond(2553)---crond(2554)
        |-crond(2567)---crond(2568)
        |-crond(2578)---crond(2579)
        |-crond(2591)---crond(2592)
        |-crond(2606)---crond(2607)
        |-crond(2608)---crond(2609)
        |-crond(2619)---crond(2620)
        |-crond(2630)---crond(2631)
        |-crond(2644)---crond(2645)
        |-crond(2655)---crond(2656)
        |-crond(2668)---crond(2669)
        |-crond(2682)---crond(2683)
        |-crond(2684)---crond(2685)
        |-crond(2695)---crond(2696)
        |-crond(2706)---crond(2707)
        |-crond(2720)---crond(2721)
        |-crond(2731)---crond(2732)
        |-crond(2744)---crond(2745)
        |-crond(2758)---crond(2759)
        |-crond(2760)---crond(2761)
        |-crond(2771)---crond(2772)
        |-crond(2782)---crond(2783)
        |-crond(2796)---crond(2797)
        |-crond(2818)---crond(2819)
        |-crond(2832)---crond(2833)
        |-crond(2834)---crond(2835)
        |-crond(2845)---crond(2846)
        |-crond(2856)---crond(2857)
        |-crond(2870)---crond(2871)
        |-crond(2881)---crond(2882)
        |-crond(2892)---crond(2893)
        |-crond(2907)---crond(2908)
        |-crond(2909)---crond(2910)
        |-crond(2920)---crond(2921)
        |-crond(2931)---crond(2932)
        |-crond(2943)---crond(2944)
        |-crond(2807)---crond(2808)
        |-crond(9823)
        ...
# ps axj
 PPID   PID  PGID   SID TTY   TPGID STAT   UID   TIME COMMAND
    0     1     0     0 ?        -1 S        0   0:17 init
...
    1  2415  1547  1547 ?        -1 D        0   0:00 CROND
    1  2426  1547  1547 ?        -1 D        0   0:00 CROND
    1  2442  1547  1547 ?        -1 D        0   0:00 CROND
    1  2453  1547  1547 ?        -1 D        0   0:00 CROND
    1  2455  1547  1547 ?        -1 D        0   0:00 CROND
    1  2466  1547  1547 ?        -1 D        0   0:00 CROND
    1  2480  1547  1547 ?        -1 D        0   0:00 CROND
    1  2491  1547  1547 ?        -1 D        0   0:00 CROND
    1  2502  1547  1547 ?        -1 D        0   0:00 CROND
    1  2515  1547  1547 ?        -1 D        0   0:00 CROND
    1  2529  1547  1547 ?        -1 D        0   0:00 CROND
    1  2531  1547  1547 ?        -1 D        0   0:00 CROND
    1  2542  1547  1547 ?        -1 D        0   0:00 CROND
    1  2553  1547  1547 ?        -1 D        0   0:00 CROND
    1  2567  1547  1547 ?        -1 D        0   0:00 CROND
    1  2578  1547  1547 ?        -1 D        0   0:00 CROND
    1  2591  1547  1547 ?        -1 D        0   0:00 CROND
    1  2606  1547  1547 ?        -1 D        0   0:00 CROND
    1  2608  1547  1547 ?        -1 D        0   0:00 CROND
    1  2619  1547  1547 ?        -1 D        0   0:00 CROND
    1  2630  1547  1547 ?        -1 D        0   0:00 CROND
    1  2644  1547  1547 ?        -1 D        0   0:00 CROND
    1  2655  1547  1547 ?        -1 D        0   0:00 CROND
    1  2668  1547  1547 ?        -1 D        0   0:00 CROND
    1  2682  1547  1547 ?        -1 D        0   0:00 CROND
    1  2684  1547  1547 ?        -1 D        0   0:00 CROND
    1  2695  1547  1547 ?        -1 D        0   0:00 CROND
    1  2706  1547  1547 ?        -1 D        0   0:00 CROND
    1  2720  1547  1547 ?        -1 D        0   0:00 CROND
    1  2731  1547  1547 ?        -1 D        0   0:00 CROND
    1  2744  1547  1547 ?        -1 D        0   0:00 CROND
    1  2758  1547  1547 ?        -1 D        0   0:00 CROND
    1  2760  1547  1547 ?        -1 D        0   0:00 CROND
    1  2771  1547  1547 ?        -1 D        0   0:00 CROND
    1  2782  1547  1547 ?        -1 D        0   0:00 CROND
    1  2796  1547  1547 ?        -1 D        0   0:00 CROND
    1  2818  1547  1547 ?        -1 D        0   0:00 CROND
    1  2832  1547  1547 ?        -1 D        0   0:00 CROND
    1  2834  1547  1547 ?        -1 D        0   0:00 CROND
    1  2845  1547  1547 ?        -1 D        0   0:00 CROND
    1  2856  1547  1547 ?        -1 D        0   0:00 CROND
    1  2870  1547  1547 ?        -1 D        0   0:00 CROND
    1  2881  1547  1547 ?        -1 D        0   0:00 CROND
    1  2892  1547  1547 ?        -1 D        0   0:00 CROND
    1  2907  1547  1547 ?        -1 D        0   0:00 CROND
    1  2909  1547  1547 ?        -1 D        0   0:00 CROND
    1  2920  1547  1547 ?        -1 D        0   0:00 CROND
    1  2931  1547  1547 ?        -1 D        0   0:00 CROND
    1  2943  1547  1547 ?        -1 D        0   0:00 CROND
    1  2807  1547  1547 ?        -1 D        0   0:00 CROND
    1  9823  9823  9823 ?        -1 S        0   0:00 crond
This last entry here is the "real" crond. I've restarted it a few times; that's why the PID is so high.
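Since strace won't attach usefully to a D-state process, /proc is about the only window into one of these. A few things worth checking, using the 2426/2427 pair as the example (the wchan column only resolves to a symbol if ps can read your System.map):

# kernel wait channel, state, and command for the stuck pair
seq:~# ps -o pid,stat,wchan:30,comm -p 2426,2427
# state straight from /proc
seq:~# grep State /proc/2426/status
# any file descriptors it might be wedged on
seq:~# ls -l /proc/2426/fd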
> 4) Your real problem seems to be that VMware PID 2003 is absorbing
> 99% of the CPU doing "something". I don't know from your info here if
> this is a cron-launched process or a separate process launched by
> init. But this is probably the real issue. It is probably stopping
> the CROND from getting enough kernel time to launch. I can't tell
> what is really going on here from your info, but check into the
> VMware behaviour - log files? I don't know about VMware, so I can't
> help you there.
My feeling is that the reported CPU usage for VMware is being skewed by the blocked CRONDs, and so is the load average: Linux counts processes in uninterruptible sleep (D state) toward the load average, so fifty wedged CRONDs push it to ~50 on their own.
If you compare this machine, seq, with the 50.0 load average:
seq:~# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 50 0 49916 6652 23020 434396 0 0 13 10 11 6 2 2 8
0 50 0 49916 6652 23024 434392 0 0 0 76 215 1288 2 61 37
0 50 0 49916 6652 23024 434392 0 0 0 0 187 1190 0 60 40
0 50 0 49916 6652 23024 434392 0 0 0 0 189 1228 1 56 42
0 50 0 49916 6652 23024 434392 0 0 0 0 182 1246 1 54 45
0 50 0 49916 6652 23024 434392 0 0 0 0 191 1281 2 56 42
0 50 0 49916 6652 23024 434392 0 0 0 48 191 1264 1 52 46
0 50 0 49916 6652 23024 434392 0 0 0 0 191 1241 1 54 45
0 50 0 49916 6652 23024 434392 0 0 0 0 192 1210 0 56 44
to this machine, ack, which is doing even more work (note the higher cs [context switches] column):
ack:~# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 43384 9580 26676 905376 0 0 5 3 2 2 1 7 7
0 0 0 43384 9564 26676 905380 0 4 4 204 462 2313 0 3 97
0 0 0 43384 9564 26676 905380 0 0 0 0 351 2311 0 2 98
0 0 0 43384 9568 26676 905400 0 0 12 4 376 2384 0 3 97
0 0 0 43384 9564 26676 905400 0 0 0 196 396 2238 0 3 97
0 0 0 43384 9568 26676 905400 0 0 0 0 336 2297 0 3 97
0 0 0 43384 9568 26676 905400 0 0 0 0 310 2267 0 1 98
0 0 0 43384 9568 26676 905400 0 0 0 4 342 2284 0 1 99
0 0 0 43384 9556 26676 905416 0 0 8 0 350 2309 0 4 96
I'd expect the first machine, seq, to be doing more work, since its CPU idle time is lower. Yet it's actually doing about half as much as ack, the machine sitting at ~97% idle.
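(The 50 parked in vmstat's b column is those blocked CRONDs. A quick way to enumerate everything stuck in D state, assuming procps ps:)

# list pid, state, wait channel, and name for uninterruptible processes
seq:~# ps axo pid,stat,wchan:25,comm | awk '$2 ~ /^D/'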
Here's what vmware is doing on seq:
seq:~# strace -c -T -p 2003
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.57    1.878046        4278       439           select
  0.18    0.003455           8       459           gettimeofday
  0.11    0.002108         211        10           write
  0.05    0.000893           8       112           ioctl
  0.03    0.000637           4       145           getpid
  0.01    0.000239          48         5           unlink
  0.01    0.000209          42         5           mknod
  0.01    0.000186          19        10           open
  0.01    0.000175          18        10           close
  0.01    0.000113          23         5           stat64
  0.00    0.000052           5        10           kill
------ ----------- ----------- --------- --------- ----------------
100.00    1.886113                  1210           total
Versus ack (approximately the same amount of capture time):
ack:~# strace -T -c -p 2158
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.06    4.555363        7431       613           select
  0.46    0.021101         237        89           ioctl
  0.23    0.010421         197        53           write
  0.19    0.008532          14       610           gettimeofday
  0.02    0.001068           7       146           getpid
  0.01    0.000334         334         1           statfs
  0.01    0.000319           8        38         3 read
  0.01    0.000306          51         6           unlink
  0.01    0.000267          45         6           mknod
  0.01    0.000265          22        12           close
  0.01    0.000264          22        12           open
  0.00    0.000144          29         5           stat64
  0.00    0.000052           7         8           kill
------ ----------- ----------- --------- --------- ----------------
100.00    4.598436                  1599         3 total
In both traces nearly all the time is spent sleeping in select(), which suggests VMware isn't really spinning on either box. I dunno... any ideas? I can't strace the stuck CRONDs either; they're blocked too hard to even see what call they're blocking on.
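One more thing I may try, assuming this kernel was built with magic SysRq support: a SysRq-T dump writes every task's state and kernel stack to the log, which should show exactly where the D-state CRONDs are sleeping (grepping for crond on the assumption the kernel comm name is still crond, since the children never reach exec):

# enable sysrq, trigger the task dump, then read it back
seq:~# echo 1 > /proc/sys/kernel/sysrq
seq:~# echo t > /proc/sysrq-trigger
seq:~# dmesg | grep -i -A 4 crond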
Thanks,
Cove