Re: Many instances of CROND running, load raised (RH9)

Hi Kenneth,

Thanks for the response.

On Friday, October 24, 2003, at 07:08 AM, Kenneth Goodwin wrote:

The CRONDs without a PPID of 1 are the children of the master CROND
process. They are trying to launch a command for you. Determine what
that command is first. There is no info herein that defines that.

They all have PPIDs of 1, oddly enough. Good point.
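
(For the record, the quick check was just something along these lines, with the PPID in the first column:

# ps axj | grep -i crond
)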


1) Check the log file for CRON. See if you can determine what process
is being launched here.


The only thing listed in the cron log is the system accounting shell script. When I run it manually it executes very fast, but with a high system time:


seq:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total

compared to another machine:

ack:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.00s system 0% cpu 0.007 total
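
To see where that system time actually goes, the same syscall profiling I use further down could be pointed at the script itself; a sketch, untried so far (-f follows the forks into whatever sa1 ends up running):

# strace -f -c /usr/lib/sa/sa1 1 1

For reference, here's the stretch of the cron log on seq I'm going by: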

Oct 21 07:01:00 seq CROND[1533]: (root) CMD (run-parts /etc/cron.hourly)
Oct 21 07:10:00 seq CROND[1545]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:20:00 seq CROND[1561]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:30:00 seq CROND[1573]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:40:00 seq CROND[1586]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:50:00 seq CROND[1603]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:00:00 seq CROND[1615]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:01:00 seq CROND[1618]: (root) CMD (run-parts /etc/cron.hourly)
Oct 21 08:10:00 seq CROND[1629]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:20:00 seq CROND[1644]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:30:00 seq CROND[1656]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:40:00 seq CROND[1668]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:50:00 seq CROND[1685]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 09:00:00 seq CROND[1697]: (root) CMD (/usr/lib/sa/sa1 1 1)
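
run-parts /etc/cron.hourly shows up in there too and I haven't ruled it out yet; listing what it would run is cheap, and each script could be timed by hand the same way as sa1 above (only if they're safe to re-run by hand, of course):

# ls -l /etc/cron.hourly/
# for f in /etc/cron.hourly/*; do echo "== $f"; time "$f"; done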


2) I don't think the problem really relates to the MTU change, but a
9000 MTU seems kinda high for gigabit. I could be wrong; I would have
to check, but I vaguely recall a recommended max MTU of around 4KB on
my Antares GE cards. The change probably triggered an existing bug in
one of the programs cron is spawning for you. You might want to
recheck that setting for correctness. MTUs are usually multiples of
1K less the Ethernet MAC packet header size, as I recall. Just make
sure you are not setting it above the maximum limit allowed.

That's what I was thinking: maybe it triggered a bug or a legacy issue in something.


The 9000 MTU is for jumbo frames; anything over the standard 1500-byte Ethernet payload counts as jumbo, and 9000 bytes is the usual size. I was able to verify that it worked on the machine using ping and packet captures (tg3 driver with Broadcom cards).

See:
http://sd.wareonearth.com/~phil/jumbo.html
http://www.uoregon.edu/~joe/jumbo-clean-gear.html
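
The verification was roughly along these lines (a sketch from memory; eth0 and the peer address are placeholders, 8972 = 9000 minus the 20-byte IP and 8-byte ICMP headers, and -M do forbids fragmentation if this ping supports it):

# ifconfig eth0 mtu 9000
# ping -s 8972 -M do 10.0.0.2
# tcpdump -i eth0 -e -n icmp

The -e on tcpdump prints the link-level header, so the on-wire frame size is visible.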

3) A ps -ef or equivalent will show you the children and parents.
Do a ps -ef | grep [a crond child pid] and see if there is a child
process hanging off the CROND sub-child process, using some sample
PIDs from your ps below: the master CROND (PID = 2415 for example,
PPID = 1) forks a child CROND service process (PID = 2426 from your
list, PPID = 2415). So what process, if any, is 2426 forking off here?

It just forks another crond. I'm guessing it forks so it can then call exec(2), but the exec(2) is hanging for some reason. I don't understand why init shows up as the parent of all these CRONDs, though, unless they simply got reparented to init (orphans get adopted by PID 1) when I restarted crond.
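
One way to check whether a child ever made it through the exec (a sketch, using PID 2427 from the tree below as an example):

# head -2 /proc/2427/status
# tr '\0' ' ' < /proc/2427/cmdline; echo

The first should still show Name: crond / State: D if the exec never completed; the second shows the argv it is still carrying (though reading cmdline can itself hang if the process is wedged deep enough in the kernel).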


Here's the tree:

# pstree -p
init(1)-+-aacraid(19)
        |-atd(1872)
        |-bdflush(11)
        |-crond(2415)---crond(2416)
        |-crond(2426)---crond(2427)
        |-crond(2442)---crond(2443)
        |-crond(2453)---crond(2454)
        |-crond(2455)---crond(2456)
        |-crond(2466)---crond(2467)
        |-crond(2480)---crond(2481)
        |-crond(2491)---crond(2492)
        |-crond(2502)---crond(2503)
        |-crond(2515)---crond(2516)
        |-crond(2529)---crond(2530)
        |-crond(2531)---crond(2532)
        |-crond(2542)---crond(2543)
        |-crond(2553)---crond(2554)
        |-crond(2567)---crond(2568)
        |-crond(2578)---crond(2579)
        |-crond(2591)---crond(2592)
        |-crond(2606)---crond(2607)
        |-crond(2608)---crond(2609)
        |-crond(2619)---crond(2620)
        |-crond(2630)---crond(2631)
        |-crond(2644)---crond(2645)
        |-crond(2655)---crond(2656)
        |-crond(2668)---crond(2669)
        |-crond(2682)---crond(2683)
        |-crond(2684)---crond(2685)
        |-crond(2695)---crond(2696)
        |-crond(2706)---crond(2707)
        |-crond(2720)---crond(2721)
        |-crond(2731)---crond(2732)
        |-crond(2744)---crond(2745)
        |-crond(2758)---crond(2759)
        |-crond(2760)---crond(2761)
        |-crond(2771)---crond(2772)
        |-crond(2782)---crond(2783)
        |-crond(2796)---crond(2797)
        |-crond(2818)---crond(2819)
        |-crond(2832)---crond(2833)
        |-crond(2834)---crond(2835)
        |-crond(2845)---crond(2846)
        |-crond(2856)---crond(2857)
        |-crond(2870)---crond(2871)
        |-crond(2881)---crond(2882)
        |-crond(2892)---crond(2893)
        |-crond(2907)---crond(2908)
        |-crond(2909)---crond(2910)
        |-crond(2920)---crond(2921)
        |-crond(2931)---crond(2932)
        |-crond(2943)---crond(2944)
        |-crond(2807)---crond(2808)
        |-crond(9823)
...

# ps axj
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    0     1     0     0 ?           -1 S        0   0:17 init
...
    1  2415  1547  1547 ?           -1 D        0   0:00 CROND
    1  2426  1547  1547 ?           -1 D        0   0:00 CROND
    1  2442  1547  1547 ?           -1 D        0   0:00 CROND
    1  2453  1547  1547 ?           -1 D        0   0:00 CROND
    1  2455  1547  1547 ?           -1 D        0   0:00 CROND
    1  2466  1547  1547 ?           -1 D        0   0:00 CROND
    1  2480  1547  1547 ?           -1 D        0   0:00 CROND
    1  2491  1547  1547 ?           -1 D        0   0:00 CROND
    1  2502  1547  1547 ?           -1 D        0   0:00 CROND
    1  2515  1547  1547 ?           -1 D        0   0:00 CROND
    1  2529  1547  1547 ?           -1 D        0   0:00 CROND
    1  2531  1547  1547 ?           -1 D        0   0:00 CROND
    1  2542  1547  1547 ?           -1 D        0   0:00 CROND
    1  2553  1547  1547 ?           -1 D        0   0:00 CROND
    1  2567  1547  1547 ?           -1 D        0   0:00 CROND
    1  2578  1547  1547 ?           -1 D        0   0:00 CROND
    1  2591  1547  1547 ?           -1 D        0   0:00 CROND
    1  2606  1547  1547 ?           -1 D        0   0:00 CROND
    1  2608  1547  1547 ?           -1 D        0   0:00 CROND
    1  2619  1547  1547 ?           -1 D        0   0:00 CROND
    1  2630  1547  1547 ?           -1 D        0   0:00 CROND
    1  2644  1547  1547 ?           -1 D        0   0:00 CROND
    1  2655  1547  1547 ?           -1 D        0   0:00 CROND
    1  2668  1547  1547 ?           -1 D        0   0:00 CROND
    1  2682  1547  1547 ?           -1 D        0   0:00 CROND
    1  2684  1547  1547 ?           -1 D        0   0:00 CROND
    1  2695  1547  1547 ?           -1 D        0   0:00 CROND
    1  2706  1547  1547 ?           -1 D        0   0:00 CROND
    1  2720  1547  1547 ?           -1 D        0   0:00 CROND
    1  2731  1547  1547 ?           -1 D        0   0:00 CROND
    1  2744  1547  1547 ?           -1 D        0   0:00 CROND
    1  2758  1547  1547 ?           -1 D        0   0:00 CROND
    1  2760  1547  1547 ?           -1 D        0   0:00 CROND
    1  2771  1547  1547 ?           -1 D        0   0:00 CROND
    1  2782  1547  1547 ?           -1 D        0   0:00 CROND
    1  2796  1547  1547 ?           -1 D        0   0:00 CROND
    1  2818  1547  1547 ?           -1 D        0   0:00 CROND
    1  2832  1547  1547 ?           -1 D        0   0:00 CROND
    1  2834  1547  1547 ?           -1 D        0   0:00 CROND
    1  2845  1547  1547 ?           -1 D        0   0:00 CROND
    1  2856  1547  1547 ?           -1 D        0   0:00 CROND
    1  2870  1547  1547 ?           -1 D        0   0:00 CROND
    1  2881  1547  1547 ?           -1 D        0   0:00 CROND
    1  2892  1547  1547 ?           -1 D        0   0:00 CROND
    1  2907  1547  1547 ?           -1 D        0   0:00 CROND
    1  2909  1547  1547 ?           -1 D        0   0:00 CROND
    1  2920  1547  1547 ?           -1 D        0   0:00 CROND
    1  2931  1547  1547 ?           -1 D        0   0:00 CROND
    1  2943  1547  1547 ?           -1 D        0   0:00 CROND
    1  2807  1547  1547 ?           -1 D        0   0:00 CROND
    1  9823  9823  9823 ?           -1 S        0   0:00 crond

This last entry is the "real" crond. I've restarted it a few times, which is why its PID is so high.

4) Your real problem seems to be that VMware PID 2003 is absorbing
99% of the CPU doing "something". I don't know from your info here
whether this is a cron-launched process or a separate process
launched by init, but this is probably the real issue. It is
probably stopping the CRONDs from getting enough kernel time to
launch. I can't tell what is really going on here from your info,
but check into the VMware behaviour - log files? I don't know about
VMware, so I can't help you there.

My feeling is that the reported CPU usage for VMware is being skewed by the blocked CRONDs, and so is the load average: Linux counts processes stuck in uninterruptible (D-state) sleep toward the load average, so ~50 wedged CRONDs would account for the 50.0 pretty much by themselves.
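
A quick way to count them, just to confirm the load number lines up (a sketch; the third column of ps ax is the state, and D is uninterruptible sleep):

# ps ax | awk '$3 ~ /^D/' | wc -l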


If you compare seq, the machine with the 50.0 load average:

seq:~# vmstat 1
   procs                      memory      swap          io     system         cpu
 r  b w   swpd   free   buff  cache  si  so    bi    bo   in     cs  us  sy  id
 1 50 0  49916   6652  23020 434396   0   0    13    10   11      6   2   2   8
 0 50 0  49916   6652  23024 434392   0   0     0    76  215   1288   2  61  37
 0 50 0  49916   6652  23024 434392   0   0     0     0  187   1190   0  60  40
 0 50 0  49916   6652  23024 434392   0   0     0     0  189   1228   1  56  42
 0 50 0  49916   6652  23024 434392   0   0     0     0  182   1246   1  54  45
 0 50 0  49916   6652  23024 434392   0   0     0     0  191   1281   2  56  42
 0 50 0  49916   6652  23024 434392   0   0     0    48  191   1264   1  52  46
 0 50 0  49916   6652  23024 434392   0   0     0     0  191   1241   1  54  45
 0 50 0  49916   6652  23024 434392   0   0     0     0  192   1210   0  56  44


to ack, which is doing even more work (note the higher cs [context switches] column):

ack:~# vmstat 1
   procs                      memory      swap          io     system         cpu
 r  b w   swpd   free   buff  cache  si  so    bi    bo   in     cs  us  sy  id
 0  0 0  43384   9580  26676 905376   0   0     5     3    2      2   1   7   7
 0  0 0  43384   9564  26676 905380   0   4     4   204  462   2313   0   3  97
 0  0 0  43384   9564  26676 905380   0   0     0     0  351   2311   0   2  98
 0  0 0  43384   9568  26676 905400   0   0    12     4  376   2384   0   3  97
 0  0 0  43384   9564  26676 905400   0   0     0   196  396   2238   0   3  97
 0  0 0  43384   9568  26676 905400   0   0     0     0  336   2297   0   3  97
 0  0 0  43384   9568  26676 905400   0   0     0     0  310   2267   0   1  98
 0  0 0  43384   9568  26676 905400   0   0     0     4  342   2284   0   1  99
 0  0 0  43384   9556  26676 905416   0   0     8     0  350   2309   0   4  96


I'd expect seq, the first machine here, to be doing more work since its CPU idle time is lower, yet it's actually doing about half as much as the machine sitting at roughly 97% idle. Note also the b (blocked) column stuck at 50 on seq; presumably that's the wedged CRONDs.

Here's what vmware is doing on seq:

seq:~# strace -c -T -p 2003
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.57    1.878046        4278       439           select
  0.18    0.003455           8       459           gettimeofday
  0.11    0.002108         211        10           write
  0.05    0.000893           8       112           ioctl
  0.03    0.000637           4       145           getpid
  0.01    0.000239          48         5           unlink
  0.01    0.000209          42         5           mknod
  0.01    0.000186          19        10           open
  0.01    0.000175          18        10           close
  0.01    0.000113          23         5           stat64
  0.00    0.000052           5        10           kill
------ ----------- ----------- --------- --------- ----------------
100.00    1.886113                  1210           total

Versus ack (approximately the same amount of capture time):

ack:~# strace -T -c -p 2158

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.06    4.555363        7431       613           select
  0.46    0.021101         237        89           ioctl
  0.23    0.010421         197        53           write
  0.19    0.008532          14       610           gettimeofday
  0.02    0.001068           7       146           getpid
  0.01    0.000334         334         1           statfs
  0.01    0.000319           8        38         3 read
  0.01    0.000306          51         6           unlink
  0.01    0.000267          45         6           mknod
  0.01    0.000265          22        12           close
  0.01    0.000264          22        12           open
  0.00    0.000144          29         5           stat64
  0.00    0.000052           7         8           kill
------ ----------- ----------- --------- --------- ----------------
100.00    4.598436                  1599         3 total

I dunno... any ideas? I can't strace the stuck CRONDs either; they're blocked so hard that strace can't even attach to show what call they're hanging in.
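
About the only other things I can think of for peeking at a D-state process without strace (sketches, not tried here): the WCHAN column from ps, which shows the kernel function each one is sleeping in (assuming ps can resolve the symbol, otherwise it prints a raw address), and the sysrq task dump, which writes every task's kernel stack trace to the kernel log, if the kernel has sysrq support:

# ps axl | grep -i crond
# echo 1 > /proc/sys/kernel/sysrq
# echo t > /proc/sysrq-trigger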

Thanks,

Cove


