Thanks for the response.
On Friday, October 24, 2003, at 07:08 AM, Kenneth Goodwin wrote:
> The CRONDs without a PPID of 1 are the children of the master CROND
> process. They are trying to launch a command for you. Determine what
> that command is first. There is no info herein that defines that.
They all have PPIDs of 1, oddly enough. Good point.
> 1) Check the log file for CRON. See if you can determine what process
> is being launched here.
The only thing listed in the cron log is the system accounting shell script. When I run it manually it executes very fast, but with a high system time:
seq:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total
compared to another machine:
ack:~# time /usr/lib/sa/sa1 1 1
/usr/lib/sa/sa1 1 1  0.00s user 0.00s system 0% cpu 0.007 total
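For what it's worth, that system time could be broken down further with strace's summary mode; sa1 is a shell wrapper around sadc, so -f is needed to follow the fork (assuming strace is installed here):

# count syscalls and time per call across sa1 and its sadc child
seq:~# strace -c -f /usr/lib/sa/sa1 1 1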
From the cron log:

Oct 21 07:01:00 seq CROND[1533]: (root) CMD (run-parts /etc/cron.hourly)
Oct 21 07:10:00 seq CROND[1545]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:20:00 seq CROND[1561]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:30:00 seq CROND[1573]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:40:00 seq CROND[1586]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 07:50:00 seq CROND[1603]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:00:00 seq CROND[1615]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:01:00 seq CROND[1618]: (root) CMD (run-parts /etc/cron.hourly)
Oct 21 08:10:00 seq CROND[1629]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:20:00 seq CROND[1644]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:30:00 seq CROND[1656]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:40:00 seq CROND[1668]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 08:50:00 seq CROND[1685]: (root) CMD (/usr/lib/sa/sa1 1 1)
Oct 21 09:00:00 seq CROND[1697]: (root) CMD (/usr/lib/sa/sa1 1 1)
> 2) I don't think the problem really relates to the MTU change, but a
> 9000 MTU seems kinda high for gigabit. I could be wrong. I would have
> to check, but I vaguely recall a recommended max MTU on my Antares GE
> cards of around 4KB. The change probably triggered an existing bug in
> one of the programs cron is spawning for you. You might want to
> recheck that setting for correctness. MTUs are usually in multiples
> of 1K less the Ethernet MAC packet header size, as I recall. Just
> make sure you are not setting it above the maximum limit allowed.
That's what I was thinking; maybe the change triggered a bug or a legacy issue in something.
The 9000 MTU is for Jumbo Frames, i.e. Ethernet frames larger than the standard 1500-byte payload; 9000 bytes is the common size. I was able to verify that it worked on the machine using ping and doing packet captures (tg3 driver w/ Broadcom cards).
See:
http://sd.wareonearth.com/~phil/jumbo.html
http://www.uoregon.edu/~joe/jumbo-clean-gear.html
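For reference, this is roughly how I verified it (eth0 and the iputils ping -M flag assumed; 8972 data bytes + 8 ICMP header + 20 IP header = 9000, with DF set so nothing fragments along the way):

# set the jumbo MTU on the interface
seq:~# ifconfig eth0 mtu 9000
# send a full-size frame with the don't-fragment bit set
seq:~# ping -M do -s 8972 ack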
> 3) A ps -ef or equivalent will show you the children and parents. Do
> a ps -ef | grep [a crond child pid] and see if there is a child
> process hanging off the CROND sub-child process. Using some sample
> PIDs from your ps below: master CROND (PID = 2415 (example), PPID =
> 1) -forks-> child CROND service process (PID = 2426 from your list,
> PPID = 2415). So what process, if any, is 2426 forking off here?
It just forks another crond. I'm guessing it forks so it can then call exec(2), but then exec(2) is hanging for some reason. I don't know why init ends up as the parent of all these CRONDs, though.
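One thing I haven't tried yet: attaching strace with -f to the live crond (PID 9823, the "real" one in the listings below) before the next run fires. That should follow the fork and record the child's last syscall before it wedges:

# -f follows children, -tt timestamps each call, output to a file
seq:~# strace -f -tt -o /tmp/crond.trace -p 9823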
Here's the tree:
# pstree -p
init(1)-+-aacraid(19)
        |-atd(1872)
        |-bdflush(11)
        |-crond(2415)---crond(2416)
        |-crond(2426)---crond(2427)
        |-crond(2442)---crond(2443)
        |-crond(2453)---crond(2454)
        |-crond(2455)---crond(2456)
        |-crond(2466)---crond(2467)
        |-crond(2480)---crond(2481)
        |-crond(2491)---crond(2492)
        |-crond(2502)---crond(2503)
        |-crond(2515)---crond(2516)
        |-crond(2529)---crond(2530)
        |-crond(2531)---crond(2532)
        |-crond(2542)---crond(2543)
        |-crond(2553)---crond(2554)
        |-crond(2567)---crond(2568)
        |-crond(2578)---crond(2579)
        |-crond(2591)---crond(2592)
        |-crond(2606)---crond(2607)
        |-crond(2608)---crond(2609)
        |-crond(2619)---crond(2620)
        |-crond(2630)---crond(2631)
        |-crond(2644)---crond(2645)
        |-crond(2655)---crond(2656)
        |-crond(2668)---crond(2669)
        |-crond(2682)---crond(2683)
        |-crond(2684)---crond(2685)
        |-crond(2695)---crond(2696)
        |-crond(2706)---crond(2707)
        |-crond(2720)---crond(2721)
        |-crond(2731)---crond(2732)
        |-crond(2744)---crond(2745)
        |-crond(2758)---crond(2759)
        |-crond(2760)---crond(2761)
        |-crond(2771)---crond(2772)
        |-crond(2782)---crond(2783)
        |-crond(2796)---crond(2797)
        |-crond(2818)---crond(2819)
        |-crond(2832)---crond(2833)
        |-crond(2834)---crond(2835)
        |-crond(2845)---crond(2846)
        |-crond(2856)---crond(2857)
        |-crond(2870)---crond(2871)
        |-crond(2881)---crond(2882)
        |-crond(2892)---crond(2893)
        |-crond(2907)---crond(2908)
        |-crond(2909)---crond(2910)
        |-crond(2920)---crond(2921)
        |-crond(2931)---crond(2932)
        |-crond(2943)---crond(2944)
        |-crond(2807)---crond(2808)
        |-crond(9823)
        ...
# ps axj
 PPID   PID  PGID   SID TTY   TPGID STAT   UID   TIME COMMAND
    0     1     0     0 ?        -1 S        0   0:17 init
...
    1  2415  1547  1547 ?        -1 D        0   0:00 CROND
    1  2426  1547  1547 ?        -1 D        0   0:00 CROND
    1  2442  1547  1547 ?        -1 D        0   0:00 CROND
    1  2453  1547  1547 ?        -1 D        0   0:00 CROND
    1  2455  1547  1547 ?        -1 D        0   0:00 CROND
    1  2466  1547  1547 ?        -1 D        0   0:00 CROND
    1  2480  1547  1547 ?        -1 D        0   0:00 CROND
    1  2491  1547  1547 ?        -1 D        0   0:00 CROND
    1  2502  1547  1547 ?        -1 D        0   0:00 CROND
    1  2515  1547  1547 ?        -1 D        0   0:00 CROND
    1  2529  1547  1547 ?        -1 D        0   0:00 CROND
    1  2531  1547  1547 ?        -1 D        0   0:00 CROND
    1  2542  1547  1547 ?        -1 D        0   0:00 CROND
    1  2553  1547  1547 ?        -1 D        0   0:00 CROND
    1  2567  1547  1547 ?        -1 D        0   0:00 CROND
    1  2578  1547  1547 ?        -1 D        0   0:00 CROND
    1  2591  1547  1547 ?        -1 D        0   0:00 CROND
    1  2606  1547  1547 ?        -1 D        0   0:00 CROND
    1  2608  1547  1547 ?        -1 D        0   0:00 CROND
    1  2619  1547  1547 ?        -1 D        0   0:00 CROND
    1  2630  1547  1547 ?        -1 D        0   0:00 CROND
    1  2644  1547  1547 ?        -1 D        0   0:00 CROND
    1  2655  1547  1547 ?        -1 D        0   0:00 CROND
    1  2668  1547  1547 ?        -1 D        0   0:00 CROND
    1  2682  1547  1547 ?        -1 D        0   0:00 CROND
    1  2684  1547  1547 ?        -1 D        0   0:00 CROND
    1  2695  1547  1547 ?        -1 D        0   0:00 CROND
    1  2706  1547  1547 ?        -1 D        0   0:00 CROND
    1  2720  1547  1547 ?        -1 D        0   0:00 CROND
    1  2731  1547  1547 ?        -1 D        0   0:00 CROND
    1  2744  1547  1547 ?        -1 D        0   0:00 CROND
    1  2758  1547  1547 ?        -1 D        0   0:00 CROND
    1  2760  1547  1547 ?        -1 D        0   0:00 CROND
    1  2771  1547  1547 ?        -1 D        0   0:00 CROND
    1  2782  1547  1547 ?        -1 D        0   0:00 CROND
    1  2796  1547  1547 ?        -1 D        0   0:00 CROND
    1  2818  1547  1547 ?        -1 D        0   0:00 CROND
    1  2832  1547  1547 ?        -1 D        0   0:00 CROND
    1  2834  1547  1547 ?        -1 D        0   0:00 CROND
    1  2845  1547  1547 ?        -1 D        0   0:00 CROND
    1  2856  1547  1547 ?        -1 D        0   0:00 CROND
    1  2870  1547  1547 ?        -1 D        0   0:00 CROND
    1  2881  1547  1547 ?        -1 D        0   0:00 CROND
    1  2892  1547  1547 ?        -1 D        0   0:00 CROND
    1  2907  1547  1547 ?        -1 D        0   0:00 CROND
    1  2909  1547  1547 ?        -1 D        0   0:00 CROND
    1  2920  1547  1547 ?        -1 D        0   0:00 CROND
    1  2931  1547  1547 ?        -1 D        0   0:00 CROND
    1  2943  1547  1547 ?        -1 D        0   0:00 CROND
    1  2807  1547  1547 ?        -1 D        0   0:00 CROND
    1  9823  9823  9823 ?        -1 S        0   0:00 crond
This last entry here is the "real" crond. I've restarted it a few times; that's why the PID is so high.
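Since strace won't attach usefully to a D-state process, /proc is about the only window into one of these. A few things worth checking, using the 2426/2427 pair as the example (the wchan column only resolves to a symbol if ps can read your System.map):

# kernel wait channel, state, and command for the stuck pair
seq:~# ps -o pid,stat,wchan:30,comm -p 2426,2427
# state straight from /proc
seq:~# grep State /proc/2426/status
# any file descriptors it might be wedged on
seq:~# ls -l /proc/2426/fd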
> 4) Your real problem seems to be that VMware PID 2003 is absorbing
> 99% of the CPU doing "something". I don't know from your info here if
> this is a cron-launched process or a separate process launched by
> init. But this is probably the real issue. It is probably stopping
> the CROND from getting enough kernel time to launch. I can't tell
> what is really going on here from your info, but check into the
> VMware behaviour - log files? I don't know about VMware, so I can't
> help you there.
My feeling is that the reported CPU usage for VMware is being skewed by the blocked CRONDs, and so is the load average: Linux counts processes in uninterruptible sleep (D state) toward the load average, so fifty wedged CRONDs push it to ~50 on their own.
If you compare this machine, seq, with the 50.0 load average:
seq:~# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 50 0 49916 6652 23020 434396 0 0 13 10 11 6 2 2 8
0 50 0 49916 6652 23024 434392 0 0 0 76 215 1288 2 61 37
0 50 0 49916 6652 23024 434392 0 0 0 0 187 1190 0 60 40
0 50 0 49916 6652 23024 434392 0 0 0 0 189 1228 1 56 42
0 50 0 49916 6652 23024 434392 0 0 0 0 182 1246 1 54 45
0 50 0 49916 6652 23024 434392 0 0 0 0 191 1281 2 56 42
0 50 0 49916 6652 23024 434392 0 0 0 48 191 1264 1 52 46
0 50 0 49916 6652 23024 434392 0 0 0 0 191 1241 1 54 45
0 50 0 49916 6652 23024 434392 0 0 0 0 192 1210 0 56 44
to this machine, ack, which is doing even more work (note the higher cs [context switches] column):
ack:~# vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 43384 9580 26676 905376 0 0 5 3 2 2 1 7 7
0 0 0 43384 9564 26676 905380 0 4 4 204 462 2313 0 3 97
0 0 0 43384 9564 26676 905380 0 0 0 0 351 2311 0 2 98
0 0 0 43384 9568 26676 905400 0 0 12 4 376 2384 0 3 97
0 0 0 43384 9564 26676 905400 0 0 0 196 396 2238 0 3 97
0 0 0 43384 9568 26676 905400 0 0 0 0 336 2297 0 3 97
0 0 0 43384 9568 26676 905400 0 0 0 0 310 2267 0 1 98
0 0 0 43384 9568 26676 905400 0 0 0 4 342 2284 0 1 99
0 0 0 43384 9556 26676 905416 0 0 8 0 350 2309 0 4 96
I'd expect the first machine, seq, to be doing more work, since its CPU idle time is lower. Yet it's actually doing about half as much as ack, the machine sitting at ~97% idle.
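(The 50 parked in vmstat's b column is those blocked CRONDs. A quick way to enumerate everything stuck in D state, assuming procps ps:)

# list pid, state, wait channel, and name for uninterruptible processes
seq:~# ps axo pid,stat,wchan:25,comm | awk '$2 ~ /^D/'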
Here's what vmware is doing on seq:
seq:~# strace -c -T -p 2003
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.57    1.878046        4278       439           select
  0.18    0.003455           8       459           gettimeofday
  0.11    0.002108         211        10           write
  0.05    0.000893           8       112           ioctl
  0.03    0.000637           4       145           getpid
  0.01    0.000239          48         5           unlink
  0.01    0.000209          42         5           mknod
  0.01    0.000186          19        10           open
  0.01    0.000175          18        10           close
  0.01    0.000113          23         5           stat64
  0.00    0.000052           5        10           kill
------ ----------- ----------- --------- --------- ----------------
100.00    1.886113                  1210           total
Versus ack (approximately the same amount of capture time):
ack:~# strace -T -c -p 2158
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.06    4.555363        7431       613           select
  0.46    0.021101         237        89           ioctl
  0.23    0.010421         197        53           write
  0.19    0.008532          14       610           gettimeofday
  0.02    0.001068           7       146           getpid
  0.01    0.000334         334         1           statfs
  0.01    0.000319           8        38         3 read
  0.01    0.000306          51         6           unlink
  0.01    0.000267          45         6           mknod
  0.01    0.000265          22        12           close
  0.01    0.000264          22        12           open
  0.00    0.000144          29         5           stat64
  0.00    0.000052           7         8           kill
------ ----------- ----------- --------- --------- ----------------
100.00    4.598436                  1599         3 total
In both traces nearly all the time is spent sleeping in select(), which suggests VMware isn't really spinning on either box. I dunno... any ideas? I can't strace the stuck CRONDs either; they're blocked too hard to even see what call they're blocking on.
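One more thing I may try, assuming this kernel was built with magic SysRq support: a SysRq-T dump writes every task's state and kernel stack to the log, which should show exactly where the D-state CRONDs are sleeping (grepping for crond on the assumption the kernel comm name is still crond, since the children never reach exec):

# enable sysrq, trigger the task dump, then read it back
seq:~# echo 1 > /proc/sys/kernel/sysrq
seq:~# echo t > /proc/sysrq-trigger
seq:~# dmesg | grep -i -A 4 crond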
Thanks,
Cove