EMBEDDED RESPONSE

> -----Original Message-----
> From: redhat-list-admin@xxxxxxxxxx
> [mailto:redhat-list-admin@xxxxxxxxxx] On Behalf Of Cove Schneider
> Sent: Friday, October 24, 2003 3:24 PM
> To: redhat-list@xxxxxxxxxx
> Subject: Re: Many instances of CROND running, load raised (RH9)
>
>
> Hi Kenneth,
>
> Thanks for the response.
>
> On Friday, October 24, 2003, at 07:08 AM, Kenneth Goodwin wrote:
>
> > The CROND's without a PPID of 1 are the children of the
> > master CROND process.
> > They are trying to launch a command for you. Determine what
> > that command is first.
> > There is no info herein that defines that.
>
> They all have PPID's of 1, oddly enough. Good point.

Well, that is not right. That probably means either your system has a
major issue here or, more likely, that they have already died, are
orphaned from their parent CROND, and have been inherited by init
(normal for this situation) -- and init has not reaped the dead children
yet, probably because VMware is sucking up all the CPU cycles.

> > 1) check the log file for CRON. See if you can determine
> > what process is being launched here.
>
> The only thing listed in the cron log is the system accounting shell
> script. When I run it manually it executes very fast, but with a high
> system time:
>
> seq:~# time /usr/lib/sa/sa1 1 1
> /usr/lib/sa/sa1 1 1  0.00s user 0.02s system 173% cpu 0.012 total

Is VMware running full throttle while you are doing this? I think it is
the core of your issue. What happens to the sa1 times if you shut down
VMware and then run it? What's the difference in elapsed wall-clock run
time between sa1 on the affected system and sa1 on the baseline system?
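Something like this would give you a clean before/after comparison (I am
assuming the VMware init script is /etc/init.d/vmware here -- adjust the
name to whatever your install actually uses):

seq:~# time /usr/lib/sa/sa1 1 1      # baseline, VMware running
seq:~# /etc/init.d/vmware stop       # script path is an assumption
seq:~# time /usr/lib/sa/sa1 1 1      # repeat with VMware down

If the system time and that 173% CPU figure drop back to something sane
on the second run, VMware is your culprit.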
> compared to another machine:
>
> ack:~# time /usr/lib/sa/sa1 1 1
> /usr/lib/sa/sa1 1 1  0.00s user 0.00s system 0% cpu 0.007 total
>
> Oct 21 07:01:00 seq CROND[1533]: (root) CMD (run-parts /etc/cron.hourly)
> Oct 21 07:10:00 seq CROND[1545]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:20:00 seq CROND[1561]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:30:00 seq CROND[1573]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:40:00 seq CROND[1586]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 07:50:00 seq CROND[1603]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:00:00 seq CROND[1615]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:01:00 seq CROND[1618]: (root) CMD (run-parts /etc/cron.hourly)
> Oct 21 08:10:00 seq CROND[1629]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:20:00 seq CROND[1644]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:30:00 seq CROND[1656]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:40:00 seq CROND[1668]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 08:50:00 seq CROND[1685]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Oct 21 09:00:00 seq CROND[1697]: (root) CMD (/usr/lib/sa/sa1 1 1)
>
> > 2) don't think the problem really relates to the MTU change, but a
> > 9000 MTU seems kinda high for gigabit. I could be wrong.
> > I would have to check, but I vaguely recall a recommended max MTU
> > on my Antares GE cards of around 4KB.
> > The change probably triggered an existing bug in one of the
> > programs cron is spawning for you. You might want to recheck
> > that setting for correctness. MTU's are usually in multiples of 1K
> > less the ethernet MAC packet header size, as I recall. Just make
> > sure you are not setting above the maximum limit allowed.
>
> That's what I was thinking, maybe it triggered a bug or a legacy issue
> in something.
>
> The 9000 MTU is for Jumbo Frames, which I believe are 9000 bytes and
> up. I was able to verify that it worked on the machine using ping and
> doing packet captures (tg3 driver w/ bcom cards).
>
> See:
> http://sd.wareonearth.com/~phil/jumbo.html
> http://www.uoregon.edu/~joe/jumbo-clean-gear.html
>
> > 3) a ps -ef or equivalent will show you the children and parents.
> > Do a ps -ef | grep [a crond child pid] and see if there is a
> > child process hanging off the CROND sub-child process,
> > using some sample PIDs from your ps below:
> > master CROND (PID = 2415 (example), PPID = 1) -forks->
> > child CROND service process (PID = 2426 from your list, PPID = 2415).
> > So what process, if any, is 2426 forking off here?
>
> It just forks another crond, I'm guessing it's forking so it can then
> call exec(2), but then exec(2) is hanging for some reason. I don't know
> why init is calling crond though.

Init starts up all the subprocesses in the system via the RC scripts. It
is the master parent process and should be the PPID of the initially
launched CROND daemon only. It also becomes the PPID of children that are
orphaned by their parents. The chain looks like this:

    init -> CROND -> CRONDSUB -> YOURCRONTASK

If the CRONDSUB process does a fork/exec/wait cycle on YOURCRONTASK
instead of a straight exec() call to replace itself with your process --
which is probably the case, since the CRONDSUB probably handles the
logging to the cron log file as well -- then if the CRONDSUB process dies
before YOURCRONTASK does, the new PPID of YOURCRONTASK becomes init. I
believe the master CROND forks its CRONDSUB children as separate process
groups to protect itself against stray signalling to the process group
that might kill it off.

The crond-forking-crond pairs are weird. I suspect the second one (2416)
off of the first (2415) is the actual entry from your crontab being
launched, and based on your statements it should actually be showing up
as sa1 at this point. You could shut down the calls to sa1 and the rest
of the accounting package in the cron files, resubmit the root cron
table, and see if the problem follows sa1 and goes away. If it does, you
will need to tear apart the SA package to see what is wrong -- maybe a
corrupted data or control file, or SA can't handle accounting records for
TCP/IP with super large frame sizes (a code oversight), if it is
monitoring that subsystem as well as the disks.
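For example -- assuming the sysstat package put its entries in
/etc/cron.d/sysstat, as RH9 normally does (they may be in root's crontab
on your box instead, so locate them first):

seq:~# grep -rl sa1 /etc/cron.d /etc/crontab /var/spool/cron
seq:~# vi /etc/cron.d/sysstat     # put a '#' in front of the sa1/sa2 lines
seq:~# service crond restart      # make sure crond picks up the change

If the stuck CRONDs stop piling up after that, the problem follows sa1.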
> Here's the tree:
>
> # pstree -p
> init(1)-+-aacraid(19)
>         |-atd(1872)
>         |-bdflush(11)
>         |-crond(2415)---crond(2416)
>         |-crond(2426)---crond(2427)
>         |-crond(2442)---crond(2443)
>         |-crond(2453)---crond(2454)
>         |-crond(2455)---crond(2456)
>         |-crond(2466)---crond(2467)
>         |-crond(2480)---crond(2481)
>         |-crond(2491)---crond(2492)
>         |-crond(2502)---crond(2503)
>         |-crond(2515)---crond(2516)
>         |-crond(2529)---crond(2530)
>         |-crond(2531)---crond(2532)
>         |-crond(2542)---crond(2543)
>         |-crond(2553)---crond(2554)
>         |-crond(2567)---crond(2568)
>         |-crond(2578)---crond(2579)
>         |-crond(2591)---crond(2592)
>         |-crond(2606)---crond(2607)
>         |-crond(2608)---crond(2609)
>         |-crond(2619)---crond(2620)
>         |-crond(2630)---crond(2631)
>         |-crond(2644)---crond(2645)
>         |-crond(2655)---crond(2656)
>         |-crond(2668)---crond(2669)
>         |-crond(2682)---crond(2683)
>         |-crond(2684)---crond(2685)
>         |-crond(2695)---crond(2696)
>         |-crond(2706)---crond(2707)
>         |-crond(2720)---crond(2721)
>         |-crond(2731)---crond(2732)
>         |-crond(2744)---crond(2745)
>         |-crond(2758)---crond(2759)
>         |-crond(2760)---crond(2761)
>         |-crond(2771)---crond(2772)
>         |-crond(2782)---crond(2783)
>         |-crond(2796)---crond(2797)
>         |-crond(2818)---crond(2819)
>         |-crond(2832)---crond(2833)
>         |-crond(2834)---crond(2835)
>         |-crond(2845)---crond(2846)
>         |-crond(2856)---crond(2857)
>         |-crond(2870)---crond(2871)
>         |-crond(2881)---crond(2882)
>         |-crond(2892)---crond(2893)
>         |-crond(2907)---crond(2908)
>         |-crond(2909)---crond(2910)
>         |-crond(2920)---crond(2921)
>         |-crond(2931)---crond(2932)
>         |-crond(2943)---crond(2944)
>         |-crond(2807)---crond(2808)
>         |-crond(9823)
> ...
>
> # ps axj
>  PPID   PID  PGID   SID TTY TPGID STAT   UID   TIME COMMAND
>     0     1     0     0 ?      -1 S        0   0:17 init
> ...
>     1  2415  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2426  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2442  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2453  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2455  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2466  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2480  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2491  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2502  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2515  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2529  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2531  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2542  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2553  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2567  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2578  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2591  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2606  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2608  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2619  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2630  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2644  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2655  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2668  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2682  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2684  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2695  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2706  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2720  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2731  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2744  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2758  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2760  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2771  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2782  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2796  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2818  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2832  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2834  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2845  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2856  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2870  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2881  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2892  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2907  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2909  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2920  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2931  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2943  1547  1547 ?      -1 D        0   0:00 CROND
>     1  2807  1547  1547 ?      -1 D        0   0:00 CROND
>     1  9823  9823  9823 ?      -1 S        0   0:00 crond
>
> This last entry here is the "real" crond. I've restarted it a few
> times, that's why the PID is so high.
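Something like this would show the start time and the kernel wait channel
for each of those D-state CRONDs (output format keys per the procps ps(1)
man page -- adjust if your ps differs):

seq:~# ps axo pid,ppid,stat,start,wchan:25,comm | grep -i crond

The STARTED column should tell you whether they line up with the cron
launches in the log, and WCHAN names the kernel resource each one is
sleeping on.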
> > 4) Your real problem seems to be that VMware PID 2003 is absorbing
> > 99% of the CPU doing "something". I don't know from your info here
> > if this is a CRON-launched process or if this is a separate process
> > launched by init. But this is probably the real issue. It is probably
> > stopping the CROND from getting enough kernel time to launch. But I
> > can't tell what is really going on here from your info. Check into
> > the VMware behaviour - log files? I don't know about VMware so can't
> > help you there.
>
> My feeling is that the reported CPU usage for VMware is being skewed by
> the blocking CRONDs, as well as the load ave.

No, I think VMware is the issue. The CRONDs seem to be dead children -- I
don't know why they are spawning like this, and your list does not show
start times -- but it could be that they are the normally launched
10-minute-interval sa1's, all hanging asleep in some deadlock state on a
resource VMware has locked. I would focus on why VMware is misbehaving
here; it's eating the CPU alive.

> If you compare the machine, named seq, with the 50.0 load ave.:
>
> seq:~# vmstat 1
> procs                      memory    swap          io     system        cpu
>  r  b  w   swpd   free   buff  cache  si  so  bi  bo   in    cs  us  sy  id
>  1 50  0  49916   6652  23020 434396   0   0  13  10   11     6   2   2   8
>  0 50  0  49916   6652  23024 434392   0   0   0  76  215  1288   2  61  37
>  0 50  0  49916   6652  23024 434392   0   0   0   0  187  1190   0  60  40
>  0 50  0  49916   6652  23024 434392   0   0   0   0  189  1228   1  56  42
>  0 50  0  49916   6652  23024 434392   0   0   0   0  182  1246   1  54  45
>  0 50  0  49916   6652  23024 434392   0   0   0   0  191  1281   2  56  42
>  0 50  0  49916   6652  23024 434392   0   0   0  48  191  1264   1  52  46
>  0 50  0  49916   6652  23024 434392   0   0   0   0  191  1241   1  54  45
>  0 50  0  49916   6652  23024 434392   0   0   0   0  192  1210   0  56  44
>
> To this machine, named ack, which is doing even more work (note the
> higher cs [context switches]):

That is processes going in and out of the run state, basically. It could
be VMware in a high-speed loop, blocking on something for a microsecond
and then waking up on an alarm timeout. As I recall, one of the functions
of select() is to provide fine-granularity microsecond timers and alarms,
as well as to await events on TCP/IP sockets.
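You can watch that from the live trace. Something like this (standard
strace flags; -tt prints microsecond timestamps) will show each
individual select() call VMware makes, including the timeout value it
passes in:

seq:~# strace -tt -e trace=select -p 2003

A steady stream of select() calls with zero or tiny timeouts would
confirm the spin-loop theory; long timeouts that return immediately would
point at a socket or device that is constantly ready.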
> ack:~# vmstat 1
> procs                      memory    swap          io     system        cpu
>  r  b  w   swpd   free   buff  cache  si  so  bi  bo   in    cs  us  sy  id
>  0  0  0  43384   9580  26676 905376   0   0   5   3    2     2   1   7   7
>  0  0  0  43384   9564  26676 905380   0   4   4 204  462  2313   0   3  97
>  0  0  0  43384   9564  26676 905380   0   0   0   0  351  2311   0   2  98
>  0  0  0  43384   9568  26676 905400   0   0  12   4  376  2384   0   3  97
>  0  0  0  43384   9564  26676 905400   0   0   0 196  396  2238   0   3  97
>  0  0  0  43384   9568  26676 905400   0   0   0   0  336  2297   0   3  97
>  0  0  0  43384   9568  26676 905400   0   0   0   0  310  2267   0   1  98
>  0  0  0  43384   9568  26676 905400   0   0   0   4  342  2284   0   1  99
>  0  0  0  43384   9556  26676 905416   0   0   8   0  350  2309   0   4  96
>
> I'd expect to find that the first machine here, seq, would be doing
> more work since its CPU idle time is lower. Though it's actually doing
> about half as much as the machine with the 90% idle time.
>
> Here's what vmware is doing on seq:
>
> seq:~# strace -c -T -p 2003
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  99.57    1.878046        4278       439           select
>   0.18    0.003455           8       459           gettimeofday
>   0.11    0.002108         211        10           write
>   0.05    0.000893           8       112           ioctl
>   0.03    0.000637           4       145           getpid
>   0.01    0.000239          48         5           unlink
>   0.01    0.000209          42         5           mknod
>   0.01    0.000186          19        10           open
>   0.01    0.000175          18        10           close
>   0.01    0.000113          23         5           stat64
>   0.00    0.000052           5        10           kill
> ------ ----------- ----------- --------- --------- ------------
> 100.00    1.886113                  1210           total

This is either a spin loop on a TCP/IP resource, or VMware is trying to
get access to a resource and is using the select() microtimer interface
to abort the call if it blocks for more than N microseconds or so. Do
VMware and SA have any system resource in common? I am not familiar with
how VMware operates or what it does, but it seems like something is out
of whack here that both SA and VMware need access to: VMware is spinning
its wheels and sucking CPU trying to gain access to that resource, while
SA is blocking on it, not dying in the approved timeframe, and then being
joined a few minutes later by the next instance of SA launched by CROND
-- so you eventually collect a sleeping crowd of SA's in addition to the
VMware issue. Maybe an on-disk lock file left behind that an rc.d script
should be removing? I can't guess; I think you need to find out the
actual resources that VMware and SA are blocking on. The WCHAN column in
ps, which is a kernel virtual memory address of the resource being waited
on, compared against a kernel symbol-table name-to-address map, may
provide the needed clue to what resource is hanging these two. I think it
is the nm command that dumps symbol tables, though I am not sure that is
correct for Linux; there is also a file under /boot (System.map) that
contains a map of addresses to kernel symbol names.

> Versus, ack (approx. the same amount of capture time):
>
> ack:~# strace -T -c -p 2158
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ------------
>  99.06    4.555363        7431       613           select
>   0.46    0.021101         237        89           ioctl
>   0.23    0.010421         197        53           write
>   0.19    0.008532          14       610           gettimeofday
>   0.02    0.001068           7       146           getpid
>   0.01    0.000334         334         1           statfs
>   0.01    0.000319           8        38         3 read
>   0.01    0.000306          51         6           unlink
>   0.01    0.000267          45         6           mknod
>   0.01    0.000265          22        12           close
>   0.01    0.000264          22        12           open
>   0.00    0.000144          29         5           stat64
>   0.00    0.000052           7         8           kill
> ------ ----------- ----------- --------- --------- ------------
> 100.00    4.598436                  1599         3 total

See man 2 select (or info select) to verify the microtimer feature.

> I dunno... Any ideas? I can't do a strace on the CROND either, it's
> blocking too hard to even see what call it's blocking on.
>
> Thanks,
>
> Cove
>
>
> --
> redhat-list mailing list
> unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
> https://www.redhat.com/mailman/listinfo/redhat-list
>

--
redhat-list mailing list
unsubscribe mailto:redhat-list-request@xxxxxxxxxx?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/redhat-list