Re: High system CPU usage in one of a two node cluster

Lon Hohberger <lhh@xxxxxxxxxx> · Wed, 03 Jan 2007 11:39:31 -0500

On Wed, 2007-01-03 at 12:35 +0100, Marco Lusini wrote:
> Hi all,
>  
> I have 3 2-node clusters, running just cluster suite, without gfs,
> each one updated with the latest
> packages released by RHN.
>  
> In each cluster one of the two nodes has a steadily growing system CPU
> usage, which seems 
> to be consumed by clurgmgrd and dlm_recvd.
> As an example, here is the running time accumulated on one cluster
> since 20 December, when it was rebooted:
>  
> [root@estestest ~]# ps axo pid,start,time,args
>   PID  STARTED     TIME COMMAND
> ...
> 10221   Dec 20 10:37:05 clurgmgrd
> 11169   Dec 20 06:48:24 [dlm_recvd]
> ...
>  
> [root@frascati ~]# ps axo pid,start,time,args
>   PID  STARTED     TIME COMMAND
> ...
>  6226   Dec 20 00:04:17 clurgmgrd
>  8249   Dec 20 00:00:19 [dlm_recvd]
> ...
>  
> I attach two graphs made with RRD which show that the system CPU usage
> is steadily growing:
> note how the trend changed after the reboot on 20 December.

> Of course as the system usage increases so does the system load and I
> am afraid of what will
> happen after 1-2 months of uptime...

System load averages are averages of the number of processes on the
run queue over the past 1, 5, and 15 minutes.  They don't generally
trend upwards over time; if they did, I'd be in trouble:

...
28204 15:11:11 01:04:19 /usr/lib/firefox-1.5.0.9/firefox-bin -UILocale en-US
...
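
To make the distinction concrete: the TIME column keeps accumulating,
while a load average only follows the recent run-queue length.  Here is
a rough Python sketch of an exponentially damped average.  The 5-second
sampling interval and the name simulate_load are assumptions for
illustration; this is not the kernel's fixed-point code.

```python
import math

# Assumption: the run queue is sampled roughly every 5 seconds.
SAMPLE_INTERVAL = 5.0


def simulate_load(runnable_counts, window_seconds=60.0, start=0.0):
    """Return an exponentially damped average of run-queue samples.

    runnable_counts holds one run-queue length per sample interval.
    window_seconds is 60, 300 or 900 for the 1, 5 and 15 minute figures.
    """
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    load = start
    for n in runnable_counts:
        # Old history fades by `decay`; the new sample fills the rest.
        load = load * decay + n * (1.0 - decay)
    return load


# One process runnable the whole time, for an hour and then for a day.
# The average settles near 1 and stays there, however long it runs.
samples_per_hour = int(3600 / SAMPLE_INTERVAL)
print(simulate_load([1] * samples_per_hour))
print(simulate_load([1] * samples_per_hour * 24))
```

A process that has burned an hour of CPU shows up as a large TIME
value, but it contributes at most 1 to the load average while it runs.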

However, it is a little odd that you had 10 hours of runtime for
clurgmgrd and over 6 for dlm_recvd.  Just taking a wild guess, but it
looks like the locks were all mastered on frascati.
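
For comparing the two nodes, a small Python sketch that converts the ps
TIME column to seconds may help.  The helper name cpu_time_seconds is
made up for this example, and it assumes the usual [DD-]HH:MM:SS layout;
the sample values are the ones from the ps output quoted above.

```python
def cpu_time_seconds(time_field):
    """Convert a ps TIME value such as '10:37:05' or '1-02:03:04' to seconds."""
    days = 0
    if "-" in time_field:
        # A leading 'DD-' appears once a process passes 24 hours of CPU time.
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in time_field.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # tolerate a bare MM:SS value
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# TIME values copied from the quoted ps output.
observed = {
    ("estestest", "clurgmgrd"): "10:37:05",
    ("estestest", "dlm_recvd"): "06:48:24",
    ("frascati", "clurgmgrd"): "00:04:17",
    ("frascati", "dlm_recvd"): "00:00:19",
}

for (node, process), value in observed.items():
    print(node, process, cpu_time_seconds(value), "seconds")
```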

How many services are you running?

Also, take a look at:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212634

The RPMs there might solve the problem with dlm_recvd.  Rgmanager in
some situations causes a strange leak of NL locks in the DLM.  If
dlm_recvd has to traverse a lock list and that list is ever-growing
(total speculation here), it could explain the amount of consumed system
time.
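
To show why an ever-growing list would produce this kind of trend, here
is a toy Python model.  It is not DLM code: the function name and the
assumption that every operation leaks a fixed number of locks and walks
the whole list once are hypothetical, in line with the speculation
above.

```python
def cumulative_traversal_cost(operations, leaked_per_operation=1):
    """Count list entries visited when each operation walks a growing list.

    Toy model only: every operation traverses the full list once and then
    leaves `leaked_per_operation` extra entries behind.
    """
    list_length = 0
    total_visits = 0
    for _ in range(operations):
        total_visits += list_length          # one full walk of the list
        list_length += leaked_per_operation  # the leak makes the next walk longer
    return total_visits


# Doubling the number of operations roughly quadruples the total work,
# so the CPU cost per unit of uptime keeps rising.
for n in (1000, 2000, 4000):
    print(n, cumulative_traversal_cost(n))

# With no leak the list stays empty and the cost stays flat.
print(cumulative_traversal_cost(4000, leaked_per_operation=0))
```

If something like this is going on, the per-day system time would grow
with uptime and drop back after a reboot, which would fit the graphs
described.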

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster