Lon Hohberger wrote:
> On Wed, 2007-01-03 at 12:35 +0100, Marco Lusini wrote:
>> Hi all,
>>
>> I have 3 2-node clusters, running just cluster suite, without gfs,
>> each one updated with the latest packages released by RHN.
>>
>> In each cluster one of the two nodes has a steadily growing system CPU
>> usage, which seems to be consumed by clurgmgrd and dlm_recvd.
>> As an example, here is the running time accumulated on one cluster
>> since 20 December, when it was rebooted:
>>
>> [root@estestest ~]# ps axo pid,start,time,args
>>   PID STARTED     TIME COMMAND
>> ...
>> 10221  Dec 20 10:37:05 clurgmgrd
>> 11169  Dec 20 06:48:24 [dlm_recvd]
>> ...
>>
>> [root@frascati ~]# ps axo pid,start,time,args
>>   PID STARTED     TIME COMMAND
>> ...
>>  6226  Dec 20 00:04:17 clurgmgrd
>>  8249  Dec 20 00:00:19 [dlm_recvd]
>> ...

I suspect these two being at the top are related: if clurgmgrd is taking
out locks, then dlm_recvd will also be busy.

>> I attach two graphs made with RRD which show that the system CPU usage
>> is steadily growing: note how the trend changed after the reboot on
>> 20 December.
>>
>> Of course, as the system usage increases so does the system load, and
>> I am afraid of what will happen after 1-2 months of uptime...
>
> System load averages are the average number of processes on the run
> queue over the past 1, 5, and 15 minutes. They don't generally trend
> upwards over time; if that were the case, I'd be in trouble:
>
> ...
> 28204 15:11:11 01:04:19 /usr/lib/firefox-1.5.0.9/firefox-bin -UILocale en-US
> ...
>
> However, it is a little odd that you had 10 hours of runtime for
> clurgmgrd and over 6 for dlm_recvd. Just taking a wild guess, but it
> looks like the locks were all mastered on frascati.
>
> How many services are you running?
>
> Also, take a look at:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212634
>
> The RPMs there might solve the problem with dlm_recvd. Rgmanager in
> some situations causes a strange leak of NL locks in the DLM. If
> dlm_recvd has to traverse lock lists and that list is ever-growing
> (total speculation here), it could explain the amount of consumed
> system time.

Yes, the DLM will do a lot of traversing of lock lists if there are a
lot of similar locks on one resource. VMS has an optimisation for this,
known as the group grant and conversion grant modes, that we don't
currently implement.

> How can I get more info on this? I checked /proc/cluster/dlm_locks
> on both nodes and it is empty.

/proc/cluster/dlm_locks needs to be told which lockspace to use, so just
catting that file after bootup will show nothing. What you need to do is
echo the lockspace name into that file, then look at it. You can get the
lockspace names with the "cman_tool services" command, e.g.:

# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2]

# echo "clvmd" > /proc/cluster/dlm_locks
# cat /proc/cluster/dlm_locks

This shows the locks held by clvmd. If you want to look at another
lockspace, just echo the other name into the /proc file.

-- 
patrick

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
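
To see whether the leaked NL locks described in bug 212634 are actually
what dlm_recvd is grinding through, one rough approach is to sample the
size of rgmanager's lock dump over a few hours and watch whether it keeps
growing. The sketch below is only an illustration: it assumes the
rgmanager lockspace is called "Magma" (check the real name with
"cman_tool services" on your nodes), and it uses the raw line count of
/proc/cluster/dlm_locks as a crude proxy for the number of locks rather
than parsing the dump format.

    #!/bin/sh
    # Periodically dump a DLM lockspace and record how many lines the
    # dump contains. A count that keeps climbing over several hours
    # would be consistent with the NL lock leak speculated about above.
    LOCKSPACE=${1:-Magma}    # lockspace name is a guess; see cman_tool services
    while true; do
        echo "$LOCKSPACE" > /proc/cluster/dlm_locks
        echo "`date` $LOCKSPACE `wc -l < /proc/cluster/dlm_locks` lines"
        sleep 300            # sample every 5 minutes
    done

Running something like this on the node where dlm_recvd is busy, before
and after installing the test RPMs from the bugzilla entry, should make
it fairly obvious whether the fix stops the growth.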