Chris Feist wrote:
> Yes, issue #2 could definitely be the cause of your first issue.
> Unfortunately you'll need to bring down your cluster to change the value
> of lt_high_locks. What is its value currently? And how much memory do
> you have on your gulm lock servers? You'll need about 256MB of RAM for
> gulm for every 1 million locks (plus enough for the kernel and any
> other processes).
>
> On each of the gulm clients you can also cat /proc/gulm/lockspace to
> see which client is using most of the locks.
Thanks for the response! I figured I would probably have to bring down
the cluster to change the highwater setting, but I was holding out some
hope that it could be changed dynamically. Oh well.
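In the meantime, I can at least act on the /proc/gulm/lockspace
suggestion and watch the per-client counts. Here's a rough sketch of
what I have in mind (node1..node3 are placeholders for my actual client
hostnames, and it assumes passwordless ssh to each):

#!/bin/sh
# Pull the gulm lock counts from each client to see which one is
# holding most of the locks.  node1..node3 are placeholder names.
for host in node1 node2 node3; do
    echo "== $host =="
    ssh "$host" 'grep -E "total:|exl:|shd:|lvbs:" /proc/gulm/lockspace'
done
# Memory math from above, for reference: at ~256MB per million locks,
# even the default highwater of ~1.04M locks costs gulm only about
# 270MB, so the 4GB on each lock server is plenty overall.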
The value is currently at the default, which I want to say is something
like 1.04M. These machines are both lock servers and Samba/NFS servers,
and all three lock servers in the cluster have 4GB of RAM. A previous
Red Hat service call has me running the hugemem kernel on all three (the
issue there was that, under even light load, lowmem would be exhausted
and the machines would enter an OOM spiral of death).

Now that I have turned off hyperthreading, though, memory usage seems to
be dramatically lower than it was before that change. For instance, the
machine running Samba services has been up since I turned off
hyperthreading on Friday night, and today it was under some pretty heavy
load. On a normal day, prior to the hyperthreading change, I'd be down
to maybe 500MB of lowmem free by now (out of 3GB), and the only way to
completely reclaim that memory was to reboot. Instead, I'm sitting here
looking at this machine, and it has 3.02GB of 3.31GB of lowmem free.
I'll have to let this run for a while to determine whether this is a red
herring, but it looks much better than it ever has in the past.
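To keep an eye on it while it runs, I'm logging lowmem every few minutes
with a loop along these lines (the log path is arbitrary, and
LowTotal/LowFree are what /proc/meminfo calls the lowmem fields on these
kernels, as best I can tell):

#!/bin/sh
# Snapshot lowmem every 5 minutes so I can tell whether the
# hyperthreading change really stopped the slow lowmem bleed.
while true; do
    date >> /var/tmp/lowmem.log
    grep -E 'LowTotal|LowFree' /proc/meminfo >> /var/tmp/lowmem.log
    sleep 300
done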
Here's the interesting output from the /proc/gulm gadgets (note that, at
the time I grabbed these, I was seeing the "more than the max" message
logged to syslog once or twice a minute, but not at the 10-second rate
that I read about previously):
[root@xxxxx root]# cat /proc/gulm/filesystems/data0
Filesystem: data0
JID: 0
handler_queue_cur: 0
handler_queue_max: 26584
[root@xxxxx root]# cat /proc/gulm/filesystems/data1
Filesystem: data1
JID: 0
handler_queue_cur: 0
handler_queue_max: 4583
[root@xxxxx root]# cat /proc/gulm/filesystems/data2
Filesystem: data2
JID: 0
handler_queue_cur: 0
handler_queue_max: 11738
[root@xxxxx root]# cat /proc/gulm/lockspace
lock counts:
total: 41351
unl: 29215
exl: 3
shd: 12055
dfr: 0
pending: 0
lvbs: 16758
lops: 12597867
[root@xxxxx root]#
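For what it's worth, here's roughly how I'm gauging that syslog message
rate (the grep pattern is just the distinctive fragment of the warning
quoted above; adjust it to match the full message text):

# Count the highwater warnings so far, then eyeball the timestamps on
# the most recent few to estimate the rate.
grep -c "more than the max" /var/log/messages
grep "more than the max" /var/log/messages | tail -5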