100% CPU load of dlm_controld

Julian Pawlowski <julian.pawlowski@xxxxxxxxx> · Thu, 14 Feb 2013 11:03:04 +0100

Hello,
I am currently investigating an issue with dlm_controld.

After we made some performance improvements, the CPU load of dlm_controld rises to nearly 100% on all 3 nodes and the locking rate drops from 45,000/s to 3/s.

I have a feeling this has something to do with plock_rate_limit, which we disabled in cluster.conf with:

        <dlm plock_ownership="1" plock_rate_limit="0"/>

        <gfs_controld plock_rate_limit="0" />

We are still on RHEL 6.2, and I'm not sure whether there are major improvements in dlm_controld in RHEL 6.3 (judging by the GitHub repo of dlm, there seem to be quite a few improvements in general, e.g. in fencing).

Does anybody have a suggestion for what we could test?

Here are some specs of the systems:

- 3 nodes running RHEL 6.2
- 128 GB RAM
- 64 cores
- FCoE SAN
- 3 NICs: 1x SAN, 1x LAN, 1x Cluster LAN
- mainly running SAS and related jobs
- fencing enabled with fence_ipmilan

Other performance related settings:
- tuned-adm profile enterprise-storage
- echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- blockdev --setra 1024 (for each FC block device)

- vm.dirty_background_ratio = 0
- vm.vfs_cache_pressure = 0
- vm.swappiness = 45
- vm.min_free_kbytes = 1976531
- echo 16384 > /sys/kernel/config/dlm/cluster/lkbtbl_size (set before GFS2 mount)
- echo 16384 > /sys/kernel/config/dlm/cluster/rsbtbl_size (set before GFS2 mount)
- echo 16384 > /sys/kernel/config/dlm/cluster/dirtbl_size (set before GFS2 mount)
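
A minimal sketch of how the three dlm table sizes above could be applied in one step before the GFS2 mount. The function name set_dlm_table_sizes and its optional directory argument are placeholders for illustration and not part of any cluster tool; the default path is the one used above.

```shell
# Default configfs directory, as used in the echo commands above.
DLM_CFG_DIR=${DLM_CFG_DIR:-/sys/kernel/config/dlm/cluster}

# Hypothetical helper: write one size to all three dlm hash table settings
# and print back what each file contains afterwards.
set_dlm_table_sizes() {
    size=$1
    dir=${2:-$DLM_CFG_DIR}

    # The directory has to exist already; refuse to continue otherwise.
    if [ ! -d "$dir" ]; then
        echo "missing $dir (dlm configfs directory not set up yet?)" >&2
        return 1
    fi

    for name in lkbtbl_size rsbtbl_size dirtbl_size; do
        echo "$size" > "$dir/$name" || return 1
        printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
    done
}
```

Running set_dlm_table_sizes 16384 as root before mounting would print the value read back from each file, which makes it easy to see whether a setting was silently ignored.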

With these settings we get quite good performance at the beginning, but dlm_controld gets stuck after about half an hour.
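
To pin down when the daemon gets stuck, its CPU use could be sampled from /proc at fixed intervals. This is only a sketch: cpu_ticks and sample_cpu are hypothetical helper names, and the default interval and sample count are arbitrary choices.

```shell
# Print utime + stime (in clock ticks) from a /proc/<pid>/stat style file.
cpu_ticks() {
    # Strip everything up to the closing ")" of the comm field, so that
    # utime and stime become fields 12 and 13 of what remains.
    sed 's/^.*) //' "$1" | awk '{ print $12 + $13 }'
}

# Print one timestamped line per interval with the share of a single core
# that the given PID used during that interval.
sample_cpu() {
    pid=$1
    interval=${2:-10}   # seconds between samples (whole number)
    count=${3:-6}       # number of samples to take
    stat="/proc/$pid/stat"

    [ -r "$stat" ] || { echo "cannot read $stat" >&2; return 1; }
    hz=$(getconf CLK_TCK)
    prev=$(cpu_ticks "$stat")

    i=0
    while [ "$i" -lt "$count" ]; do
        sleep "$interval"
        [ -r "$stat" ] || return 1
        cur=$(cpu_ticks "$stat")
        pct=$(( (cur - prev) * 100 / (hz * interval) ))
        echo "$(date '+%H:%M:%S') ${pct}%"
        prev=$cur
        i=$((i + 1))
    done
}
```

For example, sample_cpu "$(pidof dlm_controld)" 10 360 >> /tmp/dlm_controld_cpu.log would record one hour of samples, which can then be lined up with the time at which the lock rate collapses.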

I thought about setting plock_rate_limit=500 or something similar. Do you think that would be a better setting than leaving it unlimited?
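
A minimal sketch of how a test configuration with a different limit could be prepared from the existing cluster.conf. The helper name set_plock_rate_limit is hypothetical, and it only rewrites the plock_rate_limit attributes in a copy of the file; nothing is activated.

```shell
# Hypothetical helper: copy a cluster.conf and set every plock_rate_limit
# attribute in the copy to the given value.
# Reminder: config_version in the <cluster> element still has to be
# incremented by hand, and presumably the daemons need a restart to pick
# up the new value (an assumption, not verified here).
set_plock_rate_limit() {
    limit=$1
    src=$2
    dst=$3

    # Accept only a non-negative integer as the limit.
    case $limit in
        ''|*[!0-9]*)
            echo "limit must be a non-negative integer" >&2
            return 1
            ;;
    esac

    sed -E "s/plock_rate_limit=\"[0-9]+\"/plock_rate_limit=\"${limit}\"/g" \
        "$src" > "$dst"
}
```

Calling set_plock_rate_limit 500 /etc/cluster/cluster.conf /tmp/cluster.conf.test leaves the original untouched, so the result can be inspected with diff before anything is rolled out to the nodes.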

Cheers,
Julian

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster