We currently have the following RHCS cluster in operation:
- 3 nodes, each with Xeon CPUs and 12 GB RAM
- 100 Mbit network between the cluster nodes
- Dell MD3200i iSCSI SAN, with 4 Gbit links (dm-multipath) to each server (through two switches), five 15k RPM spindles
- 1 GFS1 file system on the above-mentioned SAN
Two of the nodes share a single GFS file system, which hosts virtual machine containers (for web serving, mail and light database work). Performance is noticeably worse than we expected, so we've started to investigate. The load is not high (we previously ran the same containers without issues on a single, much cheaper server using local 7200 RPM disks and ext3), but there is a lot of small-block I/O.
When I run iptraf (monitoring only the iSCSI traffic) and top side by side on a single server, I often see dlm_send using 100% CPU. During this time, I/O to our GFS file system seems to be blocked and container performance drops sharply.
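For anyone who wants to log this rather than watch top, here is a minimal Python sketch that samples the CPU share of a single process or kernel thread from /proc/<pid>/stat. The function names (cpu_ticks, cpu_percent, sample) are my own, and the PID of dlm_send has to be looked up separately (e.g. with ps); the field positions follow the proc(5) layout.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so only the part after the last ')' is split on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)), so utime (field 14)
    # and stime (field 15) end up at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """Convert a tick delta over an interval into a percentage of one CPU."""
    return 100.0 * (ticks_after - ticks_before) / (interval_s * ticks_per_s)


def sample(pid, interval_s=1.0):
    """Measure the CPU share of `pid` over `interval_s` seconds."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)
```

Calling sample(pid) in a loop and writing the result next to a timestamp should make it easier to line the dlm_send spikes up with the I/O stalls.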
My question is: what causes dlm_send to use 100% CPU, and is this what causes the low GFS performance? Based on what the servers are doing, I'm not expecting any deadlocks (they're mostly accessing separate parts of the file system), so I suspect some other kind of limitation here. Could it be the 100 Mbit network?
I've looked at the waiters queue through debugfs and it varies between 0 and 60 entries, which doesn't seem too bad to me. The locks table has some 30,000 locks. All DLM and GFS settings are defaults. Any hints on where to look are appreciated!
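In case it helps to reproduce the waiters numbers, this is roughly how they can be counted from Python. It is a sketch under a few assumptions: debugfs is mounted at /sys/kernel/debug, the DLM files live in a dlm/ subdirectory with one <lockspace>_waiters file per lockspace, and each non-empty line in such a file is one waiting lock. The function name waiter_counts is hypothetical.

```python
import glob
import os


def waiter_counts(dlm_dir="/sys/kernel/debug/dlm"):
    """Return {lockspace: number of waiter entries} for every lockspace.

    Assumes one `<lockspace>_waiters` file per lockspace in `dlm_dir`
    and one waiting lock per non-empty line.
    """
    suffix = "_waiters"
    counts = {}
    for path in glob.glob(os.path.join(dlm_dir, "*" + suffix)):
        lockspace = os.path.basename(path)[:-len(suffix)]
        with open(path) as f:
            counts[lockspace] = sum(1 for line in f if line.strip())
    return counts
```

Polling waiter_counts() once a second while dlm_send is busy would show whether the queue grows during the stalls or stays in the 0 to 60 range.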
Regards,
Martijn Storck
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster