Somehow the issue has not returned since yesterday when I applied some tuning to our GFS, specifically:
glock_purge 50
demote_secs 100
scand_secs 5
statfs_fast 1
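In case it helps anyone searching the archives: these are runtime tunables set per mount point with gfs_tool settune, roughly as below (/mnt/gfs is just a placeholder for our actual GFS mount point, and the values don't survive a remount, so they need reapplying from a startup script):

gfs_tool settune /mnt/gfs glock_purge 50
gfs_tool settune /mnt/gfs demote_secs 100
gfs_tool settune /mnt/gfs scand_secs 5
gfs_tool settune /mnt/gfs statfs_fast 1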
I'm not sure how this could be related to our DLM issues, but it seems to have some sort of effect. We're able to push the IOPS limit of our SAN without problems now and system load is down to 1.5 from 6.
The cluster traffic is hard to judge since the cluster communicates over our 'Internet' interface (total traffic peaked at about 24 Mbit/s). It might be better to route it over our internal network (once I put a decent switch in there), but how would I do that? I assume:
1. I should use hostnames that resolve to internal IPs in cluster.conf
2. I should change the bindnetaddr in openais.conf (it's now on the default 192.168.2.0, which is not a subnet we use)
Would that do the trick or am I missing something?
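Roughly what I have in mind, with node1-int/node2-int/node3-int and 10.0.1.0 as made-up examples for the internal hostnames and subnet we would use:

cluster.conf fragment (node names resolving to the internal addresses):

  <clusternodes>
    <clusternode name="node1-int" nodeid="1"/>
    <clusternode name="node2-int" nodeid="2"/>
    <clusternode name="node3-int" nodeid="3"/>
  </clusternodes>

openais.conf fragment (binding the totem traffic to the internal subnet, leaving the existing mcastaddr/mcastport lines as they are):

  totem {
    interface {
      ringnumber: 0
      bindnetaddr: 10.0.1.0
    }
  }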
Martijn
On Thu, Feb 24, 2011 at 11:02 AM, Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
Hi,
On Thu, 2011-02-24 at 10:34 +0100, Martijn Storck wrote:
> Hello everyone,
>
>
> We currently have the following RHCS cluster in operation:
>
>
> - 3 nodes, Xeon CPU, 12 GB RAM etc.
> - 100 Mbit network between the cluster nodes
> - Dell MD3200i iSCSI SAN, with 4 Gbit links (dm-multipath) to each
> server (through two switches), 5 15k RPM spindles
> - 1 GFS1 file system on the above mentioned SAN
>
>
> 2 of the nodes share a single GFS file system, which is used for
> hosting virtual machine containers (for web serving, mail and light
> database work). We've noticed that performance is suboptimal so we've
> started to investigate. The load is not high (we previously ran the
> same containers on a single, much cheaper server using local 7200rpm
> disks and ext3 fs without issues), but there is a lot of small block
> I/O.
>
>
> When I run iptraf (only monitoring the iSCSI traffic) and top side by
> side on a single server I often see dlm_send using 100% CPU. During
> this time I/O to our gfs filesystem seems to be blocked and container
> performance goes down the drain.
>
Can you take a netstat -t while the cpu usage is at 100%, that will tell us whether there is queued data at that point in time.
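For example, something along these lines (assuming the default DLM TCP port of 21064; what matters is whether the Send-Q column stays non-zero on the inter-node connections while dlm_send is spinning):

netstat -tn | grep 21064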
>
> My question is: what causes dlm_send to use 100% CPU and is this what
> causes the low GFS performance? Based on what the servers are doing
> I'm not expecting any deadlocks (they're mostly accessing separate
> parts of the filesystem), so I'm suspecting some other kind of
> limitation here. Could it be the 100Mbit network?
>
Well, that depends on how much traffic there is... have you measured the traffic when the problem is occurring?
>
> I've looked into the waiters queue using the debug fs and it varies
> between 0 and 60 entries, which doesn't seem too bad to me. The locks
> table has some 30,000 locks. All DLM and GFS settings are defaults.
> Any hints on where to look are appreciated!
>
It does sound like a performance issue, and it shouldn't be too hard to get to the bottom of what is going on,
Steve.
>
> Regards,
>
>
> Martijn Storck
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster