Wendy Cheng wrote:
Terry wrote:
On Tue, Jun 17, 2008 at 5:22 PM, Terry <td3201@xxxxxxxxx> wrote:
On Tue, Jun 17, 2008 at 3:09 PM, Wendy Cheng
<s.wendy.cheng@xxxxxxxxx> wrote:
Hi, Terry,
I am still seeing some high load averages. Here is an example of a GFS
configuration. I left statfs_fast off as it would not apply to one of my
volumes for an unknown reason. Not sure that would have helped anyway. I do,
however, feel that reducing scand_secs helped a little:
Sorry I missed scand_secs (I was mindless, as my brain was mostly occupied
by daytime work).
To simplify the view, glock states include exclusive (write), shared (read),
and not-locked (in reality, there are more). An exclusive lock has to be
demoted (after demote_secs) to shared, then to not-locked (another
demote_secs), before it is scanned (every scand_secs) and added to the
reclaim list, where it can be purged. During the transition from the
exclusive to the shared state, the file contents need to be flushed to disk
(to keep the file contents cluster-coherent). All of the above assumes the
file (protected by this glock) is not being accessed (idle).
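In case it helps, here is a rough sketch of how those knobs can be inspected
and tuned on a mounted GFS1 filesystem (the mount point and values below are
only placeholders; settune changes do not persist across a remount):

  gfs_tool gettune /mnt/gfs                   # show current tunable values
  gfs_tool settune /mnt/gfs demote_secs 100   # demote held glocks sooner
  gfs_tool settune /mnt/gfs scand_secs 5      # scan for purgeable glocks more often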
You have hit an area where GFS normally doesn't perform well. With GFS1 in
maintenance mode while GFS2 seems to be so far away, ext3 could be a better
answer. However, before switching, do make sure to test it thoroughly (since
ext3 could have the very same issue as well - check out:
http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).
Did you look at (and test) the GFS "nolock" protocol (for single-node GFS)?
It bypasses some locking overhead and can be switched to DLM in the future
(just make sure you reserve enough journal space - the rule of thumb is one
journal per node, so know how many nodes you plan to have in the future).
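As a rough sketch of what that looks like in practice (the device path,
cluster/filesystem names, and journal count are only placeholders - adjust
them to your setup):

  # create the filesystem with the nolock protocol, but reserve one
  # journal per node you expect to run in the future (e.g. 4)
  gfs_mkfs -p lock_nolock -j 4 /dev/myvg/mylv

  # later, to move to the cluster-aware protocol, update the superblock
  # (with the filesystem unmounted everywhere)
  gfs_tool sb /dev/myvg/mylv proto lock_dlm
  gfs_tool sb /dev/myvg/mylv table mycluster:myfs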
-- Wendy
Good points. I could try the nolock feature, I suppose. I'm not quite clear
on how to reserve journal space. I forgot to post the CPU time; check out
this:
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 4822 root  10  -5     0    0    0 S    1  0.0   2159:15 dlm_recv
 4820 root  10  -5     0    0    0 S    1  0.0 368:09.34 dlm_astd
 4821 root  10  -5     0    0    0 S    0  0.0 153:06.80 dlm_scand
 3659 root  10  -5     0    0    0 S    0  0.0 134:40.14 scsi_wq_4
 4823 root  11  -5     0    0    0 S    1  0.0 109:33.33 dlm_send
  367 root  10  -5     0    0    0 S    0  0.0 103:33.74 kswapd0
gfs_glockd is further down the list, so I am not so concerned with that right
now. It appears turning on nolock would do the trick. The times aren't
extremely accurate because I have failed this cluster over between nodes
while testing.
Here is some more testing information....
I created a new 1 TB volume on my iSCSI SAN and formatted it as ext3. I then
used dd to create a 100G file. This yielded roughly 900 MB/sec. I then
stopped my application and did the same thing on an existing GFS volume.
This gave me about 850 KB/sec. This isn't an iSCSI issue. This appears to be
a load issue and a matter of the number of I/Os occurring on these volumes.
That said, I would expect the changes I made to result in a major performance
improvement. Since they didn't, what other points could I consider? If it's
a GFS issue, ext3 is the way to go - maybe even switch to active-active on
my NFS cluster. If it's a backend disk issue, I would expect to see the
throughput on my iSCSI link (bond1) fully utilized. It's not. Could I be
thrashing the disks? This is an iSCSI SAN with 30 SATA disks. Just bouncing
some thoughts around to see if anyone has any more thoughts.
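(One way I could check whether the disks or the link are the bottleneck - a
sketch using sysstat's iostat/sar, assuming bond1 is the iSCSI-facing
interface:

  iostat -x 5    # per-device utilization and wait times on the iSCSI LUNs
  sar -n DEV 5   # per-interface throughput, including bond1

)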
Really need to focus on my daytime job - its workload has been climbing ...
but can't help placing a quick comment here ...
The 900 MB/s vs. 850 KB/s difference looks like a caching issue - that is,
for 900 MB/s it looks like the data was still lingering in the system cache,
while in the 850 KB/s case the data might already have hit disk. A cluster
filesystem normally syncs more by its nature. In general, ext3 does perform
better in a single-node environment, but the difference should not be as big
as above.
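To make the comparison fair, the dd runs need to force the data to disk on
both filesystems. A sketch (the paths are placeholders, and conv=fdatasync /
oflag=direct availability depends on the coreutils version installed):

  # flush the file to storage before dd reports its rate
  dd if=/dev/zero of=/mnt/ext3/testfile bs=1M count=10240 conv=fdatasync
  dd if=/dev/zero of=/mnt/gfs/testfile bs=1M count=10240 conv=fdatasync

  # or bypass the page cache entirely
  dd if=/dev/zero of=/mnt/gfs/testfile bs=1M count=10240 oflag=direct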
There are certainly more tuning knobs available (such as journal size and/or
network buffer size) to make a GFS-over-iSCSI "dd" run better, but it is
pointless. To deploy a cluster filesystem for production usage, the tuning
should not be driven by such a simple-minded command. You also have to
consider the support issues when deploying a filesystem. GFS1 is a little bit
out of date, and any new development and/or significant performance
improvements would likely go into GFS2, not GFS1. Research GFS2 (google to
see what other people have said about it) to understand whether its direction
fits your needs (so you can migrate from GFS1 to GFS2 if you bump into any
showstopper in the future). If not, ext3 (with ext4 actively developed) is a
fine choice, if I read your configuration right from previous posts.
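(For reference, the in-place migration path is the gfs2_convert tool - only a
sketch here, with a placeholder device path; take a full backup and fsck
first, since the conversion is one-way:

  gfs_fsck /dev/myvg/mylv       # make sure the GFS1 filesystem is clean
  gfs2_convert /dev/myvg/mylv   # convert the filesystem in place to GFS2

)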
Or ... there is a known GFS1 writepage issue if most of your files are very
big ... The problem is fixed in RHEL kernels, though. What is your kernel
version?
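(A quick way to report that - the package names below assume RHEL 5 style
packaging:

  uname -r                    # running kernel
  rpm -q kmod-gfs gfs-utils   # GFS1 kernel module and userland versions

)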
-- Wendy