Michael,

For comparison, could you do your dd(1) tests with a very large block size
(1 MB) and tell us the results, please? I have a vague hunch that the
problem may have something to do with coalescing (or not) of IO operations.

Also, which IO scheduler are you using?

Thanks and regards,

Chris Jankowski

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx
[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Michael Lackner
Sent: Tuesday, 15 June 2010 00:22
To: linux clustering
Subject: Re: GFS (1 & partially 2) performance problems

Hello!

Thanks for your reply. I unfortunately forgot to mention HOW I was
actually testing, stupid of me.

I tested with dd, doing 4kB blocksize reads and writes, 160GB total
testfile size per node. I read from /dev/zero for the write tests and
wrote to /dev/null for the read tests. So, totally sequential, with a
somewhat small blocksize (equal to the filesystem block size). The
performance was measured directly on the Fibrechannel switch, which
offers nice per-port monitoring for that purpose.

I have yet to do some serious read testing on GFS2. I aborted my GFS2
tests because write performance was not up to GFS1 to begin with. My
older GFS2 benchmarks (I did this with a 2-node configuration before)
are lost, so I will need to re-do them to give you some numbers.

After each write test I did a "sync" to flush everything to disk. I did
not do this before or after the read tests, though.

Since you mentioned journal size: "gfs_tool counters <mountpoint>" said
that only 2-3% of the log space was in use after the tests (I guess this
is the per-node filesystem journal?).

As for the direct I/O tests, by that do you mean testing without ANY
caching going on, i.e. a synchronous write?

What I did before was test EXT3 (~190MB/s) and XFS (~320MB/s) on the
storage array. I think what I'm getting here is raw throughput, since I
am not monitoring in the OS, but at the Fibrechannel switch itself.

I will do GFS2 read tests similar to those conducted for GFS1. I'll be
able to do that tomorrow morning, then I can post the numbers here.

Thanks!
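[For reference, the dd invocations being discussed would look roughly
like the sketch below. This is illustrative only: the mount point
/mnt/gfs and the per-node directory names are placeholders, not the
paths actually used in these tests.]

    # sequential write test as described: 4 kB blocks, 160 GB per node
    dd if=/dev/zero of=/mnt/gfs/node1/testfile bs=4k count=41943040

    # the large-block variant Chris asks for: same 160 GB, 1 MB blocks
    dd if=/dev/zero of=/mnt/gfs/node1/testfile bs=1M count=163840

    # flush dirty data to the array after a write test
    sync

    # sequential read back of the node's own testfile
    dd if=/mnt/gfs/node1/testfile of=/dev/null bs=1M

    # show the elevator in use for the FC LUN (replace sdX with the real device)
    cat /sys/block/sdX/queue/scheduler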
Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>
>> Hello!
>>
>> I am currently building a cluster sitting on CentOS 5 for GFS usage.
>>
>> At the moment, the storage subsystem consists of an HP MSA2312
>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines
>> are connected to that switch over 8gbit FC. The disks themselves are
>> 12 * 15,000rpm SAS drives configured in RAID-5 with two hotspares.
>>
>> Now, the whole storage shall be shared (single filesystem), which is
>> where GFS comes in.
>>
>> The cluster is only 3 nodes large at the moment, more nodes will be
>> added later on. I am currently testing GFS1 and GFS2 for performance.
>> Lock management is done over single 1Gbit Ethernet links (1 per
>> machine).
>>
>> Thing is, with GFS1 I get far better performance than with the newer
>> GFS2 across the board, with a few tunable parameters set; for writes,
>> GFS1 is roughly twice as fast.
>>
> What tests are you running? GFS2 is generally faster than GFS1 except
> for streaming writes, which is an area that we are putting some effort
> into solving currently. Small writes (one fs block (4k default) or
> less) on GFS2 are much faster than on GFS1.
>
>> But, concurrent reads are totally abysmal. The total write
>> performance (all nodes combined) sits around 280-330Mbyte/sec,
>> whereas the READ performance is as low as 30-40Mbyte/sec when doing
>> concurrent reads. Surprisingly, single-node read is somewhat OK at
>> 180Mbyte/sec, but as soon as several nodes are reading from GFS
>> (version 1 at the moment) at the same time, things turn ugly.
>>
> Reads on GFS2 should be much faster than GFS1, so it sounds as if
> something isn't working correctly for some reason. For cached data,
> reads on GFS2 should be as fast as ext2/3 since the code path is
> identical (to the page cache) and only changes if pages are not cached.
> GFS1 does its locking at a higher level, so there will be more
> overhead for cached reads in general.
>
> Do make sure that if you are preparing the test files for reading all
> from one node (or even just a different node to the one on which you
> are running the read tests), you sync them to disk on that node before
> starting the tests, to avoid issues with caching.
>
>> This is strange, because for writes, global performance across the
>> cluster increases slightly when adding more nodes. But for reads, the
>> opposite seems to be true.
>>
>> For the read and write tests, separate testfiles were created and read
>> for each node, with each testfile sitting in its own subdirectory, so
>> no node would access another node's file.
>>
> That sounds like a good test set up to me.
>
>> GFS1 was created with the following mkfs.gfs parameters:
>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>> (4kB blocksize, 16 * 128MB journals, 2GB resource groups, Distributed
>> Lock Manager)
>>
>> Mount options set: "noatime,nodiratime,noquota"
>>
>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1,
>> demote_secs 20"
>>
> You shouldn't normally need to set glock_purge and demote_secs to
> anything other than the default. These settings no longer exist in
> GFS2 since it makes use of the shrinker subsystem provided by the VM
> and is auto-tuning. If your workload is metadata heavy, you could try
> boosting the journal size and/or the incore_log_blocks tunable.
>
>> Also, in /etc/cluster/cluster.conf, I added this:
>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>> <gfs_controld plock_rate_limit="0"/>
>>
>> Any ideas on how to figure out what's going wrong, and how to tune
>> GFS1 for better concurrent read performance, or tune GFS2 in general
>> to be competitive with or better than GFS1?
>>
>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and
>> somewhat good reaction times while under heavy sequential and/or
>> random load. But for now, I just want to get the sequential reading
>> to work acceptably fast.
>>
>> Thanks a lot for your help!
>>
> Can you try doing some I/O direct to the block device so that we can
> get an idea of what the raw device can manage? Using dd, both read and
> write, across the nodes (different disk locations on each node to
> simulate different files).
>
> I'm wondering if the problem might be due to the seek pattern
> generated by the multiple read locations.
>
> Steve.
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner@xxxxxxxxxxxx | +43 (0)3842/402-1505

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
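[For reference, the raw block-device comparison Steve asks for could be
run roughly as sketched below. The device name /dev/mapper/mpath0 and
the offsets are placeholders for the actual FC LUN and per-node disk
locations; they are not taken from the thread.]

    # node 1: read 10 GB from the start of the LUN, bypassing the page cache
    dd if=/dev/mapper/mpath0 of=/dev/null bs=1M count=10240 iflag=direct

    # node 2: same, but starting 200 GB into the device (skip is in bs-sized
    # blocks) to simulate reading a different file
    dd if=/dev/mapper/mpath0 of=/dev/null bs=1M count=10240 skip=204800 iflag=direct

    # write variant: only against a scratch LUN, since writing to the raw
    # device underneath a mounted GFS would destroy the filesystem
    dd if=/dev/zero of=/dev/mapper/mpath_scratch bs=1M count=10240 oflag=direct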