which is a stripe of the gluster storage servers, this is the performance I
get (note: use a file size > amount of RAM on the client and server systems,
13GB in this case):

4k block size:

111 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 142.281 MB/s 1138.247 mbps 93.561 seconds
pir4: Read test (dd): 274.321 MB/s 2194.570 mbps 48.527 seconds

Testing from 8k - 128k block sizes with dd, the best performance was
achieved at a 64k block size:

114 pir4:/pirstripe% /sb/admin/scripts/nfsSpeedTest -s 13g -b 64k -y
pir4: Write test (dd): 213.344 MB/s 1706.750 mbps 62.397 seconds
pir4: Read test (dd): 955.328 MB/s 7642.620 mbps 13.934 seconds

This is to the /pirdist directories, which are mounted in distribute mode
(a file is written to only one of the gluster servers):

105 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y
pir4: Write test (dd): 182.410 MB/s 1459.281 mbps 72.978 seconds
pir4: Read test (dd): 244.379 MB/s 1955.033 mbps 54.473 seconds

106 pir4:/pirdist% /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
pir4: Write test (dd): 204.297 MB/s 1634.375 mbps 65.160 seconds
pir4: Read test (dd): 340.427 MB/s 2723.419 mbps 39.104 seconds

For reference/control, here's the same test writing straight to the XFS
filesystem on one of the gluster storage nodes:

[sabujp@gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y
gluster1: Write test (dd): 398.971 MB/s 3191.770 mbps 33.366 seconds
gluster1: Read test (dd): 234.563 MB/s 1876.501 mbps 56.752 seconds

[sabujp@gluster1 tmp]$ /sb/admin/scripts/nfsSpeedTest -s 13g -y -b 64k
gluster1: Write test (dd): 442.251 MB/s 3538.008 mbps 30.101 seconds
gluster1: Read test (dd): 219.708 MB/s 1757.660 mbps 60.590 seconds

The read test seems to scale linearly with the number of storage servers
(almost 1 GB/s!). Interestingly, the /pirdist read test at a 64k block size
was 120 MB/s faster than the read test straight from XFS; however, it could
have been that gluster1 was busy and that when I read from /pirdist the file
was actually being read from one of the other four, less busy storage nodes.
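nfsSpeedTest is just an in-house wrapper around dd, so the numbers can be
reproduced without it. A rough sketch of the 64k run above (the test file
path is a placeholder, not what the script actually uses; dd itself prints
the MB/s figure at the end of each pass):

# ~13 GiB streaming write (212992 x 64k blocks); keep the file larger than
# RAM on both client and servers, and sync so dirty pages are flushed
time ( dd if=/dev/zero of=/pirstripe/ddtest.out bs=64k count=212992 && sync )

# drop the client page cache (as root) so the read pass has to hit the servers
sync
echo 3 > /proc/sys/vm/drop_caches

# streaming read back
dd if=/pirstripe/ddtest.out of=/dev/null bs=64k

rm -f /pirstripe/ddtest.out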
Here's our storage node setup (many of these settings may not apply to v3.2):

####

volume posix-stripe
  type storage/posix
  option directory /export/gluster1/stripe
end-volume

volume posix-distribute
  type storage/posix
  option directory /export/gluster1/distribute
end-volume

volume locks
  type features/locks
  subvolumes posix-stripe
end-volume

volume locks-dist
  type features/locks
  subvolumes posix-distribute
end-volume

volume iothreads
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

volume iothreads-dist
  type performance/io-threads
  option thread-count 16
  subvolumes locks-dist
end-volume

volume server
  type protocol/server
  option transport-type ib-verbs
  option auth.addr.iothreads.allow 10.2.178.*
  option auth.addr.iothreads-dist.allow 10.2.178.*
  option auth.addr.locks.allow 10.2.178.*
  option auth.addr.posix-stripe.allow 10.2.178.*
  subvolumes iothreads iothreads-dist locks posix-stripe
end-volume

####
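In case it helps map this onto a running system: the server-side stack is
storage/posix -> features/locks -> performance/io-threads, exported over
ib-verbs by protocol/server. Roughly how a hand-written volfile like this
gets loaded on a storage node (the volfile and log paths here are just
placeholders, not necessarily what we use):

# each storage node runs its own copy of this volfile, with "option directory"
# pointing at that node's local export
glusterfsd -f /etc/glusterfs/glusterfsd-server.vol \
           -l /var/log/glusterfs/glusterfsd.log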
Here's our stripe client setup:

####

volume client-stripe-1
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster1
  option remote-subvolume iothreads
end-volume

volume client-stripe-2
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster2
  option remote-subvolume iothreads
end-volume

volume client-stripe-3
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster3
  option remote-subvolume iothreads
end-volume

volume client-stripe-4
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster4
  option remote-subvolume iothreads
end-volume

volume client-stripe-5
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster5
  option remote-subvolume iothreads
end-volume

volume readahead-gluster1
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-stripe-1
end-volume

volume readahead-gluster2
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-stripe-2
end-volume

volume readahead-gluster3
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-stripe-3
end-volume

volume readahead-gluster4
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-stripe-4
end-volume

volume readahead-gluster5
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-stripe-5
end-volume

volume writebehind-gluster1
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster1
end-volume

volume writebehind-gluster2
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster2
end-volume

volume writebehind-gluster3
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster3
end-volume

volume writebehind-gluster4
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster4
end-volume

volume writebehind-gluster5
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster5
end-volume

volume quick-read-gluster1
  type performance/quick-read
  subvolumes writebehind-gluster1
end-volume

volume quick-read-gluster2
  type performance/quick-read
  subvolumes writebehind-gluster2
end-volume

volume quick-read-gluster3
  type performance/quick-read
  subvolumes writebehind-gluster3
end-volume

volume quick-read-gluster4
  type performance/quick-read
  subvolumes writebehind-gluster4
end-volume

volume quick-read-gluster5
  type performance/quick-read
  subvolumes writebehind-gluster5
end-volume

volume stat-prefetch-gluster1
  type performance/stat-prefetch
  #subvolumes quick-read-gluster1
  subvolumes writebehind-gluster1
end-volume

volume stat-prefetch-gluster2
  type performance/stat-prefetch
  #subvolumes quick-read-gluster2
  subvolumes writebehind-gluster2
end-volume

volume stat-prefetch-gluster3
  type performance/stat-prefetch
  #subvolumes quick-read-gluster3
  subvolumes writebehind-gluster3
end-volume

volume stat-prefetch-gluster4
  type performance/stat-prefetch
  #subvolumes quick-read-gluster4
  subvolumes writebehind-gluster4
end-volume

volume stat-prefetch-gluster5
  type performance/stat-prefetch
  #subvolumes quick-read-gluster5
  subvolumes writebehind-gluster5
end-volume

volume stripe
  type cluster/stripe
  option block-size 2MB
  #subvolumes client-stripe-1 client-stripe-2 client-stripe-3 client-stripe-4 client-stripe-5
  #subvolumes readahead-gluster1 readahead-gluster2 readahead-gluster3 readahead-gluster4 readahead-gluster5
  #subvolumes writebehind-gluster1 writebehind-gluster2 writebehind-gluster3 writebehind-gluster4 writebehind-gluster5
  #subvolumes quick-read-gluster1 quick-read-gluster2 quick-read-gluster3 quick-read-gluster4 quick-read-gluster5
  subvolumes stat-prefetch-gluster1 stat-prefetch-gluster2 stat-prefetch-gluster3 stat-prefetch-gluster4 stat-prefetch-gluster5
end-volume

####

Quick-read was disabled because there was a bug that causes a crash when it's
enabled. This has been fixed in more recent versions, but I haven't upgraded.
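The /pirstripe mount in the numbers above is just this stripe volfile mounted
with the native client; the /pirdist mount is the same idea with the
distribute volfile below. Roughly (the volfile and log paths are
placeholders):

mkdir -p /pirstripe
glusterfs -f /etc/glusterfs/stripe-client.vol \
          -l /var/log/glusterfs/pirstripe.log /pirstripe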
Here's our client distribute setup:

####

volume client-distribute-1
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster1
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-2
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster2
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-3
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster3
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-4
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster4
  option remote-subvolume iothreads-dist
end-volume

volume client-distribute-5
  type protocol/client
  option transport-type ib-verbs
  option remote-host gluster5
  option remote-subvolume iothreads-dist
end-volume

volume readahead-gluster1
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-distribute-1
end-volume

volume readahead-gluster2
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-distribute-2
end-volume

volume readahead-gluster3
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-distribute-3
end-volume

volume readahead-gluster4
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-distribute-4
end-volume

volume readahead-gluster5
  type performance/read-ahead
  option page-count 4            # 2 is default
  option force-atime-update off  # default is off
  subvolumes client-distribute-5
end-volume

volume writebehind-gluster1
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster1
end-volume

volume writebehind-gluster2
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster2
end-volume

volume writebehind-gluster3
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster3
end-volume

volume writebehind-gluster4
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster4
end-volume

volume writebehind-gluster5
  type performance/write-behind
  option flush-behind on
  subvolumes readahead-gluster5
end-volume

volume quick-read-gluster1
  type performance/quick-read
  subvolumes writebehind-gluster1
end-volume

volume quick-read-gluster2
  type performance/quick-read
  subvolumes writebehind-gluster2
end-volume

volume quick-read-gluster3
  type performance/quick-read
  subvolumes writebehind-gluster3
end-volume

volume quick-read-gluster4
  type performance/quick-read
  subvolumes writebehind-gluster4
end-volume

volume quick-read-gluster5
  type performance/quick-read
  subvolumes writebehind-gluster5
end-volume

volume stat-prefetch-gluster1
  type performance/stat-prefetch
  subvolumes quick-read-gluster1
end-volume

volume stat-prefetch-gluster2
  type performance/stat-prefetch
  subvolumes quick-read-gluster2
end-volume

volume stat-prefetch-gluster3
  type performance/stat-prefetch
  subvolumes quick-read-gluster3
end-volume

volume stat-prefetch-gluster4
  type performance/stat-prefetch
  subvolumes quick-read-gluster4
end-volume

volume stat-prefetch-gluster5
  type performance/stat-prefetch
  subvolumes quick-read-gluster5
end-volume

volume distribute
  type cluster/distribute
  #option block-size 2MB
  #subvolumes client-distribute-1 client-distribute-2 client-distribute-3 client-distribute-4 client-distribute-5
  option min-free-disk 1%
  #subvolumes writebehind-gluster1 writebehind-gluster2 writebehind-gluster3 writebehind-gluster4 writebehind-gluster5
  subvolumes stat-prefetch-gluster1 stat-prefetch-gluster2 stat-prefetch-gluster3 stat-prefetch-gluster4 stat-prefetch-gluster5
end-volume

####

I don't know why my writes are so slow compared to reads. Let me know if
you're able to get better write speeds with the newer version of gluster and
any of the configurations (if they apply) that I've posted. It might compel
me to upgrade.
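If you want to see whether a single client can push more in aggregate, the
quick-and-dirty check I'd run is several parallel dd streams timed as a
batch (a sketch; the file names and sizes are just examples):

# four parallel ~3.25 GiB streams (~13 GiB total, > client RAM) to the
# stripe mount; compare the batch time against a single 13 GiB stream
time (
  for i in 1 2 3 4; do
    dd if=/dev/zero of=/pirstripe/stream.$i bs=64k count=53248 &
  done
  wait
  sync
)
rm -f /pirstripe/stream.[1-4]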
HTH,
Sabuj Pattanayek

> For some background, our compute cluster has 64 compute nodes. The gluster
> storage pool has 10 Dell PowerEdge R515 servers, each with 12 x 2 TB disks.
> We have another 16 Dell PowerEdge R515s used as Lustre storage servers. The
> compute and storage nodes are all connected via QDR Infiniband. Both Gluster
> and Lustre are set to use RDMA over Infiniband. We are using OFED version
> 1.5.2-20101219, Gluster 3.2.2 and CentOS 5.5 on both the compute and storage
> nodes.
>
> Oddly, it seems like there's some sort of bottleneck on the client side --
> for example, we're only seeing about 50 MB/s write throughput from a single
> compute node when writing a 10GB file. But, if we run multiple simultaneous
> writes from multiple compute nodes to the same Gluster volume, we get 50
> MB/s from each compute node. However, running multiple writes from the same
> compute node does not increase throughput. The compute nodes have 48 cores
> and 128 GB RAM, so I don't think the issue is with the compute node
> hardware.
>
> With Lustre, on the same hardware, with the same version of OFED, we're
> seeing write throughput on that same 10 GB file as follows: 476 MB/s single
> stream write from a single compute node and aggregate performance of more
> like 2.4 GB/s if we run simultaneous writes. That leads me to believe that
> we don't have a problem with RDMA, otherwise Lustre, which is also using
> RDMA, should be similarly affected.
>
> We have tried both xfs and ext4 for the backend file system on the Gluster
> storage nodes (we're currently using ext4). We went with distributed (not
> distributed striped) for the Gluster volume -- the thought was that if there
> was a catastrophic failure of one of the storage nodes, we'd only lose the
> data on that node; presumably with distributed striped you'd lose any data
> striped across that volume, unless I have misinterpreted the documentation.
>
> So ... what's expected/normal throughput for Gluster over QDR IB to a
> relatively large storage pool (10 servers / 120 disks)? Does anyone have
> suggested tuning tips for improving performance?
>
> Thanks!
>
> John
>
> --
> ________________________________________________________
>
> John Lalande
> University of Wisconsin-Madison
> Space Science & Engineering Center
> 1225 W. Dayton Street, Room 439, Madison, WI 53706
> 608-263-2268 / john.lalande at ssec.wisc.edu
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users