On Sun, Sep 22, 2013 at 07:40:24AM -0500, Mark Nelson wrote:
> On 09/22/2013 03:12 AM, Quenten Grasso wrote:
> >
> > Hi All,
> >
> > I'm finding my write performance is less than I would have expected.
> > After spending a considerable amount of time testing several different
> > configurations, I can never seem to break over ~360 MB/s write, even
> > when using tmpfs for journaling.
> >
> > I've purchased 3x Dell R515's with 1x AMD 6-core CPU, 12x 3TB SAS
> > disks, 2x 100GB Intel DC S3700 SSDs, 32GB RAM, a PERC H710P RAID
> > controller and dual-port 10GbE network cards.
> >
> > First up, I realise the SSDs were a mistake; I should have bought the
> > 200GB ones, as they have considerably better write throughput
> > (~375 MB/s vs 200 MB/s).
> >
> > On to the node configuration:
> >
> > 2x 3TB disks in RAID1 for OS/MON plus one partition for an OSD, and
> > 12 disks each in a single-disk RAID0 (JBOD fashion) with a 1MB stripe
> > size.
> >
> > (The stripe size part was particularly important: I found the stripe
> > size matters considerably even on a single-disk RAID0, contrary to
> > what you might read on the internet.)
> >
> > Each disk is configured with write-back cache enabled and read-ahead
> > disabled.
> >
> > For networking, all nodes are connected via an LACP bond with L3
> > hashing, and using iperf I can get up to 16 Gbit/s tx and rx between
> > the nodes.
> >
> > OS: Ubuntu 12.04.3 LTS w/ kernel 3.10.12-031012-generic (had to
> > upgrade the kernel due to driver issues with the 10Gbit Intel NICs).
> >
> > So this gives me 11 OSDs and 2 SSDs per node.
>
> I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you
> definitely will want to do some investigation to make sure that OSD
> isn't holding the other ones back. iostat or collectl might be useful,
> along with the ceph osd admin socket and the dump_ops_in_flight and
> dump_historic_ops commands.

I was wondering whether the latency on the network is OK; maybe the LACP
bonding isn't working correctly, or the L3 hashing is causing problems.
ifstat or iptraf, or graphs of the SNMP counters from the switch, might
show you where the traffic went.

For my last tests I used nping's echo mode (from nmap) with --rate to do
latency tests under network load. It generates a lot of output, which
slows it down a bit, so you might want to redirect it somewhere.

> > Next I tried several different configurations, two of which I'll
> > briefly describe below.
> >
> > 1) Cluster configuration 1:
> >
> > 33 OSDs with 6x SSDs as journals, w/ 15GB journals on SSD.
> >
> > # ceph osd pool create benchmark1 1800 1800
> > # rados bench -p benchmark1 180 write --no-cleanup
> > --------------------------------------------------
> > Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> > seconds or 0 objects
> > Total time run:         180.250417
> > Total writes made:      10152
> > Write size:             4194304
> > Bandwidth (MB/sec):     225.287
> > Stddev Bandwidth:       35.0897
> > Max bandwidth (MB/sec): 312
> > Min bandwidth (MB/sec): 0
> > Average Latency:        0.284054
> > Stddev Latency:         0.199075
> > Max latency:            1.46791
> > Min latency:            0.038512
> > --------------------------------------------------
>
> What was your pool replication set to?
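
I don't know what the pool was set to, but that number matters a lot for
the maths here: with 2x or 3x replication, 225 MB/s at the client is
roughly 450-675 MB/s of actual writes landing on the cluster (and, with
SSD journals, the same amount again passing through the journals first).
A rough way to check it (the field is shown as "rep size" or
"replicated size" depending on the Ceph version, and "osd pool get" may
not exist on older releases):

# ceph osd dump | grep benchmark1
# ceph osd pool get benchmark1 size

Knowing that makes it easier to judge whether six 100GB S3700s at
roughly 200 MB/s sequential write each are plausibly the ceiling or not.
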
> > # rados bench -p benchmark1 180 seq
> > -------------------------------------------------
> > Total time run:       43.782554
> > Total reads made:     10120
> > Read size:            4194304
> > Bandwidth (MB/sec):   924.569
> > Average Latency:      0.0691903
> > Max latency:          0.262542
> > Min latency:          0.015756
> > -------------------------------------------------
> >
> > In this configuration my write performance suffers a lot; the SSDs
> > seem to be the bottleneck, and my write performance using rados bench
> > was around 224-230 MB/s.
> >
> > 2) Cluster configuration 2:
> >
> > 33 OSDs with 1GB journals on tmpfs.
> >
> > # ceph osd pool create benchmark1 1800 1800
> > # rados bench -p benchmark1 180 write --no-cleanup
> > --------------------------------------------------
> > Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> > seconds or 0 objects
> > Total time run:         180.044669
> > Total writes made:      15328
> > Write size:             4194304
> > Bandwidth (MB/sec):     340.538
> > Stddev Bandwidth:       26.6096
> > Max bandwidth (MB/sec): 380
> > Min bandwidth (MB/sec): 0
> > Average Latency:        0.187916
> > Stddev Latency:         0.0102989
> > Max latency:            0.336581
> > Min latency:            0.034475
> > --------------------------------------------------
>
> Definitely low, especially with journals on tmpfs. :(

I'm no expert, but I did notice the tmpfs journals are only 1GB, which
seems kinda small. Then again, the systems don't have much memory to
spare, so there isn't a lot of choice; even making them slightly larger
would cut into the memory available for the filesystem cache, which
might be a bad idea as well, I guess.

> How are the CPUs doing at this point? We have some R515s in our lab,
> and they definitely are slow too. Ours have 7 OSD disks and 1 Dell
> branded SSD (usually unused) each and can do about ~150MB/s writes per
> system. It's actually a puzzle we've been trying to solve for quite
> some time.
>
> Some thoughts:
>
> Could the expander backplane be having issues due to having to tunnel
> STP for the SATA SSDs (or potentially be causing expander wide resets)?
> Could the H700 (and apparently H710) be doing something unusual that
> the stock LSI firmware handles better? We replaced the H700 with an
> Areca 1880 and definitely saw changes in performance (better large IO
> throughput and worse IOPS), but the performance was still much lower
> than in a Supermicro node with no expanders in the backplane using
> either an LSI 2208 or Areca 1880.
>
> Things you might want to try:
>
> - single node tests, and if you have an alternate controller you can
>   try, seeing if that works better.
> - removing the S3700s from the chassis entirely and retrying the tmpfs
>   journal tests.
> - Since the H710 is SAS2208 based, you may be able to use megacli to
>   set it into JBOD mode and see if that works any better (it may if
>   you are using SSD or tmpfs backed journals).
>
> MegaCli -AdpSetProp -EnableJBOD -val -aN|-a0,1,2|-aALL
> MegaCli -PDMakeJBOD -PhysDrv[E0:S0,E1:S1,...] -aN|-a0,1,2|-aALL

I think I remember a presentation from Dreamhost where they mentioned
that for their Ceph installation they replaced the Dell firmware with the
original LSI firmware to solve some problems. Maybe that is a route that
is also possible? (At your own risk, obviously. I really don't know if
that is possible with this controller; don't blame me if you brick it!)
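
If you do consider reflashing, it is probably worth recording the exact
firmware the controller is running first, and checking whether the Dell
firmware exposes the JBOD property at all. A rough sketch (I don't have
an H710 at hand, so the grep patterns are a guess at the usual MegaCli
adapter-info output, and the JBOD line may simply not be present on the
Dell build):

# MegaCli -AdpAllInfo -aALL | grep -i -E 'product name|fw package|jbod'

If nothing JBOD-related shows up in the adapter info, that by itself is
a hint about what the Dell firmware allows.
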
> > # rados bench -p benchmark1 180 seq
> > -------------------------------------------------
> > Total time run:       76.481303
> > Total reads made:     15328
> > Read size:            4194304
> > Bandwidth (MB/sec):   801.660
> > Average Latency:      0.079814
> > Max latency:          0.317827
> > Min latency:          0.016857
> > -------------------------------------------------
> >
> > Now there should be no journaling bottleneck, since we are using
> > tmpfs, yet the write speed is still less than I would expect and the
> > SAS disks are barely busy according to iostat.
> >
> > So I thought it might be a disk bus throughput issue, and next I ran
> > some dd tests.
> >
> > The commands below are in a script, dd-x.sh, which executes the 11
> > readers or writers at once:
> >
> > dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=32k count=100k oflag=direct &
> >
> > This gives me an aggregated write throughput of around 1,135 MB/s.
> >
> > A similar script to test reads:
> >
> > dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> > This gives me an aggregated read throughput of around 1,382 MB/s.
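
A side note on the dd runs: one sequential stream per disk with
oflag=direct is close to a best case for the controller's write-back
cache. The OSDs also issue syncs and flushes, so as a rough (and
admittedly crude) worst-case comparison you could rerun the same script
with dsync added and a smaller count, along the lines of:

dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=10k oflag=direct,dsync &

If the aggregate drops off a cliff with dsync, that points more towards
how the controller handles flushes than towards raw disk throughput.
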
> > Next I'll lower the block size to show the results:
> >
> > dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=4k count=100k oflag=direct &
> >
> > This gives me an aggregated write throughput of around 300 MB/s.
> >
> > dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> > This gives me an aggregated read throughput of around 430 MB/s.
> >
> > This is my ceph.conf; the only difference between the two configs is
> > journal dio = false:
> >
> > ----------------
> > [global]
> > auth cluster required = cephx
> > auth service required = cephx
> > auth client required = cephx
> > public network = 10.100.96.0/24
> > cluster network = 10.100.128.0/24
> > journal dio = false
> >
> > [mon]
> > mon data = /var/ceph/mon.$id
> >
> > [mon.a]
> > host = rbd01
> > mon addr = 10.100.96.10:6789
> >
> > [mon.b]
> > host = rbd02
> > mon addr = 10.100.96.11:6789
> >
> > [mon.c]
> > host = rbd03
> > mon addr = 10.100.96.12:6789
> >
> > [osd]
> > osd data = /srv/ceph/osd.$id
> > osd journal size = 1000
> > osd mkfs type = xfs
> > osd mkfs options xfs = "-f"
> > osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
> >
> > [osd.0]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sda5
> > [osd.1]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdb2
> > [osd.2]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdc2
> > [osd.3]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdd2
> > [osd.4]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sde2
> > [osd.5]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdf2
> > [osd.6]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdg2
> > [osd.7]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdh2
> > [osd.8]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdi2
> > [osd.9]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdj2
> > [osd.10]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdk2
> > [osd.11]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sda5
> > [osd.12]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdb2
> > [osd.13]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdc2
> > [osd.14]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdd2
> > [osd.15]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sde2
> > [osd.16]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdf2
> > [osd.17]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdg2
> > [osd.18]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdh2
> > [osd.19]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdi2
> > [osd.20]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdj2
> > [osd.21]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdk2
> > [osd.22]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sda5
> > [osd.23]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdb2
> > [osd.24]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdc2
> > [osd.25]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdd2
> > [osd.26]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sde2
> > [osd.27]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdf2
> > [osd.28]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdg2
> > [osd.29]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdh2
> > [osd.30]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdi2
> > [osd.31]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdj2
> > [osd.32]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdk2
> > ---------------------
> >
> > Any ideas?
> >
> > Cheers,
> > Quenten
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html