On Sun, Sep 22, 2013 at 07:40:24AM -0500, Mark Nelson wrote:
> On 09/22/2013 03:12 AM, Quenten Grasso wrote:
> >
> > Hi All,
> >
> > I'm finding my write performance is less than I would have expected.
> > After spending a considerable amount of time testing several different
> > configurations, I can never seem to break over ~360 MB/s write, even
> > when using tmpfs for journaling.
> >
> > I've purchased 3x Dell R515's with 1x AMD 6-core CPU, 12x 3TB SAS
> > disks, 2x 100GB Intel DC S3700 SSDs, 32GB RAM, a PERC H710P RAID
> > controller and dual-port 10GbE network cards.
> >
> > First up, I realise the SSDs were a mistake; I should have bought the
> > 200GB ones, as they have considerably better write throughput
> > (~375 MB/s vs 200 MB/s).
> >
> > On to the node configuration:
> >
> > 2x 3TB disks in RAID1 for OS/MON plus one partition for an OSD, and
> > 12 disks each in a single-disk RAID0 (JBOD fashion) with a 1MB stripe
> > size.
> >
> > (The stripe size part was particularly important: I found the stripe
> > size matters considerably even on a single-disk RAID0, contrary to
> > what you might read on the internet.)
> >
> > Each disk is configured with write-back cache enabled and read-ahead
> > disabled.
> >
> > For networking, all nodes are connected via an LACP bond with L3
> > hashing, and using iperf I can get up to 16 Gbit/s tx and rx between
> > the nodes.
> >
> > OS: Ubuntu 12.04.3 LTS w/ kernel 3.10.12-031012-generic (had to
> > upgrade the kernel due to driver issues with the 10Gbit Intel NICs).
> >
> > So this gives me 11 OSDs and 2 SSDs per node.
>
> I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you
> definitely will want to do some investigation to make sure that OSD
> isn't holding the other ones back. iostat or collectl might be useful,
> along with the ceph osd admin socket and the dump_ops_in_flight and
> dump_historic_ops commands.

I was wondering whether the latency on the network is OK; maybe the LACP
bonding isn't working correctly, or the L3 hashing is causing problems.
ifstat or iptraf, or graphs of the SNMP counters from the switch, might
show you where the traffic went.

For my last tests I used nping's echo mode (from nmap) with --rate to do
latency tests under network load. It generates a lot of output, which
slows it down a bit, so you might want to redirect it somewhere.

> > Next I tried several different configurations, two of which I'll
> > briefly describe below.
> >
> > 1) Cluster configuration 1:
> >
> > 33 OSDs with 6x SSDs as journals, w/ 15GB journals on SSD.
> >
> > # ceph osd pool create benchmark1 1800 1800
> > # rados bench -p benchmark1 180 write --no-cleanup
> > --------------------------------------------------
> > Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> > seconds or 0 objects
> > Total time run:         180.250417
> > Total writes made:      10152
> > Write size:             4194304
> > Bandwidth (MB/sec):     225.287
> > Stddev Bandwidth:       35.0897
> > Max bandwidth (MB/sec): 312
> > Min bandwidth (MB/sec): 0
> > Average Latency:        0.284054
> > Stddev Latency:         0.199075
> > Max latency:            1.46791
> > Min latency:            0.038512
> > --------------------------------------------------
>
> What was your pool replication set to?
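
I don't know what the pool was set to, but that number matters a lot for
the maths here: with 2x or 3x replication, 225 MB/s at the client is
roughly 450-675 MB/s of actual writes landing on the cluster (and, with
SSD journals, the same amount again passing through the journals first).
A rough way to check it (the field is shown as "rep size" or
"replicated size" depending on the Ceph version, and "osd pool get" may
not exist on older releases):

# ceph osd dump | grep benchmark1
# ceph osd pool get benchmark1 size

Knowing that makes it easier to judge whether six 100GB S3700s at
roughly 200 MB/s sequential write each are plausibly the ceiling or not.
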
> > # rados bench -p benchmark1 180 seq
> > -------------------------------------------------
> > Total time run:       43.782554
> > Total reads made:     10120
> > Read size:            4194304
> > Bandwidth (MB/sec):   924.569
> > Average Latency:      0.0691903
> > Max latency:          0.262542
> > Min latency:          0.015756
> > -------------------------------------------------
> >
> > In this configuration my write performance suffers a lot; the SSDs
> > seem to be the bottleneck, and my write performance using rados bench
> > was around 224-230 MB/s.
> >
> > 2) Cluster configuration 2:
> >
> > 33 OSDs with 1GB journals on tmpfs.
> >
> > # ceph osd pool create benchmark1 1800 1800
> > # rados bench -p benchmark1 180 write --no-cleanup
> > --------------------------------------------------
> > Maintaining 16 concurrent writes of 4194304 bytes for up to 180
> > seconds or 0 objects
> > Total time run:         180.044669
> > Total writes made:      15328
> > Write size:             4194304
> > Bandwidth (MB/sec):     340.538
> > Stddev Bandwidth:       26.6096
> > Max bandwidth (MB/sec): 380
> > Min bandwidth (MB/sec): 0
> > Average Latency:        0.187916
> > Stddev Latency:         0.0102989
> > Max latency:            0.336581
> > Min latency:            0.034475
> > --------------------------------------------------
>
> Definitely low, especially with journals on tmpfs. :(

I'm no expert, but I did notice the tmpfs journals are only 1GB, which
seems kinda small. Then again, the systems don't have much memory to
spare, so there isn't a lot of choice; even making them slightly larger
would cut into the memory available for the filesystem cache, which
might be a bad idea as well, I guess.

> How are the CPUs doing at this point? We have some R515s in our lab,
> and they definitely are slow too. Ours have 7 OSD disks and 1 Dell
> branded SSD (usually unused) each and can do about ~150MB/s writes per
> system. It's actually a puzzle we've been trying to solve for quite
> some time.
>
> Some thoughts:
>
> Could the expander backplane be having issues due to having to tunnel
> STP for the SATA SSDs (or potentially be causing expander wide resets)?
> Could the H700 (and apparently H710) be doing something unusual that
> the stock LSI firmware handles better? We replaced the H700 with an
> Areca 1880 and definitely saw changes in performance (better large IO
> throughput and worse IOPS), but the performance was still much lower
> than in a Supermicro node with no expanders in the backplane using
> either an LSI 2208 or Areca 1880.
>
> Things you might want to try:
>
> - single node tests, and if you have an alternate controller you can
>   try, seeing if that works better.
> - removing the S3700s from the chassis entirely and retrying the tmpfs
>   journal tests.
> - Since the H710 is SAS2208 based, you may be able to use megacli to
>   set it into JBOD mode and see if that works any better (it may if
>   you are using SSD or tmpfs backed journals).
>
> MegaCli -AdpSetProp -EnableJBOD -val -aN|-a0,1,2|-aALL
> MegaCli -PDMakeJBOD -PhysDrv[E0:S0,E1:S1,...] -aN|-a0,1,2|-aALL

I think I remember a presentation from Dreamhost where they mentioned
that for their Ceph installation they replaced the Dell firmware with the
original LSI firmware to solve some problems. Maybe that is a route that
is also possible? (At your own risk, obviously. I really don't know if
that is possible with this controller; don't blame me if you brick it!)
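
If you do consider reflashing, it is probably worth recording the exact
firmware the controller is running first, and checking whether the Dell
firmware exposes the JBOD property at all. A rough sketch (I don't have
an H710 at hand, so the grep patterns are a guess at the usual MegaCli
adapter-info output, and the JBOD line may simply not be present on the
Dell build):

# MegaCli -AdpAllInfo -aALL | grep -i -E 'product name|fw package|jbod'

If nothing JBOD-related shows up in the adapter info, that by itself is
a hint about what the Dell firmware allows.
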
> > # rados bench -p benchmark1 180 seq
> > -------------------------------------------------
> > Total time run:       76.481303
> > Total reads made:     15328
> > Read size:            4194304
> > Bandwidth (MB/sec):   801.660
> > Average Latency:      0.079814
> > Max latency:          0.317827
> > Min latency:          0.016857
> > -------------------------------------------------
> >
> > Now there should be no journaling bottleneck, since we are using
> > tmpfs, yet the write speed is still less than I would expect and the
> > SAS disks are barely busy according to iostat.
> >
> > So I thought it might be a disk bus throughput issue, and next I ran
> > some dd tests.
> >
> > The commands below are in a script, dd-x.sh, which executes the 11
> > readers or writers at once:
> >
> > dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=32k count=100k oflag=direct &
> >
> > This gives me an aggregated write throughput of around 1,135 MB/s.
> >
> > A similar script to test reads:
> >
> > dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=32k count=100k iflag=direct &
> >
> > This gives me an aggregated read throughput of around 1,382 MB/s.
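
A side note on the dd runs: one sequential stream per disk with
oflag=direct is close to a best case for the controller's write-back
cache. The OSDs also issue syncs and flushes, so as a rough (and
admittedly crude) worst-case comparison you could rerun the same script
with dsync added and a smaller count, along the lines of:

dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=10k oflag=direct,dsync &

If the aggregate drops off a cliff with dsync, that points more towards
how the controller handles flushes than towards raw disk throughput.
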
> > Next I'll lower the block size to show the results:
> >
> > dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=4k count=100k oflag=direct &
> > dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=4k count=100k oflag=direct &
> >
> > This gives me an aggregated write throughput of around 300 MB/s.
> >
> > dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> > dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=4k count=100k iflag=direct &
> >
> > This gives me an aggregated read throughput of around 430 MB/s.
> >
> > This is my ceph.conf; the only difference between the two configs is
> > journal dio = false:
> >
> > ----------------
> > [global]
> > auth cluster required = cephx
> > auth service required = cephx
> > auth client required = cephx
> > public network = 10.100.96.0/24
> > cluster network = 10.100.128.0/24
> > journal dio = false
> >
> > [mon]
> > mon data = /var/ceph/mon.$id
> >
> > [mon.a]
> > host = rbd01
> > mon addr = 10.100.96.10:6789
> >
> > [mon.b]
> > host = rbd02
> > mon addr = 10.100.96.11:6789
> >
> > [mon.c]
> > host = rbd03
> > mon addr = 10.100.96.12:6789
> >
> > [osd]
> > osd data = /srv/ceph/osd.$id
> > osd journal size = 1000
> > osd mkfs type = xfs
> > osd mkfs options xfs = "-f"
> > osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
> >
> > [osd.0]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sda5
> > [osd.1]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdb2
> > [osd.2]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdc2
> > [osd.3]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdd2
> > [osd.4]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sde2
> > [osd.5]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdf2
> > [osd.6]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdg2
> > [osd.7]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdh2
> > [osd.8]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdi2
> > [osd.9]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdj2
> > [osd.10]
> > host = rbd01
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdk2
> > [osd.11]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sda5
> > [osd.12]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdb2
> > [osd.13]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdc2
> > [osd.14]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdd2
> > [osd.15]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sde2
> > [osd.16]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdf2
> > [osd.17]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdg2
> > [osd.18]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdh2
> > [osd.19]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdi2
> > [osd.20]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdj2
> > [osd.21]
> > host = rbd02
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdk2
> > [osd.22]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sda5
> > [osd.23]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdb2
> > [osd.24]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdc2
> > [osd.25]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdd2
> > [osd.26]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sde2
> > [osd.27]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdf2
> > [osd.28]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdg2
> > [osd.29]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdh2
> > [osd.30]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdi2
> > [osd.31]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdj2
> > [osd.32]
> > host = rbd03
> > osd journal = /tmp/tmpfs/osd.$id
> > devs = /dev/sdk2
> > ---------------------
> >
> > Any ideas?
> >
> > Cheers,
> > Quenten
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html