Re: Ceph write performance and my Dell R515's

On 09/25/2013 06:46 PM, Quenten Grasso wrote:
G'day Mark,

I stumbled across an older thread you were involved in about CentOS and poor sequential write performance on the R515s.

Were you using CentOS or Ubuntu on your server at the time? (I'm wondering if this could be related to Ubuntu.)

Our R515s have been running Precise, but with various different kernels over the last year.


http://marc.info/?t=134819117000002&r=1&w=2

Also, I tried putting the RAID controller into JBOD mode as you suggested, but no joy. I also tried cross-flashing the card, as it's apparently a 9260, but we don't have any spare slots the RAID controller cables can reach outside of the storage slot, so that was a non-event :(

If you want to give it a try, here's what I did, assuming you have access to longer cables and/or another server you can put the PERC H700 into.

I downloaded this flashing kit (it has all of the tools), grabbed a FreeDOS USB stick, and copied it all onto that:

http://forums.laptopvideo2go.com/topic/29166-sas2108-lsi-9260-based-firmware-files/

Then I grabbed the latest 9260 firmware from:

http://www.lsi.com/downloads/Public/MegaRAID%20Common%20Files/12.13.0-0154_SAS_2108_Fw_Image_APP2.130.383-2315.zip

LSI cards are kind of goofy. There are apparently two different levels of flashing you can do, though the lower-level one appears to be a much better-kept secret, shrouded in mystery (at least to me). I also flashed one of our H700s but sadly didn't see much change in performance. Apparently that might not matter if the lower-level firmware is still Dell's. I admit this is all half hearsay, so I have no idea how accurate it is.

We did end up putting an Areca controller in the R515s and saw some improvement for large reads/writes (and with small IOs, sometimes worse performance!), so the controller is having an effect. In both cases, though, the system was quite a bit slower than a Supermicro node with an Intel processor and no expander backplane. I suspect (though have not proven) that it has less to do with the CPU and more to do with the expander/drive/controller combination.

At some point I want to see if we can put the H700 in our Supermicro node and see what happens, or get some breakout cables and an extra power supply and directly connect the drives to the controller in the R515.



*** Steps to Cross Flash ***
**** Disclaimer: you do this at your own risk. I take no responsibility if you brick your card, void your warranty, etc. ****

In a Dell R515, if you write the SBR of an LSI card (i.e. the 9260) and reboot the system, the system will be halted, as there's now a non-Dell card in the storage slot. However, if you attempt to flash the LSI firmware onto the PERC H700 without the correct SBR, it seems it won't flash correctly.

So if you have longer cables and/or another non-Dell server to try the H700 in, you can cross-flash the card.
(FYI: if you're trying to do this in a Dell and you fudge up, you can recover your system/RAID card by plugging it into another PCIe slot and reapplying the Dell H700 SBR/firmware.)
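
For example, a rough sketch of that recovery path, assuming you saved the Dell SBR as prch700.sbr (as in step 1 below) and have a copy of the Dell H700 firmware image on hand (the h700fw.rom name here is just a placeholder):

megarec -writesbr 0 prch700.sbr
megarec -m0flash 0 h700fw.rom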

Now, I'll assume you have one RAID controller in your system, so you only have adapter 0.

1) Back up your SBR in case you need to restore it, i.e.:

megarec -readsbr 0 prch700.sbr

2) Write the SBR of the card you want to flash, i.e.:

megarec -writesbr 0 sbr9260.bin

3) Erase the RAID controller BIOS/firmware:
megarec -cleanflash 0

4) Reboot

5) Flash the new firmware:
megarec -m0flash 0 mr2108fw.rom

6) Reboot & Done.

Also, if your command errors out halfway through flashing/erasing, run it again.

Regards,
Quenten Grasso

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Sunday, 22 September 2013 10:40 PM
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re:  Ceph write performance and my Dell R515's

On 09/22/2013 03:12 AM, Quenten Grasso wrote:

Hi All,

I'm finding my write performance is less than I would have expected.
After spending a considerable amount of time testing several different
configurations, I can never seem to break over ~360MB/s write, even
when using tmpfs for journaling.

So I've purchased 3x Dell R515s, each with 1 x AMD 6-core CPU, 12 x 3TB SAS
disks, 2 x 100GB Intel DC S3700 SSDs, 32GB RAM, the PERC H710p RAID
controller, and dual-port 10GbE network cards.

So first up, I realise the SSDs were a mistake; I should have bought
the 200GB ones, as they have considerably better write throughput
(~375 MB/s vs ~200 MB/s).

So, to our node configuration:

2 x 3TB disks in RAID1 for OS/MON, plus 1 partition on it for an OSD, and
each of the remaining disks in its own single-disk RAID0 (JBOD fashion)
with a 1MB stripe size.

(This part was particularly important: I found the stripe size matters
considerably even on a single-disk RAID0, contrary to what you might
read on the internet.)

Also, each disk is configured with write-back cache enabled and
read-ahead disabled.
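
For reference, that per-disk setup looks roughly like the following in MegaCli; the [32:2] enclosure:slot pair is a placeholder, and I'd double-check the exact option spelling against your controller before running it:

MegaCli -CfgLdAdd -r0[32:2] WB NORA Direct -strpsz1024 -a0
MegaCli -LDSetProp WB -LAll -a0
MegaCli -LDSetProp NORA -LAll -a0

The first line creates a single-disk RAID0 with a 1MB stripe, write-back cache, and no read-ahead; the two LDSetProp lines force those cache settings across all logical drives.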

For networking, all nodes are connected via an LACP bond with L3 hashing,
and using iperf I can get up to 16Gbit/s TX and RX between the nodes.
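
With L3 hashing a single TCP stream only ever rides one link, so that 16Gbit/s figure comes from parallel streams, along the lines of the following (10.100.128.11 standing in for a peer's cluster address):

iperf -s                      <- on the receiving node
iperf -c 10.100.128.11 -P 8   <- 8 parallel streams from the sender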

OS: Ubuntu 12.04.3 LTS w/ kernel 3.10.12-031012-generic (had to
upgrade the kernel due to 10Gbit Intel NIC driver issues).

So this gives me 11 OSDs & 2 SSDs per node.


I'm a bit leery about that 1 OSD on the RAID1. It may be fine, but you'll definitely want to do some investigation to make sure that OSD isn't holding the others back. iostat or collectl might be useful, along with the Ceph OSD admin socket and the dump_ops_in_flight and dump_historic_ops commands.
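
For example, on the node hosting osd.0, assuming the default admin socket path, something like:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

If the RAID1-backed OSD consistently shows the slowest ops, that's a decent hint it's dragging the others down.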

Next, I tried several different configurations, 2 of which I'll briefly
describe below.

1) Cluster Configuration 1:

33 OSDs with 6x SSDs as journals, w/ 15GB journals on SSD.

# ceph osd pool create benchmark1 1800 1800

# rados bench -p benchmark1 180 write --no-cleanup

--------------------------------------------------

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 objects
Total time run:         180.250417
Total writes made:      10152
Write size:             4194304
Bandwidth (MB/sec):     225.287
Stddev Bandwidth:       35.0897
Max bandwidth (MB/sec): 312
Min bandwidth (MB/sec): 0
Average Latency:        0.284054
Stddev Latency:         0.199075
Max latency:            1.46791
Min latency:            0.038512

--------------------------------------------------


What was your pool replication set to?
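
If you haven't changed it from the default, something like this will show it:

# ceph osd pool get benchmark1 size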

# rados bench -p benchmark1 180 seq

-------------------------------------------------

Total time run:       43.782554
Total reads made:     10120
Read size:            4194304
Bandwidth (MB/sec):   924.569
Average Latency:      0.0691903
Max latency:          0.262542
Min latency:          0.015756

-------------------------------------------------

In this configuration I found my write performance suffers a lot; the
SSDs seem to be the bottleneck, and my write performance using rados
bench was around 224-230MB/s.

2) Cluster Configuration 2:

33 OSDs with 1GB journals on tmpfs.
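
Setting up the tmpfs for the journals needs nothing more exotic than something like the following on each node; the 12g size here is illustrative, just enough headroom for 11 x 1GB journals plus slack:

# mount -t tmpfs -o size=12g tmpfs /tmp/tmpfs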

# ceph osd pool create benchmark1 1800 1800

# rados bench -p benchmark1 180 write --no-cleanup

--------------------------------------------------

Maintaining 16 concurrent writes of 4194304 bytes for up to 180 seconds or 0 objects
Total time run:         180.044669
Total writes made:      15328
Write size:             4194304
Bandwidth (MB/sec):     340.538
Stddev Bandwidth:       26.6096
Max bandwidth (MB/sec): 380
Min bandwidth (MB/sec): 0
Average Latency:        0.187916
Stddev Latency:         0.0102989
Max latency:            0.336581
Min latency:            0.034475

--------------------------------------------------


Definitely low, especially with journals on tmpfs. :( How are the CPUs doing at this point? We have some R515s in our lab, and they definitely are slow too. Ours have 7 OSD disks and 1 Dell branded SSD (usually
unused) each and can do about ~150MB/s writes per system. It's actually a puzzle we've been trying to solve for quite some time.

Some thoughts:

- Could the expander backplane be having issues due to having to tunnel STP for the SATA SSDs (or potentially be causing expander-wide resets)?
- Could the H700 (and apparently the H710) be doing something unusual that the stock LSI firmware handles better? We replaced the H700 with an Areca 1880 and definitely saw changes in performance (better large-IO throughput and worse IOPS), but the performance was still much lower than in a Supermicro node with no expanders in the backplane, using either an LSI 2208 or an Areca 1880.

Things you might want to try:

- Single-node tests; and if you have an alternate controller you can try, see if that works better.
- Removing the S3700s from the chassis entirely and retrying the tmpfs journal tests.
- Since the H710 is SAS2208-based, you may be able to use MegaCli to set it into JBOD mode and see if that works any better (it may, if you are using SSD or tmpfs-backed journals):

MegaCli -AdpSetProp -EnableJBOD -val -aN|-a0,1,2|-aALL
MegaCli -PDMakeJBOD -PhysDrv[E0:S0,E1:S1,...] -aN|-a0,1,2|-aALL
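
If you go down that road, it may help to list the physical drives first to find the [E:S] enclosure:slot pairs, and to check whether JBOD actually got enabled at the adapter level; I haven't verified these on an H710 specifically, so treat them as a starting point:

MegaCli -PDList -aALL
MegaCli -AdpGetProp -EnableJBOD -aALL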

# rados bench -p benchmark1 180 seq

-------------------------------------------------

Total time run:       76.481303
Total reads made:     15328
Read size:            4194304
Bandwidth (MB/sec):   801.660
Average Latency:      0.079814
Max latency:          0.317827
Min latency:          0.016857

-------------------------------------------------

Now it seems there is no bottleneck for journaling, as we are using
tmpfs; however, the write speed is still less than what I would expect,
and the SAS disks are barely busy via iostat.
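
I was watching the disks with plain iostat from sysstat, e.g.:

# iostat -xm 2

and utilisation on the SAS disks stays low for the whole bench.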

So I thought it might be a disk bus throughput issue.

Next I completed some dd tests...

The below is in a script, dd-x.sh, which executes the 11 readers or
writers at once.

dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=32k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=32k count=100k oflag=direct &

This gives me an aggregated write throughput of around 1,135 MB/s.
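
The same thing can be written as a loop, for anyone who wants to vary the block size without editing 11 lines; this is just an equivalent rewrite of dd-x.sh:

#!/bin/bash
# run 11 parallel direct-IO writers, one per OSD data filesystem
for i in $(seq 0 10); do
    dd if=/dev/zero of=/srv/ceph/osd.$i/ddfile bs=32k count=100k oflag=direct &
done
wait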

A similar script now to test reads:

dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=32k count=100k iflag=direct &
dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=32k count=100k iflag=direct &

This gives me an aggregated read throughput of around 1,382 MB/s.

Next, I'll lower the block size to show the results:

dd if=/dev/zero of=/srv/ceph/osd.0/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.1/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.2/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.3/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.4/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.5/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.6/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.7/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.8/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.9/ddfile bs=4k count=100k oflag=direct &
dd if=/dev/zero of=/srv/ceph/osd.10/ddfile bs=4k count=100k oflag=direct &

This gives me an aggregated write throughput of around 300 MB/s.

dd if=/srv/ceph/osd.0/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.1/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.2/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.3/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.4/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.5/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.6/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.7/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.8/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.9/ddfile of=/dev/null bs=4k count=100k iflag=direct &
dd if=/srv/ceph/osd.10/ddfile of=/dev/null bs=4k count=100k iflag=direct &

This gives me an aggregated read throughput of around 430 MB/s.

This is my ceph.conf; the only difference between the configs is the
journal dio = false:

----------------

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
public network = 10.100.96.0/24
cluster network = 10.100.128.0/24
journal dio = false

[mon]
mon data = /var/ceph/mon.$id

[mon.a]
host = rbd01
mon addr = 10.100.96.10:6789

[mon.b]
host = rbd02
mon addr = 10.100.96.11:6789

[mon.c]
host = rbd03
mon addr = 10.100.96.12:6789

[osd]
osd data = /srv/ceph/osd.$id
osd journal size = 1000
osd mkfs type = xfs
osd mkfs options xfs = "-f"
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"

[osd.0]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sda5

[osd.1]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdb2

[osd.2]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdc2

[osd.3]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdd2

[osd.4]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sde2

[osd.5]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdf2

[osd.6]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdg2

[osd.7]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdh2

[osd.8]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdi2

[osd.9]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdj2

[osd.10]
host = rbd01
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdk2

[osd.11]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sda5

[osd.12]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdb2

[osd.13]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdc2

[osd.14]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdd2

[osd.15]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sde2

[osd.16]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdf2

[osd.17]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdg2

[osd.18]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdh2

[osd.19]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdi2

[osd.20]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdj2

[osd.21]
host = rbd02
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdk2

[osd.22]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sda5

[osd.23]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdb2

[osd.24]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdc2

[osd.25]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdd2

[osd.26]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sde2

[osd.27]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdf2

[osd.28]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdg2

[osd.29]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdh2

[osd.30]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdi2

[osd.31]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdj2

[osd.32]
host = rbd03
osd journal = /tmp/tmpfs/osd.$id
devs = /dev/sdk2

---------------------

Any ideas?

Cheers,

Quenten


