Re: Understanding write performance

Hi Christian,
Thank you for the follow-up on this. 
 
I answered those questions inline below.
 
Have a good day,
 
Lewis George
 

From: "Christian Balzer" <chibi@xxxxxxx>
Sent: Thursday, August 18, 2016 6:31 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: "lewis.george@xxxxxxxxxxxxx" <lewis.george@xxxxxxxxxxxxx>
Subject: Re: Understanding write performance
 

Hello,

On Thu, 18 Aug 2016 12:03:36 -0700 lewis.george@xxxxxxxxxxxxx wrote:

>> Hi,
>> So, I have really been trying to find information about this without
>> annoying the list, but I just can't seem to get any clear picture of it. I
>> was going to try to search the mailing list archive, but it seems there is
>> an error when trying to search it right now (posting below, and sending to
>> the listed address in error).
>>
>Google (as in all the various archives of this ML) works well for me,
>as always the results depend on picking "good" search strings.
>
>> I have been working for a couple of months now (slowly) on testing out
>> Ceph. I only have a small PoC setup. I have 6 hosts, but I am only using 3
>> of them in the cluster at the moment. They each have 6 SSDs (only 5 usable
>> by Ceph), but the networks (1 public, 1 cluster) are only 1Gbps. I have the
>> MONs running on the same 3 hosts, and I have an OSD process running for
>> each of the 5 disks per host. The cluster shows in good health, with 15
>> OSDs. I have one pool there, the default rbd, which I set up with 512 PGs.
>>
>Exact SSD models, please.
>Also CPU, though at 1GbE that isn't going to be your problem.
 
#Lewis: Each SSD is this model:
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 PRO Series
 
Each of the 3 nodes has 2 x Intel E5645, with 48GB of memory.

>> I have created an rbd image on the pool, and I have it mapped and mounted
>> on another client host.
>Mapped via the kernel interface?
 
#Lewis: On the client node (which has the same specs as the other 3), I used the 'rbd map' command to map a 100GB rbd image to rbd0, then created an XFS filesystem on it and mounted it.
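 
For reference, the sequence was roughly the following (the image name here is just illustrative; the mount point matches the dd paths further down):
 
rbd create rbd/test1 --size 102400
rbd map rbd/test1
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/set1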

>>When doing write tests, like with 'dd', I am
>> getting rather spotty performance.
>Example dd command line please.
 
#Lewis: I put those below.

>> Not only is it up and down, but even
>> when it is up, the performance isn't that great. On largeish (4GB
>> sequential) writes, it averages about 65MB/s, and on repeated smaller (40MB)
>> sequential writes, it jumps around between 20MB/s and 80MB/s.
>>
>Monitor your storage nodes during these test runs with atop (or iostat)
>and see how busy your actual SSDs are then.
>Also test with "rados bench" to get a base line.
 
#Lewis: I have all the nodes instrumented with collectd. I am seeing each disk writing at only ~25MB/s during the write tests. I will check out the 'rados bench' command, as I have not tried it yet.
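 
If I am reading the man page right, something along these lines should give a baseline for the pool (the -b/-t values are just a first guess on my part):
 
rados bench -p rbd 60 write -b 4194304 -t 16 --no-cleanup
rados bench -p rbd 60 seq -t 16
rados -p rbd cleanup
 
I can also watch the disks on the OSD nodes during the runs with something like 'iostat -xm 2'.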

>> However, with read tests, I am able to completely max out the network
>> there, easily reaching 125MB/s. Tests on the disks directly are able to get
>> up to 550MB/s reads and 350MB/s writes. So, I know it isn't a problem with
>> the disks.
>>
>How did you test these speeds? Exact command lines, please.
>There are SSDs that can write very fast with buffered I/O but are
>abysmally slow with sync/direct I/O.
>Which is what Ceph journals use.
 
#Lewis: I have mostly been testing with just dd, though I have also run several fio tests. With dd, I have tested writing 4GB files with both 4k and 1M block sizes (I get about the same results, on average).
 
dd if=/dev/zero of=/mnt/set1/testfile700 bs=4k count=1000000 conv=fsync
dd if=/dev/zero of=/mnt/set1/testfile700 bs=1M count=4000 conv=fsync
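 
If it would help separate page-cache effects from sync behaviour, I can re-run the same tests with direct/sync flags against the same file, e.g.:
 
dd if=/dev/zero of=/mnt/set1/testfile700 bs=1M count=4000 oflag=direct
dd if=/dev/zero of=/mnt/set1/testfile700 bs=4k count=100000 oflag=dsync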

>See the various threads in here and the "classic" link:
>https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
#Lewis: I have been reading over a lot of his articles. They are really good. I had not seen that one. Thank you for pointing it out.
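 
If I follow that article correctly, the journal-style test boils down to a single-job sync write with fio, something like the following (run against a spare, unused device or partition, since writing to the raw device is destructive; /dev/sdX is a placeholder):
 
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
 
I will try that against one of the 840 PROs before trusting the raw dd numbers from the disks.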

>> I guess my question is, are there any additional optimizations or tuning I
>> should review here? I have read over all the docs, but I don't know which,
>> if any, of the values would need tweaking. Also, I am not sure if this is
>> just how it is with Ceph, given the need to write multiple copies of each
>> object. Is the slower write performance (averaging ~1/2 of the network
>> throughput) to be expected? I haven't seen any clear answer on that in the
>> docs or in the articles I have found. So, I am not sure if my
>> expectation is just wrong.
>>
>While the replication incurs some performance penalties, this is mostly an
>issue with small I/Os, not the type of large sequential writes you're
>doing.
>I'd expect a setup like yours to deliver more or less full line speed, if
>your network and SSDs are working correctly.
>
>In my crappy test cluster with an identical network setup to yours, 4
>nodes with 4 crappy SATA disks each (so 16 OSDs), I can get better and
>more consistent write speed than you, around 100MB/s.
>
>Christian
>
>> Anyway, some basic idea on those concepts or some pointers to some good
>> docs or articles would be wonderful. Thank you!
>>
>> Lewis George
>>
>>
>>
>
>
>--
>Christian Balzer Network/Systems Engineer
>chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
>http://www.gol.com/
 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
