Re: Understanding write performance

Hello,

see below, inline.

On Thu, 18 Aug 2016 21:41:33 -0700 lewis.george@xxxxxxxxxxxxx wrote:

> Hi Christian,
>  Thank you for the follow-up on this. 
>   
>  I answered those questions inline below.
>   
>  Have a good day,
>   
>  Lewis George
>   
> 
> ----------------------------------------
>  From: "Christian Balzer" <chibi@xxxxxxx>
> Sent: Thursday, August 18, 2016 6:31 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: "lewis.george@xxxxxxxxxxxxx" <lewis.george@xxxxxxxxxxxxx>
> Subject: Re:  Understanding write performance   
> 
> Hello,
> 
> On Thu, 18 Aug 2016 12:03:36 -0700 lewis.george@xxxxxxxxxxxxx wrote:
> 
> >> Hi,
> >> So, I have really been trying to find information about this without
> >> annoying the list, but I just can't seem to get any clear picture of it. I
> >> was going to try to search the mailing list archive, but it seems there is
> >> an error when trying to search it right now(posting below, and sending to
> >> listed address in error).
> >>
> >Google (as in all the various archives of this ML) works well for me,
> >as always the results depend on picking "good" search strings.
> >
> >> I have been working for a couple of months now(slowly) on testing out
> >> Ceph. I only have a small PoC setup. I have 6 hosts, but I am only using 3
> >> of them in the cluster at the moment. They each have 6xSSDs(only 5 usable
> >> by Ceph), but the networks(1 public, 1 cluster) are only 1Gbps. I have the
> >> MONs running on the same 3 hosts, and I have an OSD process running for
> >> each of the 5 disks per host. The cluster shows in good health, with 15
> >> OSDs. I have one pool there, the default rbd, which I setup with 512 PGs.
> >>
> >Exact SSD models, please.
> >Also CPU, though at 1GbE that isn't going to be your problem.
>   
>  #Lewis: Each SSD is of model:
>  Model Family:     Samsung based SSDs
> Device Model:     Samsung SSD 840 PRO Series
> 
Consumer model, known to be deadly slow with dsync/direct writes.

And even if it didn't have those issues, endurance would make it a no-go
outside a PoC environment. 

>  Each of the 3 nodes has 2 x Intel E5645, with 48GB of memory.
>
That's plenty then.
 
> >> I have created an rbd image on the pool, and I have it mapped and mounted
> >> on another client host.
>  >Mapped via the kernel interface?
>   
>  # Lewis: On the client node (which has the same specs as the 3 others), I used 
> the 'rbd map' command to map a 100GB rbd image to rbd0, then created an xfs 
> FS on there, and mounted it.
>
Kernel then, OK.
 
> >>When doing write tests, like with 'dd', I am
> >> getting rather spotty performance.
> >Example dd command line please.
>   
>  #Lewis: I put those below.
> 
> >> Not only is it up and down, but even
> >> when it is up, the performance isn't that great. On large'ish(4GB
> >> sequential) writes, it averages about 65MB/s, and on repeated smaller(40MB)
> >> sequential writes, it is jumping around between 20MB/s and 80MB/s.
> >>
> >Monitor your storage nodes during these test runs with atop (or iostat)
> >and see how busy your actual SSDs are then.
> >Also test with "rados bench" to get a base line.
>   
>  #Lewis: I have all the nodes instrumented with collectd. I am seeing each 
> disk only writing at ~25MB/s during the write tests. 
That won't show you how busy the drives actually are; in fact, I'm not aware
of any collectd plugin that will give you that info.

Use atop (or iostat) locally, as I said, though of course I already have a
good idea what the output will be.
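If you want concrete numbers, something along these lines on each storage
node while the dd from the client is running will show per-device
utilization and latency (sdb..sdf here are placeholders for your five OSD
SSDs):

  iostat -xm 2 sdb sdc sdd sde sdf

The %util and await columns are the interesting ones; a drive sitting near
100% util while only writing ~25MB/s is the classic sign of slow sync writes.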

> I will check out the 
> 'rados bench' command, as I have not checked it yet.
> 
It will be in the same ballpark, now that we know what SSDs you have.
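If you want the baseline numbers anyway, something like this should do
(assuming the pool is still the default 'rbd' one):

  rados bench -p rbd 60 write -t 16 --no-cleanup
  rados bench -p rbd 60 seq -t 16
  rados -p rbd cleanup

The write phase uses 4MB objects by default, so it is roughly comparable to
your large sequential dd runs.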

> >> However, with read tests, I am able to completely max out the network
> >> there, easily reaching 125MB/s. Tests on the disks directly are able to get
> >> up to 550MB/s reads and 350MB/s writes. So, I know it isn't a problem with
> >> the disks.
> >>
> >How did you test these speeds, exact command line please.
> >There are SSDs that can write very fast with buffered I/O but are
> >abysmally slow with sync/direct I/O.
> >Which is what Ceph journals use.
>   
>  #Lewis: I have mostly been testing with just dd, though I have also tested 
> using several fio tests too. With dd, I have tested writing 4GB files, with 
> both 4k and 1M block sizes(get about the same results, on average).
>   
>  dd if=/dev/zero of=/mnt/set1/testfile700 bs=4k count=1000000 conv=fsync
>  dd if=/dev/zero of=/mnt/set1/testfile700 bs=1M count=4000 conv=fsync
> 

You're using fsync, but as per the article cited below, that is not what the
journal code uses (it does direct, dsync writes).
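For comparison, the test from that article is essentially this (sdX is a
placeholder; it writes straight to the device and destroys whatever is on
it, so only point it at an unused SSD or a spare partition):

  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

A journal-suitable DC-class SSD stays fast here, while consumer drives like
the 840 Pro typically drop to a few MB/s in this test.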

> >See the various threads in here and the "classic" link:
> >https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>   
>  #Lewis: I have been reading over a lot of his articles. They are really 
> good. I did not see that one. Thank you for pointing it out.
> 
I wouldn't trust all the results and numbers there; some of them are
clearly wrong or were taken with differing methodologies.

But it's pretty obvious the Samsung Pro/EVO models aren't suitable.
This is also where google comes in:
https://forum.proxmox.com/threads/slow-ceph-journal-on-samsung-850-pro.27733/

Christian

> >> I guess my question is, is there any additional optimizations or tuning I
> >> should review here. I have read over all the docs, but I don't know which,
> >> if any, of the values would need tweaking. Also, I am not sure if this is
> >> just how it is with Ceph, given the need to write multiple copies of each
> >> object. Is the slower write performance(averaging ~1/2 of the network
> >> throughput) to be expected? I haven't seen any clear answer on that in the
> >> docs or in articles I have found around. So, I am not sure if my
> >> expectation is just wrong.
> >>
> >While the replication incurs some performance penalties, this is mostly an
> >issue with small I/Os, not the type of large sequential writes you're
> >doing.
> >I'd expect a setup like yours to deliver more or less full line speed, if
> >your network and SSDs are working correctly.
> >
> >In my crappy test cluster with an identical network setup to yours, 4
> >nodes with 4 crappy SATA disks each (so 16 OSDs), I can get better and
> >more consistent write speed than you, around 100MB/s.
>  >
> >Christian
> >
> >> Anyway, some basic idea on those concepts or some pointers to some good
> >> docs or articles would be wonderful. Thank you!
> >>
> >> Lewis George
> >>
> >>
> >>
> >
> >
> >--
> >Christian Balzer Network/Systems Engineer
> >chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> >http://www.gol.com/
>  
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


