Re: RBD with SSD journals and SAS OSDs

Christian Balzer <chibi@xxxxxxx> · Mon, 17 Oct 2016 11:47:31 +0900

Hello,

On Sun, 16 Oct 2016 19:07:17 +0800 William Josefsson wrote:

> Ok thanks for sharing. yes my journals are Intel S3610 200GB, which I
> partition in 4 partitions each ~45GB. When I ceph-deploy I declare
> these as the journals of the OSDs.
>
Only a small fraction of those 45GB journals is ever going to be used,
unlikely to be more than 1GB in normal operation with default
filestore/journal parameters.

That is because those defaults start flushing things (from RAM, the journal
never gets read unless there is a crash) to the filestore (OSD HDD) pretty
much immediately.
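
To put a rough number on that: the Ceph filestore documentation suggests
sizing a journal as 2 * (expected throughput * filestore max sync interval).
A minimal Python sketch of that arithmetic follows. The 100 MB/s write rate
is purely an assumed figure for one 10k rpm SAS drive, not something
measured on this cluster, and the function name is made up for illustration.

```python
# Rough upper bound on how much of a filestore journal is actually in use.
# Rule of thumb: 2 * sustained write throughput * filestore max sync interval.

def journal_estimate_mb(throughput_mb_s: float, sync_interval_s: float = 5.0) -> float:
    """Return an approximate upper bound on journal usage in MB."""
    return 2 * throughput_mb_s * sync_interval_s

# 5 s is the default "filestore max sync interval".
# 100 MB/s is an ASSUMED sustained write rate for a single OSD HDD.
used_mb = journal_estimate_mb(throughput_mb_s=100.0)
partition_mb = 45 * 1024  # one ~45GB journal partition

print(f"~{used_mb:.0f} MB of {partition_mb} MB "
      f"({100 * used_mb / partition_mb:.1f}% of the partition)")
```

With those assumptions the estimate lands around 1GB, a few percent of a
45GB partition.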

Again, use google to search the ML archives.

> I was trying to understand the blocking, and how much my SAS OSDs
> affected my performance. I have a total of 9 hosts, 158 OSDs each
> 1.8TB. The Servers are connected through copper 10Gbit LACP bonds.
> My failure domain is by type RACK. The CRUSH rule set is by rack. 3
> hosts in each rack. Pool size is =3. I'm running hammer on centos7.
> 

Which leads to the request to fully detail your HW (CPUs, RAM), network
(topology, which switches, inter-rack/switch links), etc.
The reason for this will become obvious below.

> I did a simple fio test from one of my xl instances, and got the
> results below. The Latency 7.21ms is worrying, is this expected
> results? Or is there any way I can further tune my cluster to achieve
> better results? thx will
> 

> FIO: sync=1, direct=1, bs=4k
>
Full command line, please.

Small, sync I/Os are by far the hardest thing for Ceph.

I can guess what some of the rest was, but it's better to know for sure.
Alternatively, or additionally, please try this:

"fio --size=1G --ioengine=libaio --invalidate=1  --direct=1 --numjobs=1
--rw=randwrite --name=fiojob --blocksize=4K --iodepth=32"

> 
> write-50: (groupid=11, jobs=50): err= 0: pid=3945: Sun Oct 16 08:41:15 2016
>   write: io=832092KB, bw=27721KB/s, iops=6930, runt= 30017msec
>     clat (msec): min=2, max=253, avg= 7.21, stdev= 4.97
>      lat (msec): min=2, max=253, avg= 7.21, stdev= 4.97

These numbers suggest you did randwrite and aren't all that surprising.
If you were to run atop on your OSD nodes while doing that fio run, you
would likely see that both the CPUs and individual disks (HDDs) get very
busy.
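
As a sanity check, the quoted fio result is consistent with Little's law
(IOPS is roughly outstanding I/Os divided by average latency). A small
Python sketch using only the figures from the output above; the variable
names are mine:

```python
# Cross-check of the quoted fio result using Little's law:
#   IOPS ~= outstanding I/Os / average latency
jobs = 50            # "jobs=50" in the fio output
iodepth = 1          # "IO depths: 1=100.0%"
avg_lat_s = 7.21e-3  # "lat (msec): avg= 7.21"
block_kb = 4         # bs=4k

iops = jobs * iodepth / avg_lat_s   # aggregate IOPS across all jobs
bw_kb_s = iops * block_kb           # aggregate bandwidth in KB/s
per_job_iops = 1 / avg_lat_s        # what a single sync writer achieves

print(f"expected IOPS ~{iops:.0f} (fio reported 6930)")
print(f"expected bandwidth ~{bw_kb_s:.0f} KB/s (fio reported 27721)")
print(f"a single job manages only ~{per_job_iops:.0f} IOPS")
```

In other words, the aggregate IOPS here is set almost entirely by the
per-write latency, so lowering that latency is the lever that matters.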

There are several things conspiring against Ceph here: the latency of its
own code, the network latency of getting all the individual writes to each
replica, and the fact that 1000 of these 4K blocks will hit one typical RBD
object (4MB) and thus one PG, making 3 OSDs very busy, etc.
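
The block-to-object arithmetic can be sketched as below. This assumes the
default 4MB RBD object size and no custom striping, and it ignores how the
guest filesystem lays the test file out on the image; the helper name is
hypothetical.

```python
# How many 4K writes fit into one RBD object, and which object an offset hits.
# ASSUMPTION: default 4 MiB RBD objects, default striping.
OBJECT_SIZE = 4 * 1024 * 1024
BLOCK_SIZE = 4 * 1024

def object_index(offset_bytes: int, object_size: int = OBJECT_SIZE) -> int:
    """Index of the RBD object that holds the given image offset."""
    return offset_bytes // object_size

blocks_per_object = OBJECT_SIZE // BLOCK_SIZE        # 4K blocks per object
objects_in_1g = (1 * 1024**3) // OBJECT_SIZE         # objects behind a 1G file

print(f"{blocks_per_object} 4K blocks share one object (and one PG)")
print(f"a contiguous 1G test file spans about {objects_in_1g} objects")
```

So a small test file only ever touches a limited set of objects and PGs,
however random the writes inside it are.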

If you absolutely need low latencies with Ceph, consider dedicated SSD-only
pools for special-need applications (DB), or a cache tier if it fits the
profile and active working set.
Lower Ceph latency in general by having fast CPUs which have powersaving
(frequency throttling) disabled or set to "performance" instead of
"ondemand".
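
To see what the OSD nodes are currently set to, something like the
following Python sketch can be run on each node. It only reads the standard
Linux cpufreq sysfs files; the function name is mine, and nodes without
cpufreq support simply yield no output.

```python
# List the cpufreq governor of every CPU via the Linux sysfs layout.
from pathlib import Path

def read_governors(base: str = "/sys/devices/system/cpu") -> dict:
    """Map CPU name (e.g. 'cpu0') to its current scaling governor."""
    governors = {}
    for path in sorted(Path(base).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        # path is <base>/cpuN/cpufreq/scaling_governor, so go up two levels
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors

if __name__ == "__main__":
    for cpu, gov in read_governors().items():
        flag = "" if gov == "performance" else "  <-- not 'performance'"
        print(f"{cpu}: {gov}{flag}")
```

How the governor is then changed depends on the distribution; cpupower
("cpupower frequency-set -g performance") is one common tool for it.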

Christian

>     clat percentiles (msec):
>      |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
>      | 30.00th=[    5], 40.00th=[    6], 50.00th=[    7], 60.00th=[    8],
>      | 70.00th=[    9], 80.00th=[   10], 90.00th=[   12], 95.00th=[   14],
>      | 99.00th=[   17], 99.50th=[   19], 99.90th=[   21], 99.95th=[   23],
>      | 99.99th=[  253]
>     bw (KB  /s): min=  341, max=  870, per=2.01%, avg=556.60, stdev=136.98
>     lat (msec) : 4=8.24%, 10=74.10%, 20=17.52%, 50=0.12%, 500=0.02%
>   cpu          : usr=0.04%, sys=0.23%, ctx=425242, majf=0, minf=1570
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=208023/d=0, short=r=0/w=0/d=0
> 
> On Sun, Oct 16, 2016 at 4:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Sun, 16 Oct 2016 15:03:24 +0800 William Josefsson wrote:
> >
> >> Hi list, while I know that writes in the RADOS backend are sync() can
> >> anyone please explain when the cluster will return on a write call for
> >> RBD from VMs? Will data be considered synced one written to the
> >> journal or all the way to the OSD drive?
> >>
> > This has been answered countless times (really) here; the Ceph
> > Architecture documentation should really be more detailed about this, as
> > well as how parallel the data is being sent to the secondary OSDs.
> >
> > It is of course ack'ed to the client once all journals have successfully
> > written the data, otherwise journal SSDs would make a LOT less sense.
> >
> >> Each host in my cluster has 5x Intel S3610, and 18x1.8TB Hitachi 10krpm SAS.
> >>
> > The size of your SSDs (which you didn't mention) will determine their
> > speed; for journal purposes the sequential write speed is basically it.
> >
> > A 5:18 ratio implies that some of your SSDs hold more journals than others.
> >
> > You emphatically do NOT want that, because eventually the busier ones will
> > run out of endurance while the other ones still have plenty left.
> >
> > If possible change this to a 5:20 or 6:18 ratio (depending on your SSDs
> > and expected write volume).
> >
> > Christian
> >> I have size=3 for my pool. Will Ceph return once the data is written
> >> to at least 3 designated journals, or will it in fact wait until the
> >> data is written to the OSD drives? thx will
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/