Slow IOPS on RBD compared to journal and backing devices

chibi@xxxxxxx (Christian Balzer) · Thu, 8 May 2014 15:26:33 +0900

Hello,

On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:

> Oh, I didn't notice that. I bet you aren't getting the expected
> throughput on the RAID array with OSD access patterns, and that's
> applying back pressure on the journal.
>
I doubt that based on what I see in terms of local performance and actual
utilization figures according to iostat and atop during the tests.

But if that were to be true, how would one see if that's the case, as in
where in the plethora of data from:

 ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump

is the data I'd be looking for?
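
One way to start narrowing that down, as a rough sketch: it assumes the perf dump JSON reports its latency counters as objects carrying an `avgcount` and a `sum` field, and it simply ranks every such counter by its average instead of relying on particular counter names. The helper name `average_latencies` is made up for this example, and the counter names in the usage note are illustrative only.

```python
import json
import subprocess


def average_latencies(node, prefix=""):
    """Walk a perf-dump dict and return {dotted.name: sum / avgcount}.

    Assumption: latency-style counters look like {"avgcount": N, "sum": S}.
    Counters with avgcount == 0 are skipped to avoid dividing by zero.
    """
    results = {}
    if isinstance(node, dict):
        if "avgcount" in node and "sum" in node:
            if node["avgcount"]:
                results[prefix] = node["sum"] / node["avgcount"]
            return results
        for key, child in node.items():
            name = f"{prefix}.{key}" if prefix else key
            results.update(average_latencies(child, name))
    return results


if __name__ == "__main__":
    # Same admin-socket command as quoted above; adjust the socket path per OSD.
    raw = subprocess.check_output(
        ["ceph", "--admin-daemon", "/var/run/ceph/ceph-osd.0.asok", "perf", "dump"]
    )
    averages = average_latencies(json.loads(raw))
    # Largest averages first: the slowest stage floats to the top.
    for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{avg:12.6f}  {name}")
```

Called on a hand-made dict such as `{"filestore": {"journal_latency": {"avgcount": 4, "sum": 0.02}}}` it returns `{"filestore.journal_latency": 0.005}`. The counters are cumulative, so comparing a dump taken before a test run with one taken after is likely more telling than a single snapshot.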

> When I suggested other tests, I meant with and without Ceph. One
> particular one is OSD bench. That should be interesting to try at a
> variety of block sizes. You could also try running RADOS bench and
> smalliobench at a few different sizes.
>
I already did the local tests, as in w/o Ceph, see the original mail below.

And you might recall me doing rados benches as well in another thread 2
weeks ago or so.

In either case, osd benching gives me:
---
# time ceph tell osd.0 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "247102026.000000"}

real    0m4.483s
---
This is quite a bit slower than this particular SSD (200GB DC 3700) should
be able to write, but I will let that slide.
Note that it is the journal SSD that gets under pressure here (nearly 900%
util) while the OSD is bored at around 15%. Which is no surprise, as it
can write data at up to 1600MB/s. 

At 4k blocks we see:
---
# time ceph tell osd.0 bench 1073741824 4096
{ "bytes_written": 1073741824,
  "blocksize": 4096,
  "bytes_per_sec": "9004316.000000"}

real    1m59.368s
---
Here we get a more balanced picture between journal and storage
utilization, hovering around 40-50%. 
So clearly not overtaxing either component. 
And yet this works out to only about 2200 IOPS (9004316 / 4096), if my
math is half right.
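
For reference, the IOPS figure follows directly from the two bench outputs quoted above: divide `bytes_per_sec` by `blocksize`. The short sketch below only redoes that arithmetic; the function name `bench_iops` is made up for this example.

```python
def bench_iops(bytes_per_sec, blocksize):
    """Operations per second implied by an OSD bench result."""
    return bytes_per_sec / blocksize


TOTAL_BYTES = 1073741824  # bytes_written in both runs above

# 4k run: bytes_per_sec 9004316, blocksize 4096
iops_4k = bench_iops(9004316, 4096)
elapsed_4k = TOTAL_BYTES / 9004316

# Default 4M run: bytes_per_sec 247102026, blocksize 4194304
iops_4m = bench_iops(247102026, 4194304)

print(f"4k: {iops_4k:.0f} IOPS over {elapsed_4k:.1f} s")
print(f"4M: {iops_4m:.1f} IOPS")
```

The 4k run comes out just under 2200 IOPS, and the implied run time of roughly 119 s lines up with the 1m59s wall-clock time reported by `time`.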

Rados at 4k gives us this:
---
Total time run:         30.912786
Total writes made:      44490
Write size:             4096
Bandwidth (MB/sec):     5.622 

Stddev Bandwidth:       3.31452
Max bandwidth (MB/sec): 9.92578
Min bandwidth (MB/sec): 0
Average Latency:        0.0444653
Stddev Latency:         0.121887
Max latency:            2.80917
Min latency:            0.001958
--- 
So this is even worse, roughly 1440 IOPS (44490 writes in 30.9 seconds).
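
As a cross-check of the rados bench summary, the sketch below recomputes IOPS and bandwidth from the totals, then multiplies IOPS by the average latency to estimate how many ops were in flight (Little's law). The concurrency used for the run is not stated in this mail, so the last figure is an inference, not a reported value.

```python
# Values copied from the rados bench output above.
total_time_s = 30.912786
total_writes = 44490
write_size = 4096
avg_latency_s = 0.0444653

iops = total_writes / total_time_s
bandwidth_mb = iops * write_size / (1024 * 1024)

# Little's law: ops in flight = throughput * average latency.
ops_in_flight = iops * avg_latency_s

print(f"{iops:.0f} IOPS, {bandwidth_mb:.3f} MB/s, ~{ops_in_flight:.0f} ops in flight")
```

The recomputed bandwidth agrees with the reported 5.622 MB/sec, and the in-flight estimate lands at about 64, which would be consistent with a run using 64 concurrent ops.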

Regards,

Christian

> -Greg
> 
> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
> wrote:
> 
> > Hi Christian,
> >
> > Have you tried without RAID6, to have more OSDs?
> > (How many disks do you have behind the RAID6?)
> >
> >
> > Also, I know that direct I/O can be quite slow with Ceph,
> >
> > so maybe you can try without --direct=1
> >
> > and also enable rbd_cache
> >
> > ceph.conf
> > [client]
> > rbd cache = true
> >
> >
> >
> >
> > ----- Original Message -----
> >
> > From: "Christian Balzer" <chibi at gol.com>
> > To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
> > Sent: Thursday, 8 May 2014 04:49:16
> > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> >
> > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> >
> > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
> > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
> > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6
> > > > behind an Areca 1882 with 4GB of cache.
> > > >
> > > > Running this fio:
> > > >
> > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > --iodepth=128
> > > >
> > > > results in:
> > > >
> > > > 30k IOPS on the journal SSD (as expected)
> > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise
> > > > there)
> > > > 3200 IOPS from a VM using userspace RBD
> > > > 2900 IOPS from a host kernelspace mounted RBD
> > > >
> > > > When running the fio from the VM RBD the utilization of the
> > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > (1500 IOPS after some obvious merging).
> > > > The OSD processes are quite busy, reading well over 200% on atop,
> > > > but the system is not CPU or otherwise resource starved at that
> > > > moment.
> > > >
> > > > Running multiple instances of this test from several VMs on
> > > > different hosts changes nothing, as in the aggregated IOPS for the
> > > > whole cluster will still be around 3200 IOPS.
> > > >
> > > > Now clearly RBD has to deal with latency here, but the network is
> > > > IPoIB with the associated low latency and the journal SSDs are the
> > > > (consistently) fastest ones around.
> > > >
> > > > I guess what I am wondering about is whether this is normal and to
> > > > be expected or, if not, where all that potential performance got lost.
> > >
> > > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
> > Yes, but going down to 32 doesn't change things one iota.
> > Also note the multiple instances I mention up there, so that would be
> > 256 IOs at a time, coming from different hosts over different links and
> > nothing changes.
> >
> > > that's about 40ms of latency per op (for userspace RBD), which seems
> > > awfully long. You should check what your client-side objecter
> > > settings are; it might be limiting you to fewer outstanding ops than
> > > that.
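
The 40ms estimate is what Little's law gives for a saturated queue: average latency per op is the number of outstanding IOs divided by the achieved IOPS. A minimal sketch with the numbers from this thread (the function name `per_op_latency_ms` is made up for this example):

```python
def per_op_latency_ms(queue_depth, iops):
    """Average per-op latency in ms implied by a full queue (Little's law)."""
    return queue_depth / iops * 1000.0


# fio ran with --iodepth=128; userspace RBD reached 3200 IOPS.
print(per_op_latency_ms(128, 3200))   # userspace RBD
print(per_op_latency_ms(128, 2900))   # kernelspace RBD
# Same IOPS at iodepth 32 would imply a quarter of the latency.
print(per_op_latency_ms(32, 3200))
```

This only holds if the client really keeps the full queue depth in flight. If something on the client side caps the number of outstanding ops, the true per-op latency is lower than the figure derived from the fio iodepth.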
> >
> > Googling for client-side objecter gives a few hits on ceph devel and
> > bugs and nothing at all as far as configuration options are concerned.
> > Care to enlighten me where one can find those?
> >
> > Also note the kernelspace (3.13 if it matters) speed, which is very
> > much in the same (junior league) ballpark.
> >
> > > If it's available to you, testing with Firefly or even master would
> > > be interesting; there's some performance work that should reduce
> > > latencies.
> > >
> > Not an option, this is going into production next week.
> >
> > > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > > certainly doesn't require 40ms/op, so you should probably run a wider
> > > array of experiments to try and figure out where it's coming from.
> >
> > I think we can rule out the network, NPtcp gives me:
> > ---
> > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> > ---
> >
> > For comparison at about 512KB it reaches maximum throughput and still
> > isn't that laggy:
> > ---
> > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> > ---
> >
> > So with the network performing as well as my lengthy experience with
> > IPoIB led me to believe, what else is there to look at?
> > The storage nodes perform just as expected, indicated by the local fio
> > tests.
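
To put the NPtcp numbers in proportion, the sketch below compares the 4096-byte transfer time with the roughly 40 ms per op discussed earlier in the thread, and recomputes the reported throughput from size and time. It assumes NPtcp's "Mbps" means 2^20 bits per second, which is inferred here only because that convention reproduces the printed figures.

```python
def netpipe_mbps(size_bytes, time_usec):
    """Throughput from a NetPIPE line, assuming 1 Mbps = 2**20 bits/s."""
    return size_bytes * 8 / (time_usec * 1e-6) / 2**20


# Lines 56 and 98 of the NPtcp output quoted above.
mbps_4k = netpipe_mbps(4096, 31.91)
mbps_512k = netpipe_mbps(524288, 412.35)

# Share of a 40 ms op that one 4k network transfer would account for.
per_op_latency_s = 0.040
network_share = 31.91e-6 / per_op_latency_s

print(f"4k: {mbps_4k:.0f} Mbps, 512k: {mbps_512k:.0f} Mbps")
print(f"network share of a 40 ms op: {network_share:.2%}")
```

Even allowing for several network hops per write (client to primary OSD, primary to replica, and the acknowledgements back), the wire time stays far below one percent of the observed per-op latency.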
> >
> > That pretty much leaves only Ceph/RBD to look at and I'm not really
> > sure what experiments I should run on that. ^o^
> >
> > Regards,
> >
> > Christian
> >
> > > -Greg
> > > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > >
> >
> >
> > --
> > Christian Balzer Network/Systems Engineer
> > chibi at gol.com Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/