Oh, I didn't notice that. I bet you aren't getting the expected throughput
on the RAID array with OSD access patterns, and that's applying back
pressure on the journal.

When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try running RADOS bench and
smalliobench at a few different sizes.
-Greg

On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> wrote:

> Hi Christian,
>
> Have you tried without raid6, to get more OSDs?
> (How many disks do you have behind the raid6?)
>
> Also, I know that direct IOs can be quite slow with ceph,
> so maybe you can try without --direct=1
>
> and also enable rbd_cache
>
> ceph.conf
> [client]
> rbd cache = true
>
>
> ----- Original Message -----
>
> From: "Christian Balzer" <chibi at gol.com>
> To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
> Sent: Thursday, 8 May 2014 04:49:16
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>
> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>
> > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com> wrote:
> > >
> > > Hello,
> > >
> > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
> > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
> > > an Areca 1882 with 4GB of cache.
> > >
> > > Running this fio:
> > >
> > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
> > >
> > > results in:
> > >
> > > 30k IOPS on the journal SSD (as expected)
> > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
> > > 3200 IOPS from a VM using userspace RBD
> > > 2900 IOPS from a host kernelspace mounted RBD
> > >
> > > When running the fio from the VM RBD, the utilization of the journals
> > > is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
> > > some obvious merging).
> > > The OSD processes are quite busy, reading well over 200% on atop, but
> > > the system is not CPU or otherwise resource starved at that moment.
> > >
> > > Running multiple instances of this test from several VMs on different
> > > hosts changes nothing, as in the aggregated IOPS for the whole cluster
> > > will still be around 3200 IOPS.
> > >
> > > Now clearly RBD has to deal with latency here, but the network is
> > > IPoIB with the associated low latency, and the journal SSDs are the
> > > (consistently) fastest ones around.
> > >
> > > I guess what I am wondering about is whether this is normal and to be
> > > expected, or if not, where all that potential performance got lost.
> >
> > Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>
> Yes, but going down to 32 doesn't change things one iota.
> Also note the multiple instances I mention up there, so that would be
> 256 IOs at a time, coming from different hosts over different links, and
> nothing changes.
>
> > that's about 40ms of latency per op (for userspace RBD), which seems
> > awfully long. You should check what your client-side objecter settings
> > are; it might be limiting you to fewer outstanding ops than that.
>
> Googling for client-side objecter gives a few hits on ceph devel and
> bugs and nothing at all as far as configuration options are concerned.
> Care to enlighten me where one can find those?
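For reference, the "client-side objecter settings" in question appear to be the
throttles sketched below. The option names exist in Ceph, but the values shown
are only the commonly quoted defaults and may not match 0.72, so treat this as
something to verify rather than a definitive answer:

[client]
# Assumed defaults; verify against your version (e.g. `ceph --show-config | grep objecter`).
# Maximum number of in-flight ops the client objecter will allow:
objecter inflight ops = 1024
# Maximum bytes of in-flight ops (~100 MB):
objecter inflight op bytes = 104857600

If those defaults are accurate, 128 (or even 256) outstanding IOs would not hit
the ops throttle, so this is cheap to rule out but unlikely to be the whole story.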
> Also note the kernelspace (3.13 if it matters) speed, which is very much
> in the same (junior league) ballpark.
>
> > If it's available to you, testing with Firefly or even master would be
> > interesting; there's some performance work that should reduce
> > latencies.
>
> Not an option, this is going into production next week.
>
> > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > certainly doesn't require 40ms/op, so you should probably run a wider
> > array of experiments to try and figure out where it's coming from.
>
> I think we can rule out the network; NPtcp gives me:
> ---
> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> ---
>
> For comparison, at about 512KB it reaches maximum throughput and still
> isn't that laggy:
> ---
> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> ---
>
> So with the network performing as well as my lengthy experience with
> IPoIB led me to believe it would, what else is there to look at?
> The storage nodes perform just as expected, as indicated by the local fio
> tests.
>
> That pretty much leaves only Ceph/RBD to look at, and I'm not really
> sure what experiments I should run on that. ^o^
>
> Regards,
>
> Christian
>
> > -Greg
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Software Engineer #42 @ http://inktank.com | http://ceph.com
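For anyone wanting to try the tests Greg suggests at the top of the thread, a
rough sketch follows. The pool name, OSD id, sizes, and durations are
placeholders rather than values from this thread, and the exact argument order
of the OSD bench command is from memory, so check your version's built-in help
before relying on it.

# OSD bench: write 1 GB in 4 KB IOs directly on osd.0, bypassing the client/RBD path
ceph tell osd.0 bench 1073741824 4096

# RADOS bench: 60 seconds of 4 KB object writes with 32 concurrent ops against the "rbd" pool
rados bench -p rbd 60 write -b 4096 -t 32

# Repeat with larger block sizes (e.g. 64 KB, 4 MB) to see where the bottleneck shifts
rados bench -p rbd 60 write -b 4194304 -t 32

Comparing the OSD bench result (no network, no RBD) with rados bench from a
client and with fio on RBD should narrow down roughly where the extra latency
is being added.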