Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:

> Hi Christian,
>
> I missed this thread, haven't been reading the list that well the last
> weeks.
>
> You already know my setup, since we discussed it in an earlier thread.
> I don't have a fast backing store, but I see the slow IOPS when doing
> randwrite inside the VM, with rbd cache. Still running dumpling here
> though.
>
Nods, I do recall that thread.

> A thought struck me that I could test with a pool that consists of
> OSDs that have tmpfs-based disks. I think I have a bit more latency
> than your IPoIB, but I've pushed 100k IOPS with the same network
> devices before. This would verify whether the problem is with the
> journal disks. I'll also try to run the journal devices in tmpfs as
> well, as that would test purely Ceph itself.
>
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the
actual filestore at around 5%) I'd expect Ceph to be the culprit.

> I'll get back to you with the results, hopefully I'll manage to get
> them done during this night.
>
Looking forward to that. ^^
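
In case it saves you some fiddling, here is a rough sketch of how the
journal of a single OSD could be pointed at tmpfs for such a throw-away
test. The OSD id, mount point and sizes are just placeholders, and this
is obviously only for a scratch cluster you don't mind losing:

---
# RAM-backed mount for the disposable journal (example values only)
mkdir -p /mnt/tmpfs-journal
mount -t tmpfs -o size=2G tmpfs /mnt/tmpfs-journal

# stop the OSD and retire its current journal cleanly
/etc/init.d/ceph stop osd.0
ceph-osd -i 0 --flush-journal

# ceph.conf, in the [osd.0] section:
#   osd journal = /mnt/tmpfs-journal/journal
#   osd journal size = 1024

# create the new journal and bring the OSD back
ceph-osd -i 0 --mkjournal
/etc/init.d/ceph start osd.0
---

A tmpfs-backed filestore for a pure test pool could be prepared along
the same lines, with the OSD data directory living on another tmpfs
mount.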
Christian

> Cheers,
> Josef
>
> On 13/05/14 11:03, Christian Balzer wrote:
> > I'm clearly talking to myself, but whatever.
> >
> > For Greg, I've played with all the pertinent journal and filestore
> > options and TCP nodelay, no changes at all.
> >
> > Is there anybody on this ML who's running a Ceph cluster with a fast
> > network and FAST filestore, so like me with a big HW cache in front
> > of RAIDs/JBODs or using SSDs for final storage?
> >
> > If so, what results do you get out of the fio statement below per
> > OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
> > OSD, which is of course vastly faster than the normal individual
> > HDDs could do.
> >
> > So I'm wondering if I'm hitting some inherent limitation of how fast
> > a single OSD (as in the software) can handle IOPS, given that
> > everything else has been ruled out from where I stand.
> >
> > This would also explain why none of the option changes or the use of
> > RBD caching has any measurable effect in the test case below.
> > As in, a slow OSD aka single HDD with journal on the same disk would
> > clearly benefit from even the small 32MB standard RBD cache, while
> > in my test case the only time the caching becomes noticeable is if I
> > increase the cache size to something larger than the test data size.
> > ^o^
> >
> > On the other hand, if people here regularly get thousands or tens of
> > thousands of IOPS per OSD with the appropriate HW, I'm stumped.
> >
> > Christian
> >
> > On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
> >
> >> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >>
> >>> Oh, I didn't notice that. I bet you aren't getting the expected
> >>> throughput on the RAID array with OSD access patterns, and that's
> >>> applying back pressure on the journal.
> >>>
> >> In the "a picture is worth a thousand words" tradition, I give you
> >> this iostat -x output taken during a fio run:
> >>
> >> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>           50.82    0.00   19.43    0.17    0.00   29.58
> >>
> >> Device: rrqm/s wrqm/s  r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> >> sda       0.00  51.50 0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
> >> sdb       0.00   0.00 0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
> >> sdc       0.00   5.00 0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
> >> sdd       0.00   6.50 0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
> >>
> >> The %user CPU utilization is pretty much entirely the 2 OSD
> >> processes; note the nearly complete absence of iowait.
> >>
> >> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
> >> Look at these numbers: the lack of queues, the low wait and service
> >> times (in ms) plus the overall utilization.
> >>
> >> The only conclusion I can draw from these numbers and the network
> >> results below is that the latency happens within the OSD processes.
> >>
> >> Regards,
> >>
> >> Christian
> >>
> >>> When I suggested other tests, I meant with and without Ceph. One
> >>> particular one is OSD bench. That should be interesting to try at
> >>> a variety of block sizes. You could also try running RADOS bench
> >>> and smalliobench at a few different sizes.
> >>> -Greg
> >>>
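
For what it's worth, the bench runs Greg suggests above could look
roughly like the following; the pool name and the sizes are only
examples, so adjust them to taste:

---
# raw OSD bench with the defaults (1 GB written in 4 MB chunks);
# smaller totals/write sizes can be passed as extra arguments
ceph tell osd.0 bench

# RADOS bench against a scratch pool: 60 s of 4 KB writes, 16 in flight
rados bench -p testpool 60 write -b 4096 -t 16
---

Repeating the rados bench run with -b 65536 and -b 4194304 should show
whether per-op overhead or raw bandwidth is the limiting factor.
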
> >>> On Wednesday, May 7, 2014, Alexandre DERUMIER
> >>> <aderumier at odiso.com> wrote:
> >>>
> >>>> Hi Christian,
> >>>>
> >>>> Have you tried without RAID6, to get more OSDs?
> >>>> (How many disks do you have behind the RAID6?)
> >>>>
> >>>> Also, I know that direct IOs can be quite slow with ceph,
> >>>>
> >>>> maybe you can try without --direct=1
> >>>>
> >>>> and also enable rbd_cache
> >>>>
> >>>> ceph.conf
> >>>> [client]
> >>>> rbd cache = true
> >>>>
> >>>> ----- Original Message -----
> >>>>
> >>>> From: "Christian Balzer" <chibi at gol.com>
> >>>> To: "Gregory Farnum" <greg at inktank.com>,
> >>>> ceph-users at lists.ceph.com
> >>>> Sent: Thursday, 8 May 2014 04:49:16
> >>>> Subject: Re: Slow IOPS on RBD compared to journal and
> >>>> backing devices
> >>>>
> >>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> >>>>
> >>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> >>>>> <chibi at gol.com> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> >>>>>> The journals are on (separate) DC 3700s, the actual OSDs are
> >>>>>> RAID6 behind an Areca 1882 with 4GB of cache.
> >>>>>>
> >>>>>> Running this fio:
> >>>>>>
> >>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> >>>>>> --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> >>>>>> --iodepth=128
> >>>>>>
> >>>>>> results in:
> >>>>>>
> >>>>>> 30k IOPS on the journal SSD (as expected)
> >>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no
> >>>>>> surprise there)
> >>>>>> 3200 IOPS from a VM using userspace RBD
> >>>>>> 2900 IOPS from a host kernelspace mounted RBD
> >>>>>>
> >>>>>> When running the fio from the VM RBD the utilization of the
> >>>>>> journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> >>>>>> (1500 IOPS after some obvious merging).
> >>>>>> The OSD processes are quite busy, reading well over 200% on
> >>>>>> atop, but the system is not CPU or otherwise resource starved
> >>>>>> at that moment.
> >>>>>>
> >>>>>> Running multiple instances of this test from several VMs on
> >>>>>> different hosts changes nothing, as in the aggregated IOPS for
> >>>>>> the whole cluster will still be around 3200 IOPS.
> >>>>>>
> >>>>>> Now clearly RBD has to deal with latency here, but the network
> >>>>>> is IPoIB with the associated low latency and the journal SSDs
> >>>>>> are the (consistently) fastest ones around.
> >>>>>>
> >>>>>> I guess what I am wondering about is if this is normal and to
> >>>>>> be expected, or if not, where all that potential performance
> >>>>>> got lost.
> >>>>> Hmm, with 128 IOs at a time (I believe I'm reading that
> >>>>> correctly?)
> >>>> Yes, but going down to 32 doesn't change things one iota.
> >>>> Also note the multiple instances I mention up there, so that
> >>>> would be 256 IOs at a time, coming from different hosts over
> >>>> different links, and nothing changes.
> >>>>
> >>>>> that's about 40ms of latency per op (for userspace RBD), which
> >>>>> seems awfully long. You should check what your client-side
> >>>>> objecter settings are; it might be limiting you to fewer
> >>>>> outstanding ops than that.
> >>>> Googling for client-side objecter gives a few hits on ceph-devel
> >>>> and bugs and nothing at all as far as configuration options are
> >>>> concerned. Care to enlighten me where one can find those?
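
For the archive: the client-side knobs in question appear to be the
objecter throttles, which can be raised in the [client] section of
ceph.conf. The values below are merely an illustration, not a tuning
recommendation:

---
[client]
    # in-flight throttles of the client objecter
    # (defaults: 1024 ops and 100 MB in flight)
    objecter inflight ops = 2048
    objecter inflight op bytes = 268435456
---

With 128 (or even 256) outstanding 4 KB IOs neither default should be
the bottleneck here, which again points at latency inside the OSDs
rather than client-side throttling.
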
> >>>>
> >>>> Also note the kernelspace (3.13 if it matters) speed, which is
> >>>> very much in the same (junior league) ballpark.
> >>>>
> >>>>> If it's available to you, testing with Firefly or even master
> >>>>> would be interesting; there's some performance work that should
> >>>>> reduce latencies.
> >>>>>
> >>>> Not an option, this is going into production next week.
> >>>>
> >>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
> >>>>> certainly doesn't require 40ms/op, so you should probably run a
> >>>>> wider array of experiments to try and figure out where it's
> >>>>> coming from.
> >>>> I think we can rule out the network, NPtcp gives me:
> >>>> ---
> >>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> >>>> ---
> >>>>
> >>>> For comparison, at about 512KB it reaches maximum throughput and
> >>>> still isn't that laggy:
> >>>> ---
> >>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> >>>> ---
> >>>>
> >>>> So with the network performing as well as my lengthy experience
> >>>> with IPoIB led me to believe, what else is there to look at?
> >>>> The storage nodes perform just as expected, indicated by the
> >>>> local fio tests.
> >>>>
> >>>> That pretty much leaves only Ceph/RBD to look at and I'm not
> >>>> really sure what experiments I should run on that. ^o^
> >>>>
> >>>> Regards,
> >>>>
> >>>> Christian
> >>>>
> >>>>> -Greg
> >>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>>>
> >>>>
> >>>> --
> >>>> Christian Balzer        Network/Systems Engineer
> >>>> chibi at gol.com           Global OnLine Japan/Fusion Communications
> >>>> http://www.gol.com/
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users at lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/