Slow IOPS on RBD compared to journal and backing devices

josef@xxxxxxxxxxx (Josef Johansson) · Wed, 14 May 2014 15:09:53 +0200

Hi,

On 14/05/14 14:45, German Anders wrote:
> I forgot to mention, of course on a 10GbE network
>
> *German Anders*
> /Field Storage Support Engineer/
>
> Despegar.com - IT Team
>
>> --- Original message ---
>> *Subject:* Re: Slow IOPS on RBD compared to journal
>> and backing devices
>> *From:* German Anders <ganders at despegar.com>
>> *To:* Christian Balzer <chibi at gol.com>
>> *Cc:* <ceph-users at lists.ceph.com>
>> *Date:* Wednesday, 14/05/2014 09:41
>>
>> Has anyone achieved a throughput of 600MB/s or more on RBD
>> (rw) with a block size of 32768k?
>>  
Is that 32M then?
Sequential or randwrite?

I get about those speeds when doing buffered writes (1M block size) from
within a VM on 20GbE. The cluster maxes out at about 900MB/s.
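
To make the block-size question concrete, here is a small sketch of the arithmetic. It assumes the "k" suffix means 1024 bytes and treats MB/s as MiB/s; both are assumptions, and the function names are only illustrative.

```python
KIB = 1024  # assumption: "k" means 1024 bytes


def block_size_mib(kib: int) -> float:
    """Convert a block size given in KiB to MiB."""
    return kib * KIB / (1024 * 1024)


def ops_per_second(mb_per_s: float, block_kib: int) -> float:
    """Ops/s needed to sustain mb_per_s (taken as MiB/s) at this block size."""
    return mb_per_s / block_size_mib(block_kib)


print(block_size_mib(32768))        # block size from the question, in MiB
print(ops_per_second(600, 32768))   # ops/s for 600MB/s at that block size
print(ops_per_second(900, 1024))    # ops/s for 900MB/s at 1M blocks
```

At block sizes this large the op rate is tiny, so such a test measures bandwidth rather than the small-write IOPS discussed further down the thread.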

Cheers,
Josef
>>  
>>
>>     --- Original message ---
>>     *Subject:* Re: Slow IOPS on RBD compared to journal
>>     and backing devices
>>     *From:* Christian Balzer <chibi at gol.com>
>>     *To:* Josef Johansson <josef at oderland.se>
>>     *Cc:* <ceph-users at lists.ceph.com>
>>     *Date:* Wednesday, 14/05/2014 09:33
>>
>>
>>     Hello!
>>
>>     On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>>         Hi Christian,
>>
>>         I missed this thread; I haven't been reading the list that
>>         closely these last few weeks.
>>
>>         You already know my setup, since we discussed it in an
>>         earlier thread. I
>>         don't have a fast backing store, but I see the slow IOPS when
>>         doing
>>         randwrite inside the VM, with rbd cache. Still running
>>         dumpling here
>>         though.
>>
>>     Nods, I do recall that thread.
>>
>>         A thought struck me: I could test with a pool that consists
>>         of OSDs that have tmpfs-based disks. I think I have a bit more
>>         latency than your IPoIB, but I've pushed 100k IOPS with the
>>         same network devices before. This would verify whether the
>>         problem is with the journal disks. I'll also try to run the
>>         journal devices in tmpfs, as that would test purely Ceph
>>         itself.
>>
>>     That would be interesting indeed.
>>     Given what I've seen (with the journal at 20% utilization and the
>>     actual filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>>         I'll get back to you with the results, hopefully I'll manage
>>         to get them
>>         done during this night.
>>
>>     Looking forward to that. ^^
>>
>>
>>     Christian
>>
>>         Cheers,
>>         Josef
>>
>>         On 13/05/14 11:03, Christian Balzer wrote:
>>
>>             I'm clearly talking to myself, but whatever.
>>
>>             For Greg, I've played with all the pertinent journal and
>>             filestore
>>             options and TCP nodelay, no changes at all.
>>
>>             Is there anybody on this ML who's running a Ceph cluster
>>             with a fast
>>             network and FAST filestore, so like me with a big HW
>>             cache in front of
>>             a RAID/JBODs or using SSDs for final storage?
>>
>>             If so, what results do you get out of the fio statement
>>             below per OSD?
>>             In my case with 4 OSDs and 3200 IOPS that's about 800
>>             IOPS per OSD, which is of course vastly faster than
>>             normal individual HDDs could do.
>>
>>             So I'm wondering if I'm hitting some inherent limitation
>>             of how fast a
>>             single OSD (as in the software) can handle IOPS, given
>>             that everything
>>             else has been ruled out from where I stand.
>>
>>             This would also explain why none of the option changes or
>>             the use of
>>             RBD caching has any measurable effect in the test case
>>             below.
>>             As in, a slow OSD aka single HDD with journal on the same
>>             disk would
>>             clearly benefit from even the small 32MB standard RBD
>>             cache, while in
>>             my test case the only time the caching becomes noticeable
>>             is if I
>>             increase the cache size to something larger than the test
>>             data size.
>>             ^o^
>>
>>             On the other hand, if people here regularly get thousands
>>             or tens of thousands of IOPS per OSD with the appropriate
>>             HW, I'm stumped.
>>
>>             Christian
>>
>>             On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>
>>                 On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>
>>                     Oh, I didn't notice that. I bet you aren't
>>                     getting the expected
>>                     throughput on the RAID array with OSD access
>>                     patterns, and that's
>>                     applying back pressure on the journal.
>>
>>                 In the "a picture is worth a thousand words"
>>                 tradition, I give you this iostat -x output taken
>>                 during a fio run:
>>
>>                 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>                           50.82    0.00   19.43    0.17    0.00   29.58
>>
>>                 Device: rrqm/s wrqm/s  r/s     w/s rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>                 sda       0.00  51.50 0.00 1633.50  0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>                 sdb       0.00   0.00 0.00 1240.50  0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>                 sdc       0.00   5.00 0.00 2468.50  0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>                 sdd       0.00   6.50 0.00 1913.00  0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>
>>                 The %user CPU utilization is pretty much entirely the
>>                 2 OSD processes,
>>                 note the nearly complete absence of iowait.
>>
>>                 sda and sdb are the OSD RAIDs, sdc and sdd are the
>>                 journal SSDs.
>>                 Look at these numbers, the lack of queues, the low
>>                 wait and service
>>                 times (this is in ms) plus overall utilization.
>>
>>                 The only conclusion I can draw from these numbers and
>>                 the network
>>                 results below is that the latency happens within the
>>                 OSD processes.
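
For anyone who wants to check the iostat figures above mechanically, a minimal sketch follows. The rows are copied from the posted output; the parsing helper and its name are illustrative only, not part of any Ceph or sysstat tooling.

```python
# Column names and rows copied from the iostat -x output quoted above.
HEADER = ("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util").split()
ROWS = """\
sda 0.00 51.50 0.00 1633.50 0.00 7460.00 9.13 0.18 0.11 0.00 0.11 0.01 1.40
sdb 0.00 0.00 0.00 1240.50 0.00 5244.00 8.45 0.30 0.25 0.00 0.25 0.02 2.00
sdc 0.00 5.00 0.00 2468.50 0.00 13419.00 10.87 0.24 0.10 0.00 0.10 0.09 22.00
sdd 0.00 6.50 0.00 1913.00 0.00 10313.00 10.78 0.20 0.10 0.00 0.10 0.09 16.60
"""


def parse(rows: str) -> dict:
    """Map device name -> {column: value} for each iostat row."""
    stats = {}
    for line in rows.splitlines():
        dev, *vals = line.split()
        stats[dev] = dict(zip(HEADER, map(float, vals)))
    return stats


stats = parse(ROWS)
for dev, s in stats.items():
    kb_per_write = s["wkB/s"] / s["w/s"]  # average write size in kB
    print(f"{dev}: {s['w/s']:.0f} w/s, {kb_per_write:.1f} kB/write, "
          f"w_await {s['w_await']} ms, util {s['%util']}%")
```

The worst write wait in these rows is a fraction of a millisecond and the busiest device sits at 22% utilization, which is what the conclusion above rests on.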
>>
>>                 Regards,
>>
>>                 Christian
>>
>>                     When I suggested other tests, I meant with and
>>                     without Ceph. One
>>                     particular one is OSD bench. That should be
>>                     interesting to try at a
>>                     variety of block sizes. You could also try running
>>                     RADOS bench and
>>                     smalliobench at a few different sizes.
>>                     -Greg
>>
>>                     On Wednesday, May 7, 2014, Alexandre DERUMIER
>>                     <aderumier at odiso.com>
>>                     wrote:
>>
>>                         Hi Christian,
>>
>>                         Have you tried without RAID6, to have more
>>                         OSDs?
>>                         (How many disks do you have behind the RAID6?)
>>
>>
>>                         Also, I know that direct I/O can be quite
>>                         slow with Ceph.
>>
>>                         Maybe you can try without --direct=1
>>
>>                         and also enable rbd_cache:
>>
>>                         ceph.conf
>>                         [client]
>>                         rbd cache = true
>>
>>
>>
>>
>>                         ----- Original message -----
>>
>>                         From: "Christian Balzer" <chibi at gol.com>
>>                         To: "Gregory Farnum" <greg at inktank.com>,
>>                         ceph-users at lists.ceph.com
>>                         Sent: Thursday, 8 May 2014 04:49:16
>>                         Subject: Re: Slow IOPS on RBD
>>                         compared to journal and
>>                         backing devices
>>
>>                         On Wed, 7 May 2014 18:37:48 -0700 Gregory
>>                         Farnum wrote:
>>
>>                             On Wed, May 7, 2014 at 5:57 PM, Christian
>>                             Balzer <chibi at gol.com> wrote:
>>
>>                                 Hello,
>>
>>                                 ceph 0.72 on Debian Jessie, 2 storage
>>                                 nodes with 2 OSDs each. The
>>                                 journals are on (separate) DC 3700s,
>>                                 the actual OSDs are RAID6
>>                                 behind an Areca 1882 with 4GB of cache.
>>
>>                                 Running this fio:
>>
>>                                 fio --size=400m --ioengine=libaio
>>                                 --invalidate=1 --direct=1
>>                                 --numjobs=1 --rw=randwrite
>>                                 --name=fiojob --blocksize=4k
>>                                 --iodepth=128
>>
>>                                 results in:
>>
>>                                 30k IOPS on the journal SSD (as expected)
>>                                 110k IOPS on the OSD (it fits neatly
>>                                 into the cache, no surprise there)
>>                                 3200 IOPS from a VM using userspace RBD
>>                                 2900 IOPS from a host kernelspace
>>                                 mounted RBD
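
If someone wants to repeat the fio run quoted above at several queue depths from a script, one possible sketch is below. The options are exactly those in the quoted command; the `fio_argv` helper and the override mechanism are hypothetical conveniences, and the snippet only builds the command line without running it.

```python
import shlex

# Options exactly as given in the fio command quoted above.
FIO_OPTS = {
    "size": "400m", "ioengine": "libaio", "invalidate": "1", "direct": "1",
    "numjobs": "1", "rw": "randwrite", "name": "fiojob", "blocksize": "4k",
    "iodepth": "128",
}


def fio_argv(overrides=None):
    """Build the fio argument list, optionally overriding some options."""
    opts = {**FIO_OPTS, **(overrides or {})}
    return ["fio"] + [f"--{key}={value}" for key, value in opts.items()]


print(shlex.join(fio_argv()))                   # the original command
print(shlex.join(fio_argv({"iodepth": "32"})))  # same test at iodepth 32
```

Passing the resulting list to `subprocess.run(..., check=True)` would execute it, assuming fio is installed on the machine where the script runs.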
>>
>>                                 When running the fio from the VM RBD
>>                                 the utilization of the
>>                                 journals is about 20% (2400 IOPS) and
>>                                 the OSDs are bored at 2%
>>                                 (1500 IOPS after some obvious merging).
>>                                 The OSD processes are quite busy,
>>                                 reading well over 200% on atop,
>>                                 but the system is not CPU or
>>                                 otherwise resource starved at that
>>                                 moment.
>>
>>                                 Running multiple instances of this
>>                                 test from several VMs on
>>                                 different hosts changes nothing, as
>>                                 in the aggregated IOPS for
>>                                 the whole cluster will still be
>>                                 around 3200 IOPS.
>>
>>                                 Now clearly RBD has to deal with
>>                                 latency here, but the network is
>>                                 IPoIB with the associated low latency
>>                                 and the journal SSDs are
>>                                 the (consistently) fastest ones around.
>>
>>                                 I guess what I am wondering about is
>>                                 if this is normal and to be
>>                                 expected or if not where all that
>>                                 potential performance got lost.
>>
>>                             Hmm, with 128 IOs at a time (I believe
>>                             I'm reading that correctly?)
>>
>>                         Yes, but going down to 32 doesn't change
>>                         things one iota.
>>                         Also note the multiple instances I mention up
>>                         there, so that would
>>                         be 256 IOs at a time, coming from different
>>                         hosts over different
>>                         links and nothing changes.
>>
>>                             that's about 40ms of latency per op (for
>>                             userspace RBD), which
>>                             seems awfully long. You should check what
>>                             your client-side objecter
>>                             settings are; it might be limiting you to
>>                             fewer outstanding ops
>>                             than that.
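
The 40ms estimate above follows from Little's law (mean latency = outstanding ops / throughput). A minimal sketch with the numbers from this thread, assuming the queue is actually kept full at the stated depth; the function name is illustrative.

```python
def mean_latency_ms(outstanding_ops: int, iops: float) -> float:
    """Little's law: L = lambda * W, so W = L / lambda (returned in ms)."""
    return outstanding_ops / iops * 1000.0


print(mean_latency_ms(128, 3200))  # userspace RBD at iodepth 128
print(mean_latency_ms(128, 2900))  # kernelspace RBD at iodepth 128
print(mean_latency_ms(32, 3200))   # same IOPS at iodepth 32
```

If the IOPS figure stays the same when the queue depth drops, the per-op latency implied by this formula drops with it, which hints that the limit is a throughput ceiling somewhere rather than a fixed per-op delay.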
>>
>>                         Googling for client-side objecter gives a few
>>                         hits on ceph devel and
>>                         bugs and nothing at all as far as
>>                         configuration options are
>>                         concerned. Care to enlighten me where one can
>>                         find those?
>>
>>                         Also note the kernelspace (3.13 if it
>>                         matters) speed, which is very
>>                         much in the same (junior league) ballpark.
>>
>>                             If
>>                             it's available to you, testing with
>>                             Firefly or even master would be
>>                             interesting --- there's some performance
>>                             work that should reduce
>>                             latencies.
>>
>>                         Not an option, this is going into production
>>                         next week.
>>
>>                             But a well-tuned (or even default-tuned,
>>                             I thought) Ceph cluster
>>                             certainly doesn't require 40ms/op, so you
>>                             should probably run a
>>                             wider array of experiments to try and
>>                             figure out where it's coming
>>                             from.
>>
>>                         I think we can rule out the network, NPtcp
>>                         gives me:
>>                         ---
>>                         56: 4096 bytes 1546 times --> 979.22 Mbps in
>>                         31.91 usec
>>                         ---
>>
>>                         For comparison at about 512KB it reaches
>>                         maximum throughput and
>>                         still isn't that laggy:
>>                         ---
>>                         98: 524288 bytes 121 times --> 9700.57 Mbps
>>                         in 412.35 usec
>>                         ---
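
To set the NPtcp lines above next to the 40ms figure, here is a sketch that recomputes throughput from message size and time. It assumes the tool's "Mbps" means 2^20 bits per second; that is only inferred from the quoted numbers coming out close under that assumption. The helper name is made up.

```python
def throughput_mbps(size_bytes: int, usec: float, mega: int = 2**20) -> float:
    """Bits transferred per second, expressed in units of `mega` bits."""
    return size_bytes * 8 / (usec * 1e-6) / mega


small = throughput_mbps(4096, 31.91)     # quoted above as 979.22 Mbps
large = throughput_mbps(524288, 412.35)  # quoted above as 9700.57 Mbps
print(round(small, 1), round(large, 1))

# Fraction of a 40 ms op that one 4096-byte message time represents.
print(31.91e-6 / 40e-3)
```

Even counting several message times per write, the network accounts for well under one percent of a 40ms op.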
>>
>>                         So with the network performing as well as my
>>                         lengthy experience with
>>                         IPoIB led me to believe, what else is there
>>                         to look at?
>>                         The storage nodes perform just as expected,
>>                         indicated by the local
>>                         fio tests.
>>
>>                         That pretty much leaves only Ceph/RBD to look
>>                         at and I'm not really
>>                         sure what experiments I should run on that. ^o^
>>
>>                         Regards,
>>
>>                         Christian
>>
>>                             -Greg
>>                             Software Engineer #42 @
>>                             http://inktank.com | http://ceph.com
>>
>>
>>                         --
>>                         Christian Balzer Network/Systems Engineer
>>                         chibi at gol.com Global OnLine
>>                         Japan/Fusion
>>                         Communications http://www.gol.com/
>>                         _______________________________________________
>>                         ceph-users mailing list
>>                         ceph-users at lists.ceph.com
>>                         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>