Slow IOPS on RBD compared to journal and backing devices

Hi,

Yeah, running with MTU 9000 here, but that test was with sequential writes.

Just ran rbd -p shared-1 bench-write test --io-size $((32*1024*1024))
--io-pattern rand

The cluster itself showed 700MB/s write (3x replicas), but the test reported
just 45MB/s. But I think rbd bench-write is a little bit broken ;)
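
If you want a sanity check outside of rbd bench-write, a plain RADOS write
bench at the same IO size should be comparable; something along these lines
(60 seconds and 16 concurrent ops picked arbitrarily, untested as written
here):

rados bench -p shared-1 60 write -b $((32*1024*1024)) -t 16

That takes librbd out of the picture entirely, so if the numbers still
diverge this much it's not just the rbd tool.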

Cheers,
Josef

On 14/05/14 15:23, German Anders wrote:
> Hi Josef,     
> Thanks a lot for the quick answer.
>
> yes 32M and rand writes
>
> And also, do you get those values with an MTU of 9000, I guess, or with
> the traditional and beloved MTU of 1500?
>
>  
>
> *German Anders*
> /Field Storage Support Engineer/
>
> Despegar.com - IT Team
>
>> --- Original message ---
>> *Subject:* Re: Slow IOPS on RBD compared to journal and backing devices
>> *From:* Josef Johansson <josef at oderland.se>
>> *To:* <ceph-users at lists.ceph.com>
>> *Date:* Wednesday, 14/05/2014 10:10
>>
>> Hi,
>>
>> On 14/05/14 14:45, German Anders wrote:
>>
>>     I forgot to mention, of course on a 10GbE network
>>      
>>      
>>
>>     *German Anders*
>>     /Field Storage Support Engineer/
>>
>>     Despegar.com - IT Team
>>
>>
>>         --- Original message ---
>>         *Subject:* Re: Slow IOPS on RBD compared to journal and backing devices
>>         *From:* German Anders <ganders at despegar.com>
>>         *To:* Christian Balzer <chibi at gol.com>
>>         *Cc:* <ceph-users at lists.ceph.com>
>>         *Date:* Wednesday, 14/05/2014 09:41
>>
>>         Is anyone getting write (rw) throughput on RBD of 600MB/s
>>         or more with a block size of 32768k?
>>          
>>
>> Is that 32M then?
>> Sequential or randwrite?
>>
>> I get about those speeds when doing (1M block size) buffered writes
>> from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
>>
>> Cheers,
>> Josef
>>
>>          
>>
>>         *German Anders*
>>         /Field Storage Support Engineer/
>>
>>         Despegar.com - IT Team
>>
>>
>>             --- Original message ---
>>             *Subject:* Re: Slow IOPS on RBD compared to journal and backing devices
>>             *From:* Christian Balzer <chibi at gol.com>
>>             *To:* Josef Johansson <josef at oderland.se>
>>             *Cc:* <ceph-users at lists.ceph.com>
>>             *Date:* Wednesday, 14/05/2014 09:33
>>
>>
>>             Hello!
>>
>>             On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>>                 Hi Christian,
>>
>>                 I missed this thread, haven't been reading the list
>>                 that closely the last few weeks.
>>
>>                 You already know my setup, since we discussed it in
>>                 an earlier thread. I
>>                 don't have a fast backing store, but I see the slow
>>                 IOPS when doing
>>                 randwrite inside the VM, with rbd cache. Still
>>                 running dumpling here
>>                 though.
>>
>>             Nods, I do recall that thread.
>>
>>                 A thought struck me that I could test with a pool that
>>                 consists of OSDs that have tmpfs-based disks. I think I
>>                 have a bit more latency than your IPoIB, but I've pushed
>>                 100k IOPS with the same network devices before. This
>>                 would verify whether the problem is with the journal
>>                 disks. I'll also try to run the journal devices in tmpfs
>>                 as well, as that would test purely Ceph itself.
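>>
>>                 Roughly what I have in mind for the journal part, untested
>>                 and with made-up paths and sizes, something like:
>>
>>                 # osd.0 stopped first, of course
>>                 ceph-osd -i 0 --flush-journal
>>                 mount -t tmpfs -o size=8G tmpfs /mnt/journal-tmpfs
>>                 ln -sf /mnt/journal-tmpfs/journal-0 /var/lib/ceph/osd/ceph-0/journal
>>                 ceph-osd -i 0 --mkjournal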
>>
>>             That would be interesting indeed.
>>             Given what I've seen (with the journal at 20% utilization
>>             and the actual
>>             filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>>                 I'll get back to you with the results, hopefully I'll
>>                 manage to get them
>>                 done during this night.
>>
>>             Looking forward to that. ^^
>>
>>
>>             Christian
>>
>>                 Cheers,
>>                 Josef
>>
>>                 On 13/05/14 11:03, Christian Balzer wrote:
>>
>>                     I'm clearly talking to myself, but whatever.
>>
>>                     For Greg, I've played with all the pertinent
>>                     journal and filestore
>>                     options and TCP nodelay, no changes at all.
>>
>>                     Is there anybody on this ML who's running a Ceph
>>                     cluster with a fast network and a FAST filestore,
>>                     i.e. like me with a big HW cache in front of
>>                     RAIDs/JBODs, or using SSDs for final storage?
>>
>>                     If so, what results do you get out of the fio
>>                     statement below per OSD?
>>                     In my case with 4 OSDs and 3200 IOPS that's about
>>                     800 IOPS per OSD,
>>                     which is of course vastly faster than the normal
>>                     individual HDDs could
>>                     do.
>>
>>                     So I'm wondering if I'm hitting some inherent
>>                     limitation of how fast a
>>                     single OSD (as in the software) can handle IOPS,
>>                     given that everything
>>                     else has been ruled out from where I stand.
>>
>>                     This would also explain why none of the option
>>                     changes or the use of
>>                     RBD caching has any measurable effect in the test
>>                     case below.
>>                     As in, a slow OSD aka single HDD with journal on
>>                     the same disk would
>>                     clearly benefit from even the small 32MB standard
>>                     RBD cache, while in
>>                     my test case the only time the caching becomes
>>                     noticeable is if I
>>                     increase the cache size to something larger than
>>                     the test data size.
>>                     ^o^
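>>
>>                     For the record, those cache experiments were just the
>>                     usual client-side knobs in ceph.conf, with the size
>>                     picked to dwarf the 400m test file; the exact values
>>                     here are from memory, so treat them as a sketch:
>>
>>                     [client]
>>                     rbd cache = true
>>                     rbd cache size = 536870912
>>                     rbd cache max dirty = 402653184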
>>
>>                     On the other hand if people here regularly get
>>                     thousands or tens of
>>                     thousands of IOPS per OSD with the appropriate HW,
>>                     I'm stumped.
>>
>>                     Christian
>>
>>                     On Fri, 9 May 2014 11:01:26 +0900 Christian
>>                     Balzer wrote:
>>
>>                         On Wed, 7 May 2014 22:13:53 -0700 Gregory
>>                         Farnum wrote:
>>
>>                             Oh, I didn't notice that. I bet you
>>                             aren't getting the expected
>>                             throughput on the RAID array with OSD
>>                             access patterns, and that's
>>                             applying back pressure on the journal.
>>
>>                         In the "a picture is worth a thousand words"
>>                         tradition, I give you
>>                         this iostat -x output taken during a fio run:
>>
>>                         avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>                                   50.82   0.00    19.43     0.17    0.00  29.58
>>
>>                         Device: rrqm/s wrqm/s  r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>                         sda       0.00  51.50 0.00 1633.50    0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>                         sdb       0.00   0.00 0.00 1240.50    0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>                         sdc       0.00   5.00 0.00 2468.50    0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>                         sdd       0.00   6.50 0.00 1913.00    0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>
>>                         The %user CPU utilization is pretty much
>>                         entirely the 2 OSD processes,
>>                         note the nearly complete absence of iowait.
>>
>>                         sda and sdb are the OSD RAIDs, sdc and sdd
>>                         are the journal SSDs.
>>                         Look at these numbers: the lack of queues,
>>                         the low wait and service
>>                         times (these are in ms), plus the overall utilization.
>>
>>                         The only conclusion I can draw from these
>>                         numbers and the network
>>                         results below is that the latency happens
>>                         within the OSD processes.
>>
>>                         Regards,
>>
>>                         Christian
>>
>>                             When I suggested other tests, I meant
>>                             with and without Ceph. One
>>                             particular one is OSD bench. That should
>>                             be interesting to try at a
>>                             variety of block sizes. You could also
>>                             try running RADOS bench and
>>                             smalliobench at a few different sizes.
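>>
>>                             Concretely, something like the following is
>>                             what I have in mind; the pool name, OSD id,
>>                             and sizes are just placeholders to adjust to
>>                             your setup:
>>
>>                             ceph tell osd.0 bench
>>                             rados bench -p rbd 60 write -b 4096 -t 32
>>                             rados bench -p rbd 60 write -b 4194304 -t 32
>>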
>>                             -Greg
>>
>>                             On Wednesday, May 7, 2014, Alexandre
>>                             DERUMIER <aderumier at odiso.com>
>>                             wrote:
>>
>>                                 Hi Christian,
>>
>>                                 Have you tried without RAID6, to
>>                                 have more OSDs?
>>                                 (How many disks do you have behind the
>>                                 RAID6?)
>>
>>
>>                                 Also, I know that direct IOs can be
>>                                 quite slow with Ceph,
>>
>>                                 maybe you can try without --direct=1
>>
>>                                 and also enable rbd_cache
>>
>>                                 ceph.conf
>>                                 [client]
>>                                 rbd cache = true
>>
>>
>>
>>
>>                                 ----- Original message -----
>>
>>                                 From: "Christian Balzer" <chibi at gol.com>
>>                                 To: "Gregory Farnum" <greg at inktank.com>,
>>                                 ceph-users at lists.ceph.com
>>                                 Sent: Thursday, 8 May 2014 04:49:16
>>                                 Subject: Re: Slow IOPS on RBD compared to
>>                                 journal and backing devices
>>
>>                                 On Wed, 7 May 2014 18:37:48 -0700
>>                                 Gregory Farnum wrote:
>>
>>                                     On Wed, May 7, 2014 at 5:57 PM,
>>                                     Christian Balzer
>>                                     <chibi at gol.com>
>>
>>                                 wrote:
>>
>>                                         Hello,
>>
>>                                         ceph 0.72 on Debian Jessie, 2
>>                                         storage nodes with 2 OSDs
>>                                         each. The
>>                                         journals are on (separate) DC
>>                                         3700s, the actual OSDs are RAID6
>>                                         behind an Areca 1882 with 4GB
>>                                         of cache.
>>
>>                                         Running this fio:
>>
>>                                         fio --size=400m
>>                                         --ioengine=libaio
>>                                         --invalidate=1 --direct=1
>>                                         --numjobs=1 --rw=randwrite
>>                                         --name=fiojob --blocksize=4k
>>                                         --iodepth=128
>>
>>                                         results in:
>>
>>                                         30k IOPS on the journal SSD (as expected)
>>                                         110k IOPS on the OSD (it fits neatly into
>>                                         the cache, no surprise there)
>>                                         3200 IOPS from a VM using userspace RBD
>>                                         2900 IOPS from a host kernelspace mounted RBD
>>
>>                                         When running the fio from the
>>                                         VM RBD the utilization of the
>>                                         journals is about 20% (2400
>>                                         IOPS) and the OSDs are bored
>>                                         at 2%
>>                                         (1500 IOPS after some obvious
>>                                         merging).
>>                                         The OSD processes are quite
>>                                         busy, reading well over 200%
>>                                         on atop,
>>                                         but the system is not CPU or
>>                                         otherwise resource starved at
>>                                         that
>>                                         moment.
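>>
>>                                         For reference, the kernelspace
>>                                         number came from mapping the image
>>                                         on a host and pointing the same fio
>>                                         job at the block device, roughly
>>                                         like this (pool/image names made up
>>                                         for the example):
>>
>>                                         rbd map rbd/fiotest
>>                                         fio --filename=/dev/rbd0 --size=400m \
>>                                             --ioengine=libaio --invalidate=1 \
>>                                             --direct=1 --numjobs=1 --rw=randwrite \
>>                                             --name=fiojob --blocksize=4k --iodepth=128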
>>
>>                                         Running multiple instances of
>>                                         this test from several VMs on
>>                                         different hosts changes
>>                                         nothing, as in the aggregated
>>                                         IOPS for
>>                                         the whole cluster will still
>>                                         be around 3200 IOPS.
>>
>>                                         Now clearly RBD has to deal
>>                                         with latency here, but the
>>                                         network is
>>                                         IPoIB with the associated low
>>                                         latency and the journal SSDs are
>>                                         the (consistently) fastest
>>                                         ones around.
>>
>>                                         I guess what I am wondering
>>                                         about is if this is normal
>>                                         and to be
>>                                         expected or if not where all
>>                                         that potential performance
>>                                         got lost.
>>
>>                                     Hmm, with 128 IOs at a time (I
>>                                     believe I'm reading that correctly?)
>>
>>                                 Yes, but going down to 32 doesn't
>>                                 change things one iota.
>>                                 Also note the multiple instances I
>>                                 mention up there, so that would
>>                                 be 256 IOs at a time, coming from
>>                                 different hosts over different
>>                                 links and nothing changes.
>>
>>                                     that's about 40ms of latency per
>>                                     op (for userspace RBD), which
>>                                     seems awfully long. You should
>>                                     check what your client-side objecter
>>                                     settings are; it might be
>>                                     limiting you to fewer outstanding ops
>>                                     than that.
>>
>>                                 Googling for client-side objecter
>>                                 gives a few hits on ceph devel and
>>                                 bugs and nothing at all as far as
>>                                 configuration options are
>>                                 concerned. Care to enlighten me where
>>                                 one can find those?
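>>
>>                                 Grepping config_opts.h, the most likely
>>                                 candidates I can see are these; the values
>>                                 are the defaults as I read them, so take
>>                                 this as a guess:
>>
>>                                 [client]
>>                                 objecter inflight ops = 1024
>>                                 objecter inflight op bytes = 104857600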
>>
>>                                 Also note the kernelspace (3.13 if it
>>                                 matters) speed, which is very
>>                                 much in the same (junior league)
>>                                 ballpark.
>>
>>                                     If
>>                                     it's available to you, testing
>>                                     with Firefly or even master would be
>>                                     interesting; there's some
>>                                     performance work that should reduce
>>                                     latencies.
>>
>>                                 Not an option, this is going into
>>                                 production next week.
>>
>>                                     But a well-tuned (or even
>>                                     default-tuned, I thought) Ceph
>>                                     cluster
>>                                     certainly doesn't require
>>                                     40ms/op, so you should probably run a
>>                                     wider array of experiments to try
>>                                     and figure out where it's coming
>>                                     from.
>>
>>                                 I think we can rule out the network,
>>                                 NPtcp gives me:
>>                                 ---
>>                                 56: 4096 bytes 1546 times --> 979.22
>>                                 Mbps in 31.91 usec
>>                                 ---
>>
>>                                 For comparison at about 512KB it
>>                                 reaches maximum throughput and
>>                                 still isn't that laggy:
>>                                 ---
>>                                 98: 524288 bytes 121 times -->
>>                                 9700.57 Mbps in 412.35 usec
>>                                 ---
>>
>>                                 So with the network performing as
>>                                 well as my lengthy experience with
>>                                 IPoIB led me to expect, what else is
>>                                 there to look at?
>>                                 The storage nodes perform just as
>>                                 expected, indicated by the local
>>                                 fio tests.
>>
>>                                 That pretty much leaves only Ceph/RBD
>>                                 to look at and I'm not really
>>                                 sure what experiments I should run on
>>                                 that. ^o^
>>
>>                                 Regards,
>>
>>                                 Christian
>>
>>                                     -Greg
>>                                     Software Engineer #42 @
>>                                     http://inktank.com | http://ceph.com
>>
>>
>>                                 --
>>                                 Christian Balzer Network/Systems Engineer
>>                                 chibi at gol.com <javascript:;> Global
>>                                 OnLine Japan/Fusion
>>                                 Communications http://www.gol.com/
>>
>>
>>
>>
>>
>>
>>
>>
>>             -- 
>>             Christian Balzer Network/Systems Engineer
>>             chibi at gol.com Global OnLine Japan/Fusion Communications
>>             http://www.gol.com/
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


