Hi Josef,

Thanks a lot for the quick answer. Yes, 32M and random writes. Also, do you get those values with an MTU of 9000, or with the traditional and beloved MTU of 1500?

German Anders
Field Storage Support Engineer
Despegar.com - IT Team

> --- Original message ---
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> From: Josef Johansson <josef at oderland.se>
> To: <ceph-users at lists.ceph.com>
> Date: Wednesday, 14/05/2014 10:10
>
> Hi,
>
> On 14/05/14 14:45, German Anders wrote:
>> I forgot to mention, of course on a 10GbE network.
>>
>> German Anders
>> Field Storage Support Engineer
>> Despegar.com - IT Team
>>
>>> --- Original message ---
>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>> From: German Anders <ganders at despegar.com>
>>> To: Christian Balzer <chibi at gol.com>
>>> Cc: <ceph-users at lists.ceph.com>
>>> Date: Wednesday, 14/05/2014 09:41
>>>
>>> Has anyone been able to get a throughput on RBD of 600MB/s or more on (rw) with a block size of 32768k?
>
> Is that 32M then? Sequential or randwrite?
>
> I get about those speeds when doing (1M block size) buffered writes from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
>
> Cheers,
> Josef
>
>>> German Anders
>>> Field Storage Support Engineer
>>> Despegar.com - IT Team
>>>
>>>> --- Original message ---
>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>> From: Christian Balzer <chibi at gol.com>
>>>> To: Josef Johansson <josef at oderland.se>
>>>> Cc: <ceph-users at lists.ceph.com>
>>>> Date: Wednesday, 14/05/2014 09:33
>>>>
>>>> Hello!
>>>>
>>>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>>>
>>>>> Hi Christian,
>>>>>
>>>>> I missed this thread, haven't been reading the list that well the last weeks.
>>>>>
>>>>> You already know my setup, since we discussed it in an earlier thread. I don't have a fast backing store, but I see the slow IOPS when doing randwrite inside the VM, with rbd cache. Still running dumpling here though.
>>>>>
>>>> Nods, I do recall that thread.
>>>>
>>>>> A thought struck me that I could test with a pool that consists of OSDs that have tmpfs-based disks; I think I have a bit more latency than your IPoIB, but I've pushed 100k IOPS with the same network devices before. This would verify if the problem is with the journal disks. I'll also try to run the journal devices in tmpfs as well, as it would test purely Ceph itself.
>>>>>
>>>> That would be interesting indeed.
>>>> Given what I've seen (with the journal at 20% utilization and the actual filestore at around 5%) I'd expect Ceph to be the culprit.
>>>>
>>>>> I'll get back to you with the results, hopefully I'll manage to get them done during this night.
>>>>>
>>>> Looking forward to that. ^^
>>>>
>>>> Christian
>>>>
>>>>> Cheers,
>>>>> Josef
>>>>>
>>>>> On 13/05/14 11:03, Christian Balzer wrote:
>>>>>
>>>>>> I'm clearly talking to myself, but whatever.
>>>>>>
>>>>>> For Greg, I've played with all the pertinent journal and filestore options and TCP nodelay, no changes at all.
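For reference, the "pertinent journal and filestore options" mentioned above would normally be set in the [osd] section of ceph.conf. The snippet below is only a rough sketch of that kind of tuning pass: the option names are the standard dumpling/emperor-era ones, but the values are illustrative assumptions, not the ones used on Christian's cluster.

[osd]
# journal write batching (values here are purely illustrative)
journal max write entries = 1000
journal max write bytes = 10485760
journal queue max ops = 3000
# filestore queueing and sync behaviour (again, illustrative values)
filestore queue max ops = 500
filestore max sync interval = 10
filestore op threads = 4
# disable Nagle on the messenger sockets (true is already the default)
ms tcp nodelay = true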
>>>>>> Is there anybody on this ML who's running a Ceph cluster with a fast network and a FAST filestore, so like me with a big HW cache in front of RAID/JBODs, or using SSDs for final storage?
>>>>>>
>>>>>> If so, what results do you get out of the fio statement below, per OSD? In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD, which is of course vastly faster than the normal individual HDDs could do.
>>>>>>
>>>>>> So I'm wondering if I'm hitting some inherent limitation of how fast a single OSD (as in the software) can handle IOPS, given that everything else has been ruled out from where I stand.
>>>>>>
>>>>>> This would also explain why none of the option changes or the use of RBD caching has any measurable effect in the test case below. As in, a slow OSD, aka a single HDD with the journal on the same disk, would clearly benefit from even the small 32MB standard RBD cache, while in my test case the only time the caching becomes noticeable is if I increase the cache size to something larger than the test data size. ^o^
>>>>>>
>>>>>> On the other hand, if people here regularly get thousands or tens of thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>>>>>
>>>>>> Christian
>>>>>>
>>>>>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>>>>>
>>>>>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>>>>>
>>>>>>>> Oh, I didn't notice that. I bet you aren't getting the expected throughput on the RAID array with OSD access patterns, and that's applying back pressure on the journal.
>>>>>>>>
>>>>>>> In the "a picture is worth a thousand words" tradition, I give you this iostat -x output taken during a fio run:
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>          50.82    0.00   19.43    0.17    0.00   29.58
>>>>>>>
>>>>>>> Device: rrqm/s wrqm/s   r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>>>>> sda       0.00  51.50  0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>>>>>> sdb       0.00   0.00  0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>>>>>> sdc       0.00   5.00  0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>>>>>> sdd       0.00   6.50  0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>>>>>>
>>>>>>> The %user CPU utilization is pretty much entirely the 2 OSD processes; note the nearly complete absence of iowait.
>>>>>>>
>>>>>>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs. Look at these numbers: the lack of queues, the low wait and service times (in ms), plus the overall utilization.
>>>>>>>
>>>>>>> The only conclusion I can draw from these numbers and the network results below is that the latency happens within the OSD processes.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian
>>>>>>>
>>>>>>>> When I suggested other tests, I meant with and without Ceph. One particular one is OSD bench. That should be interesting to try at a variety of block sizes. You could also try running RADOS bench and smalliobench at a few different sizes.
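For reference, the OSD bench and RADOS bench runs Greg suggests would look roughly like the commands below. This is only a sketch: the pool name ("rbd"), the duration and the block sizes are arbitrary assumptions, not values taken from this thread.

# write 1 GB directly on osd.0 in 4 KB chunks, bypassing the client path
ceph tell osd.0 bench 1073741824 4096
# same thing with 4 MB writes, to compare across block sizes
ceph tell osd.0 bench 1073741824 4194304

# 60-second 4 KB write test against a pool through librados, 128 ops in flight
rados bench -p rbd 60 write -b 4096 -t 128

smalliobench ships as a separate benchmark binary in the Ceph test tooling; the exact binary name and packaging vary by release, so check what your distribution installs.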
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Christian,
>>>>>>>>>
>>>>>>>>> Have you tried without RAID6, to have more OSDs? (How many disks do you have behind the RAID6?)
>>>>>>>>>
>>>>>>>>> Also, I know that direct I/O can be quite slow with Ceph, so maybe you can try without --direct=1, and also enable rbd_cache:
>>>>>>>>>
>>>>>>>>> ceph.conf
>>>>>>>>> [client]
>>>>>>>>> rbd cache = true
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>
>>>>>>>>> From: "Christian Balzer" <chibi at gol.com>
>>>>>>>>> To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
>>>>>>>>> Sent: Thursday, 8 May 2014 04:49:16
>>>>>>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>>>>>>>
>>>>>>>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
>>>>>>>>>>>
>>>>>>>>>>> Running this fio:
>>>>>>>>>>>
>>>>>>>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>>>>>>>>>>>
>>>>>>>>>>> results in:
>>>>>>>>>>>
>>>>>>>>>>> 30k IOPS on the journal SSD (as expected)
>>>>>>>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
>>>>>>>>>>> 3200 IOPS from a VM using userspace RBD
>>>>>>>>>>> 2900 IOPS from a host kernelspace mounted RBD
>>>>>>>>>>>
>>>>>>>>>>> When running the fio from the VM RBD, the utilization of the journals is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some obvious merging).
>>>>>>>>>>> The OSD processes are quite busy, reading well over 200% on atop, but the system is not CPU or otherwise resource starved at that moment.
>>>>>>>>>>>
>>>>>>>>>>> Running multiple instances of this test from several VMs on different hosts changes nothing, as in the aggregated IOPS for the whole cluster will still be around 3200 IOPS.
>>>>>>>>>>>
>>>>>>>>>>> Now clearly RBD has to deal with latency here, but the network is IPoIB with the associated low latency, and the journal SSDs are the (consistently) fastest ones around.
>>>>>>>>>>>
>>>>>>>>>>> I guess what I am wondering about is if this is normal and to be expected, or if not, where all that potential performance got lost.
>>>>>>>>>>>
>>>>>>>>>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>>>>>>>> Yes, but going down to 32 doesn't change things one iota.
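For the kernelspace number above, the test setup would have looked something along the lines of the sketch below; the pool and image names, the image size and the device path are made up for illustration, only the fio options come from the thread.

# create and map a test image (names and size are arbitrary)
rbd create rbd/fio-test --size 4096
rbd map rbd/fio-test        # shows up as /dev/rbd0 (and /dev/rbd/rbd/fio-test)

# the same fio job as above, pointed at the mapped device, here with iodepth 32
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=32

rbd unmap /dev/rbd0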
>>>>>>>>> Also note the multiple instances I mention up there, so that would be 256 IOs at a time, coming from different hosts over different links, and nothing changes.
>>>>>>>>>
>>>>>>>>>> that's about 40ms of latency per op (for userspace RBD), which seems awfully long. You should check what your client-side objecter settings are; it might be limiting you to fewer outstanding ops than that.
>>>>>>>>>>
>>>>>>>>> Googling for client-side objecter gives a few hits on ceph-devel and bugs, and nothing at all as far as configuration options are concerned. Care to enlighten me where one can find those?
>>>>>>>>>
>>>>>>>>> Also note the kernelspace (3.13 if it matters) speed, which is very much in the same (junior league) ballpark.
>>>>>>>>>
>>>>>>>>>> If it's available to you, testing with Firefly or even master would be interesting; there's some performance work that should reduce latencies.
>>>>>>>>>>
>>>>>>>>> Not an option, this is going into production next week.
>>>>>>>>>
>>>>>>>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster certainly doesn't require 40ms/op, so you should probably run a wider array of experiments to try and figure out where it's coming from.
>>>>>>>>>>
>>>>>>>>> I think we can rule out the network, NPtcp gives me:
>>>>>>>>> ---
>>>>>>>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> For comparison, at about 512KB it reaches maximum throughput and still isn't that laggy:
>>>>>>>>> ---
>>>>>>>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> So with the network performing as well as my lengthy experience with IPoIB led me to believe, what else is there to look at?
>>>>>>>>> The storage nodes perform just as expected, as indicated by the local fio tests.
>>>>>>>>>
>>>>>>>>> That pretty much leaves only Ceph/RBD to look at, and I'm not really sure what experiments I should run on that. ^o^
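On the objecter question above: the client-side settings Greg is most likely referring to are the objecter throttles, which go in the [client] section of ceph.conf on the machine running the librbd/librados client. The 40ms figure is simply 128 outstanding ops divided by 3200 IOPS (128 / 3200 s = 0.04 s). The snippet below is a sketch; the values shown are, as far as I recall, the defaults of that era (1024 ops / 100 MB in flight), so they should not be the limit at iodepth 128, but raising them is a cheap experiment.

[client]
# maximum number of in-flight ops the objecter keeps outstanding
objecter inflight ops = 1024
# maximum bytes of in-flight ops (100 MB)
objecter inflight op bytes = 104857600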
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>> -Greg
>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>>>> chibi at gol.com           Global OnLine Japan/Fusion Communications
>>>>>>>>> http://www.gol.com/
>>>>
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> chibi at gol.com           Global OnLine Japan/Fusion Communications
>>>> http://www.gol.com/
>>>>

_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com