I forgot to mention, of course on a 10GbE network.

German Anders
Field Storage Support Engineer
Despegar.com - IT Team

> --- Original message ---
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> From: German Anders <ganders at despegar.com>
> To: Christian Balzer <chibi at gol.com>
> Cc: <ceph-users at lists.ceph.com>
> Date: Wednesday, 14/05/2014 09:41
>
> Has anyone managed to get a throughput of 600MB/s or more on RBD
> doing (rw) with a block size of 32768k?
>
> German Anders
> Field Storage Support Engineer
> Despegar.com - IT Team
>
>> --- Original message ---
>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>> From: Christian Balzer <chibi at gol.com>
>> To: Josef Johansson <josef at oderland.se>
>> Cc: <ceph-users at lists.ceph.com>
>> Date: Wednesday, 14/05/2014 09:33
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>>> Hi Christian,
>>>
>>> I missed this thread; I haven't been reading the list that well the
>>> last few weeks.
>>>
>>> You already know my setup, since we discussed it in an earlier thread.
>>> I don't have a fast backing store, but I see the slow IOPS when doing
>>> randwrite inside the VM, with rbd cache. Still running dumpling here
>>> though.
>>>
>> Nods, I do recall that thread.
>>
>>> A thought struck me that I could test with a pool that consists of
>>> OSDs that have tmpfs-based disks. I think I have a bit more latency
>>> than your IPoIB, but I've pushed 100k IOPS with the same network
>>> devices before. This would verify whether the problem is with the
>>> journal disks. I'll also try to run the journal devices in tmpfs as
>>> well, as that would test Ceph itself in isolation.
>>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization and the
>> actual filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>>> I'll get back to you with the results, hopefully I'll manage to get
>>> them done during this night.
>>>
>> Looking forward to that. ^^
>>
>> Christian
>>
>>> Cheers,
>>> Josef
>>>
>>> On 13/05/14 11:03, Christian Balzer wrote:
>>>>
>>>> I'm clearly talking to myself, but whatever.
>>>>
>>>> For Greg, I've played with all the pertinent journal and filestore
>>>> options and TCP nodelay: no changes at all.
>>>>
>>>> Is there anybody on this ML who's running a Ceph cluster with a fast
>>>> network and a FAST filestore, i.e. like me with a big HW cache in
>>>> front of RAIDs/JBODs, or using SSDs for final storage?
>>>>
>>>> If so, what results do you get out of the fio statement below per OSD?
>>>> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
>>>> which is of course vastly faster than the normal individual HDDs
>>>> could do.
>>>>
>>>> So I'm wondering if I'm hitting some inherent limitation of how fast a
>>>> single OSD (as in the software) can handle IOPS, given that everything
>>>> else has been ruled out from where I stand.
>>>>
>>>> This would also explain why none of the option changes or the use of
>>>> RBD caching has any measurable effect in the test case below.
>>>> As in, a slow OSD, i.e. a single HDD with the journal on the same
>>>> disk, would clearly benefit from even the small 32MB standard RBD
>>>> cache, while in my test case the only time the caching becomes
>>>> noticeable is if I increase the cache size to something larger than
>>>> the test data size. ^o^
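>>>> (Concretely, "larger than the test data size" means overriding the
>>>> client-side cache settings along these lines; the option names are the
>>>> stock RBD cache knobs, but the byte values here are purely illustrative:
>>>>
>>>> [client]
>>>> rbd cache = true
>>>> # 512MB, i.e. comfortably bigger than the 400MB fio working set
>>>> rbd cache size = 536870912
>>>> rbd cache max dirty = 402653184
>>>> rbd cache target dirty = 268435456
>>>> )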
>>>>
>>>> On the other hand, if people here regularly get thousands or tens of
>>>> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>>>
>>>> Christian
>>>>
>>>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>>>
>>>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>>>
>>>>>> Oh, I didn't notice that. I bet you aren't getting the expected
>>>>>> throughput on the RAID array with OSD access patterns, and that's
>>>>>> applying back pressure on the journal.
>>>>>>
>>>>> In the "a picture is worth a thousand words" tradition, I give you
>>>>> this iostat -x output taken during a fio run:
>>>>>
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>           50.82    0.00   19.43    0.17    0.00   29.58
>>>>>
>>>>> Device: rrqm/s wrqm/s   r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>>> sda       0.00  51.50  0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>>>> sdb       0.00   0.00  0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>>>> sdc       0.00   5.00  0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>>>> sdd       0.00   6.50  0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>>>>
>>>>> The %user CPU utilization is pretty much entirely the 2 OSD processes;
>>>>> note the nearly complete absence of iowait.
>>>>>
>>>>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
>>>>> Look at these numbers: the lack of queues, the low wait and service
>>>>> times (in ms), plus the overall utilization.
>>>>>
>>>>> The only conclusion I can draw from these numbers and the network
>>>>> results below is that the latency happens within the OSD processes.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Christian
>>>>>
>>>>>> When I suggested other tests, I meant with and without Ceph. One
>>>>>> particular one is OSD bench. That should be interesting to try at a
>>>>>> variety of block sizes. You could also try running RADOS bench and
>>>>>> smalliobench at a few different sizes.
>>>>>> -Greg
>>>>>>
>>>>>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Have you tried without RAID6, to get more OSDs?
>>>>>>> (How many disks do you have behind the RAID6?)
>>>>>>>
>>>>>>> Also, I know that direct IOs can be quite slow with Ceph,
>>>>>>> so maybe you can try without --direct=1
>>>>>>> and also enable rbd_cache:
>>>>>>>
>>>>>>> ceph.conf
>>>>>>> [client]
>>>>>>> rbd cache = true
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Christian Balzer" <chibi at gol.com>
>>>>>>> To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
>>>>>>> Sent: Thursday, 8 May 2014 04:49:16
>>>>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>>>>>
>>>>>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>>>>>>
>>>>>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
>>>>>>>>> journals are on (separate) DC 3700s, the actual OSDs are RAID6
>>>>>>>>> behind an Areca 1882 with 4GB of cache.
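>>>>>>>>> (Journal placement is the usual one-SSD-partition-per-OSD setup,
>>>>>>>>> i.e. roughly the following in ceph.conf; the device paths here are
>>>>>>>>> placeholders rather than the actual ones:
>>>>>>>>>
>>>>>>>>> [osd.0]
>>>>>>>>> osd journal = /dev/disk/by-partlabel/journal-osd0
>>>>>>>>> [osd.1]
>>>>>>>>> osd journal = /dev/disk/by-partlabel/journal-osd1
>>>>>>>>> )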
>>>>>>>>>
>>>>>>>>> Running this fio:
>>>>>>>>>
>>>>>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
>>>>>>>>> --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
>>>>>>>>> --iodepth=128
>>>>>>>>>
>>>>>>>>> results in:
>>>>>>>>>
>>>>>>>>> 30k IOPS on the journal SSD (as expected)
>>>>>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise
>>>>>>>>> there)
>>>>>>>>> 3200 IOPS from a VM using userspace RBD
>>>>>>>>> 2900 IOPS from a host kernelspace mounted RBD
>>>>>>>>>
>>>>>>>>> When running the fio from the VM RBD the utilization of the
>>>>>>>>> journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
>>>>>>>>> (1500 IOPS after some obvious merging).
>>>>>>>>> The OSD processes are quite busy, reading well over 200% on atop,
>>>>>>>>> but the system is not CPU or otherwise resource starved at that
>>>>>>>>> moment.
>>>>>>>>>
>>>>>>>>> Running multiple instances of this test from several VMs on
>>>>>>>>> different hosts changes nothing, as in the aggregated IOPS for
>>>>>>>>> the whole cluster will still be around 3200 IOPS.
>>>>>>>>>
>>>>>>>>> Now clearly RBD has to deal with latency here, but the network is
>>>>>>>>> IPoIB with the associated low latency and the journal SSDs are
>>>>>>>>> the (consistently) fastest ones around.
>>>>>>>>>
>>>>>>>>> I guess what I am wondering about is whether this is normal and to
>>>>>>>>> be expected, or if not, where all that potential performance got
>>>>>>>>> lost.
>>>>>>>>
>>>>>>>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>>>>>>>
>>>>>>> Yes, but going down to 32 doesn't change things one iota.
>>>>>>> Also note the multiple instances I mention up there, so that would
>>>>>>> be 256 IOs at a time, coming from different hosts over different
>>>>>>> links, and nothing changes.
>>>>>>>
>>>>>>>> that's about 40ms of latency per op (for userspace RBD), which
>>>>>>>> seems awfully long. You should check what your client-side objecter
>>>>>>>> settings are; it might be limiting you to fewer outstanding ops
>>>>>>>> than that.
>>>>>>>>
>>>>>>> Googling for client-side objecter gives a few hits on ceph-devel and
>>>>>>> in bug reports, and nothing at all as far as configuration options
>>>>>>> are concerned. Care to enlighten me where one can find those?
>>>>>>>
>>>>>>> Also note the kernelspace (3.13 if it matters) speed, which is very
>>>>>>> much in the same (junior league) ballpark.
>>>>>>>
>>>>>>>> If it's available to you, testing with Firefly or even master would
>>>>>>>> be interesting; there's some performance work that should reduce
>>>>>>>> latencies.
>>>>>>>>
>>>>>>> Not an option, this is going into production next week.
>>>>>>>
>>>>>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
>>>>>>>> certainly doesn't require 40ms/op, so you should probably run a
>>>>>>>> wider array of experiments to try and figure out where it's coming
>>>>>>>> from.
>>>>>>>>
>>>>>>> I think we can rule out the network; NPtcp gives me:
>>>>>>> ---
>>>>>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>>>>>>> ---
>>>>>>>
>>>>>>> For comparison, at about 512KB it reaches maximum throughput and
>>>>>>> still isn't that laggy:
>>>>>>> ---
>>>>>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>>>>>>> ---
>>>>>>>
>>>>>>> So with the network performing as well as my lengthy experience with
>>>>>>> IPoIB led me to believe, what else is there to look at?
>>>>>>> The storage nodes perform just as expected, as indicated by the
>>>>>>> local fio tests.
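>>>>>>> (For reference, the NetPIPE numbers above are from the stock TCP
>>>>>>> test: roughly "NPtcp" on the receiving node and "NPtcp -h <peer>" on
>>>>>>> the sending one, with <peer> being a placeholder for the other
>>>>>>> node's IPoIB address.)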
>>>>>>>
>>>>>>> That pretty much leaves only Ceph/RBD to look at, and I'm not really
>>>>>>> sure what experiments I should run on that. ^o^
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian
>>>>>>>
>>>>>>>> -Greg
>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>
>>>>>>> --
>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>> chibi at gol.com        Global OnLine Japan/Fusion Communications
>>>>>>> http://www.gol.com/
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com        Global OnLine Japan/Fusion Communications
>> http://www.gol.com/