I forgot to mention, of course on a 10GbE network.

German Anders
Field Storage Support Engineer
Despegar.com - IT Team

> --- Original message ---
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> From: German Anders <ganders at despegar.com>
> To: Christian Balzer <chibi at gol.com>
> Cc: <ceph-users at lists.ceph.com>
> Date: Wednesday, 14/05/2014 09:41
>
> Has anyone managed to get a throughput of 600MB/s or more on RBD
> doing (rw) with a block size of 32768k?
>
> German Anders
> Field Storage Support Engineer
> Despegar.com - IT Team
>
>> --- Original message ---
>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>> From: Christian Balzer <chibi at gol.com>
>> To: Josef Johansson <josef at oderland.se>
>> Cc: <ceph-users at lists.ceph.com>
>> Date: Wednesday, 14/05/2014 09:33
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>>> Hi Christian,
>>>
>>> I missed this thread; I haven't been reading the list that well the
>>> last few weeks.
>>>
>>> You already know my setup, since we discussed it in an earlier thread.
>>> I don't have a fast backing store, but I see the slow IOPS when doing
>>> randwrite inside the VM, with rbd cache. Still running dumpling here
>>> though.
>>>
>> Nods, I do recall that thread.
>>
>>> A thought struck me that I could test with a pool that consists of
>>> OSDs that have tmpfs-based disks. I think I have a bit more latency
>>> than your IPoIB, but I've pushed 100k IOPS with the same network
>>> devices before. This would verify whether the problem is with the
>>> journal disks. I'll also try to run the journal devices in tmpfs as
>>> well, as that would test Ceph itself in isolation.
>>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization and the
>> actual filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>>> I'll get back to you with the results, hopefully I'll manage to get
>>> them done during this night.
>>>
>> Looking forward to that. ^^
>>
>> Christian
>>
>>> Cheers,
>>> Josef
>>>
>>> On 13/05/14 11:03, Christian Balzer wrote:
>>>>
>>>> I'm clearly talking to myself, but whatever.
>>>>
>>>> For Greg, I've played with all the pertinent journal and filestore
>>>> options and TCP nodelay: no changes at all.
>>>>
>>>> Is there anybody on this ML who's running a Ceph cluster with a fast
>>>> network and a FAST filestore, i.e. like me with a big HW cache in
>>>> front of RAIDs/JBODs, or using SSDs for final storage?
>>>>
>>>> If so, what results do you get out of the fio statement below per OSD?
>>>> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
>>>> which is of course vastly faster than the normal individual HDDs
>>>> could do.
>>>>
>>>> So I'm wondering if I'm hitting some inherent limitation of how fast a
>>>> single OSD (as in the software) can handle IOPS, given that everything
>>>> else has been ruled out from where I stand.
>>>>
>>>> This would also explain why none of the option changes or the use of
>>>> RBD caching has any measurable effect in the test case below.
>>>> As in, a slow OSD, i.e. a single HDD with the journal on the same
>>>> disk, would clearly benefit from even the small 32MB standard RBD
>>>> cache, while in my test case the only time the caching becomes
>>>> noticeable is if I increase the cache size to something larger than
>>>> the test data size. ^o^
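>>>> (Concretely, "larger than the test data size" means overriding the
>>>> client-side cache settings along these lines; the option names are the
>>>> stock RBD cache knobs, but the byte values here are purely illustrative:
>>>>
>>>> [client]
>>>> rbd cache = true
>>>> # 512MB, i.e. comfortably bigger than the 400MB fio working set
>>>> rbd cache size = 536870912
>>>> rbd cache max dirty = 402653184
>>>> rbd cache target dirty = 268435456
>>>> )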
>>>>
>>>> On the other hand, if people here regularly get thousands or tens of
>>>> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>>>
>>>> Christian
>>>>
>>>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>>>
>>>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>>>
>>>>>> Oh, I didn't notice that. I bet you aren't getting the expected
>>>>>> throughput on the RAID array with OSD access patterns, and that's
>>>>>> applying back pressure on the journal.
>>>>>>
>>>>> In the "a picture is worth a thousand words" tradition, I give you
>>>>> this iostat -x output taken during a fio run:
>>>>>
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>           50.82    0.00   19.43    0.17    0.00   29.58
>>>>>
>>>>> Device: rrqm/s wrqm/s   r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>>> sda       0.00  51.50  0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>>>> sdb       0.00   0.00  0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>>>> sdc       0.00   5.00  0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>>>> sdd       0.00   6.50  0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>>>>
>>>>> The %user CPU utilization is pretty much entirely the 2 OSD processes;
>>>>> note the nearly complete absence of iowait.
>>>>>
>>>>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
>>>>> Look at these numbers: the lack of queues, the low wait and service
>>>>> times (in ms), plus the overall utilization.
>>>>>
>>>>> The only conclusion I can draw from these numbers and the network
>>>>> results below is that the latency happens within the OSD processes.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Christian
>>>>>
>>>>>> When I suggested other tests, I meant with and without Ceph. One
>>>>>> particular one is OSD bench. That should be interesting to try at a
>>>>>> variety of block sizes. You could also try running RADOS bench and
>>>>>> smalliobench at a few different sizes.
>>>>>> -Greg
>>>>>>
>>>>>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Have you tried without RAID6, to get more OSDs?
>>>>>>> (How many disks do you have behind the RAID6?)
>>>>>>>
>>>>>>> Also, I know that direct IOs can be quite slow with Ceph,
>>>>>>> so maybe you can try without --direct=1
>>>>>>> and also enable rbd_cache:
>>>>>>>
>>>>>>> ceph.conf
>>>>>>> [client]
>>>>>>> rbd cache = true
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Christian Balzer" <chibi at gol.com>
>>>>>>> To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
>>>>>>> Sent: Thursday, 8 May 2014 04:49:16
>>>>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>>>>>
>>>>>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>>>>>>
>>>>>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
>>>>>>>>> journals are on (separate) DC 3700s, the actual OSDs are RAID6
>>>>>>>>> behind an Areca 1882 with 4GB of cache.
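>>>>>>>>> (Journal placement is the usual one-SSD-partition-per-OSD setup,
>>>>>>>>> i.e. roughly the following in ceph.conf; the device paths here are
>>>>>>>>> placeholders rather than the actual ones:
>>>>>>>>>
>>>>>>>>> [osd.0]
>>>>>>>>> osd journal = /dev/disk/by-partlabel/journal-osd0
>>>>>>>>> [osd.1]
>>>>>>>>> osd journal = /dev/disk/by-partlabel/journal-osd1
>>>>>>>>> )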
>>>>>>>>>
>>>>>>>>> Running this fio:
>>>>>>>>>
>>>>>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
>>>>>>>>> --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
>>>>>>>>> --iodepth=128
>>>>>>>>>
>>>>>>>>> results in:
>>>>>>>>>
>>>>>>>>> 30k IOPS on the journal SSD (as expected)
>>>>>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise
>>>>>>>>> there)
>>>>>>>>> 3200 IOPS from a VM using userspace RBD
>>>>>>>>> 2900 IOPS from a host kernelspace mounted RBD
>>>>>>>>>
>>>>>>>>> When running the fio from the VM RBD the utilization of the
>>>>>>>>> journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
>>>>>>>>> (1500 IOPS after some obvious merging).
>>>>>>>>> The OSD processes are quite busy, reading well over 200% on atop,
>>>>>>>>> but the system is not CPU or otherwise resource starved at that
>>>>>>>>> moment.
>>>>>>>>>
>>>>>>>>> Running multiple instances of this test from several VMs on
>>>>>>>>> different hosts changes nothing, as in the aggregated IOPS for
>>>>>>>>> the whole cluster will still be around 3200 IOPS.
>>>>>>>>>
>>>>>>>>> Now clearly RBD has to deal with latency here, but the network is
>>>>>>>>> IPoIB with the associated low latency and the journal SSDs are
>>>>>>>>> the (consistently) fastest ones around.
>>>>>>>>>
>>>>>>>>> I guess what I am wondering about is whether this is normal and to
>>>>>>>>> be expected, or if not, where all that potential performance got
>>>>>>>>> lost.
>>>>>>>>
>>>>>>>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>>>>>>>
>>>>>>> Yes, but going down to 32 doesn't change things one iota.
>>>>>>> Also note the multiple instances I mention up there, so that would
>>>>>>> be 256 IOs at a time, coming from different hosts over different
>>>>>>> links, and nothing changes.
>>>>>>>
>>>>>>>> that's about 40ms of latency per op (for userspace RBD), which
>>>>>>>> seems awfully long. You should check what your client-side objecter
>>>>>>>> settings are; it might be limiting you to fewer outstanding ops
>>>>>>>> than that.
>>>>>>>>
>>>>>>> Googling for client-side objecter gives a few hits on ceph-devel and
>>>>>>> in bug reports, and nothing at all as far as configuration options
>>>>>>> are concerned. Care to enlighten me where one can find those?
>>>>>>>
>>>>>>> Also note the kernelspace (3.13 if it matters) speed, which is very
>>>>>>> much in the same (junior league) ballpark.
>>>>>>>
>>>>>>>> If it's available to you, testing with Firefly or even master would
>>>>>>>> be interesting; there's some performance work that should reduce
>>>>>>>> latencies.
>>>>>>>>
>>>>>>> Not an option, this is going into production next week.
>>>>>>>
>>>>>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
>>>>>>>> certainly doesn't require 40ms/op, so you should probably run a
>>>>>>>> wider array of experiments to try and figure out where it's coming
>>>>>>>> from.
>>>>>>>>
>>>>>>> I think we can rule out the network; NPtcp gives me:
>>>>>>> ---
>>>>>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>>>>>>> ---
>>>>>>>
>>>>>>> For comparison, at about 512KB it reaches maximum throughput and
>>>>>>> still isn't that laggy:
>>>>>>> ---
>>>>>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>>>>>>> ---
>>>>>>>
>>>>>>> So with the network performing as well as my lengthy experience with
>>>>>>> IPoIB led me to believe, what else is there to look at?
>>>>>>> The storage nodes perform just as expected, as indicated by the
>>>>>>> local fio tests.
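>>>>>>> (For reference, the NetPIPE numbers above are from the stock TCP
>>>>>>> test: roughly "NPtcp" on the receiving node and "NPtcp -h <peer>" on
>>>>>>> the sending one, with <peer> being a placeholder for the other
>>>>>>> node's IPoIB address.)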
>>>>>>>
>>>>>>> That pretty much leaves only Ceph/RBD to look at, and I'm not really
>>>>>>> sure what experiments I should run on that. ^o^
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian
>>>>>>>
>>>>>>>> -Greg
>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>
>>>>>>> --
>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>> chibi at gol.com        Global OnLine Japan/Fusion Communications
>>>>>>> http://www.gol.com/
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com        Global OnLine Japan/Fusion Communications
>> http://www.gol.com/