Hi Josef,

Thanks a lot for the quick answer. Yes, 32M and random writes. Also, do you get those values with an MTU of 9000, or with the traditional and beloved MTU of 1500?

German Anders
Field Storage Support Engineer
Despegar.com - IT Team

> --- Original message ---
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> From: Josef Johansson <josef at oderland.se>
> To: <ceph-users at lists.ceph.com>
> Date: Wednesday, 14/05/2014 10:10
>
> Hi,
>
> On 14/05/14 14:45, German Anders wrote:
>> I forgot to mention, of course on a 10GbE network.
>>
>> German Anders
>> Field Storage Support Engineer
>> Despegar.com - IT Team
>>
>>> --- Original message ---
>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>> From: German Anders <ganders at despegar.com>
>>> To: Christian Balzer <chibi at gol.com>
>>> Cc: <ceph-users at lists.ceph.com>
>>> Date: Wednesday, 14/05/2014 09:41
>>>
>>> Has anyone been able to get a throughput on RBD of 600MB/s or more on (rw) with a block size of 32768k?
>
> Is that 32M then? Sequential or randwrite?
>
> I get about those speeds when doing (1M block size) buffered writes from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
>
> Cheers,
> Josef
>
>>> German Anders
>>> Field Storage Support Engineer
>>> Despegar.com - IT Team
>>>
>>>> --- Original message ---
>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>> From: Christian Balzer <chibi at gol.com>
>>>> To: Josef Johansson <josef at oderland.se>
>>>> Cc: <ceph-users at lists.ceph.com>
>>>> Date: Wednesday, 14/05/2014 09:33
>>>>
>>>> Hello!
>>>>
>>>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>>>
>>>>> Hi Christian,
>>>>>
>>>>> I missed this thread, haven't been reading the list that well the last weeks.
>>>>>
>>>>> You already know my setup, since we discussed it in an earlier thread. I don't have a fast backing store, but I see the slow IOPS when doing randwrite inside the VM, with rbd cache. Still running dumpling here though.
>>>>>
>>>> Nods, I do recall that thread.
>>>>
>>>>> A thought struck me that I could test with a pool that consists of OSDs that have tmpfs-based disks; I think I have a bit more latency than your IPoIB, but I've pushed 100k IOPS with the same network devices before. This would verify if the problem is with the journal disks. I'll also try to run the journal devices in tmpfs as well, as it would test purely Ceph itself.
>>>>>
>>>> That would be interesting indeed.
>>>> Given what I've seen (with the journal at 20% utilization and the actual filestore at around 5%) I'd expect Ceph to be the culprit.
>>>>
>>>>> I'll get back to you with the results, hopefully I'll manage to get them done during this night.
>>>>>
>>>> Looking forward to that. ^^
>>>>
>>>> Christian
>>>>
>>>>> Cheers,
>>>>> Josef
>>>>>
>>>>> On 13/05/14 11:03, Christian Balzer wrote:
>>>>>
>>>>>> I'm clearly talking to myself, but whatever.
>>>>>>
>>>>>> For Greg, I've played with all the pertinent journal and filestore options and TCP nodelay, no changes at all.
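For reference, the "pertinent journal and filestore options" mentioned above would normally be set in the [osd] section of ceph.conf. The snippet below is only a rough sketch of that kind of tuning pass: the option names are the standard dumpling/emperor-era ones, but the values are illustrative assumptions, not the ones used on Christian's cluster.

[osd]
# journal write batching (values here are purely illustrative)
journal max write entries = 1000
journal max write bytes = 10485760
journal queue max ops = 3000
# filestore queueing and sync behaviour (again, illustrative values)
filestore queue max ops = 500
filestore max sync interval = 10
filestore op threads = 4
# disable Nagle on the messenger sockets (true is already the default)
ms tcp nodelay = true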
>>>>>> Is there anybody on this ML who's running a Ceph cluster with a fast network and a FAST filestore, so like me with a big HW cache in front of RAID/JBODs, or using SSDs for final storage?
>>>>>>
>>>>>> If so, what results do you get out of the fio statement below, per OSD? In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD, which is of course vastly faster than the normal individual HDDs could do.
>>>>>>
>>>>>> So I'm wondering if I'm hitting some inherent limitation of how fast a single OSD (as in the software) can handle IOPS, given that everything else has been ruled out from where I stand.
>>>>>>
>>>>>> This would also explain why none of the option changes or the use of RBD caching has any measurable effect in the test case below. As in, a slow OSD, aka a single HDD with the journal on the same disk, would clearly benefit from even the small 32MB standard RBD cache, while in my test case the only time the caching becomes noticeable is if I increase the cache size to something larger than the test data size. ^o^
>>>>>>
>>>>>> On the other hand, if people here regularly get thousands or tens of thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>>>>>
>>>>>> Christian
>>>>>>
>>>>>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>>>>>
>>>>>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>>>>>
>>>>>>>> Oh, I didn't notice that. I bet you aren't getting the expected throughput on the RAID array with OSD access patterns, and that's applying back pressure on the journal.
>>>>>>>>
>>>>>>> In the "a picture is worth a thousand words" tradition, I give you this iostat -x output taken during a fio run:
>>>>>>>
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>          50.82    0.00   19.43    0.17    0.00   29.58
>>>>>>>
>>>>>>> Device: rrqm/s wrqm/s   r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>>>>> sda       0.00  51.50  0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>>>>>> sdb       0.00   0.00  0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>>>>>> sdc       0.00   5.00  0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>>>>>> sdd       0.00   6.50  0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>>>>>>
>>>>>>> The %user CPU utilization is pretty much entirely the 2 OSD processes; note the nearly complete absence of iowait.
>>>>>>>
>>>>>>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs. Look at these numbers: the lack of queues, the low wait and service times (in ms), plus the overall utilization.
>>>>>>>
>>>>>>> The only conclusion I can draw from these numbers and the network results below is that the latency happens within the OSD processes.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian
>>>>>>>
>>>>>>>> When I suggested other tests, I meant with and without Ceph. One particular one is OSD bench. That should be interesting to try at a variety of block sizes. You could also try running RADOS bench and smalliobench at a few different sizes.
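For reference, the OSD bench and RADOS bench runs Greg suggests would look roughly like the commands below. This is only a sketch: the pool name ("rbd"), the duration and the block sizes are arbitrary assumptions, not values taken from this thread.

# write 1 GB directly on osd.0 in 4 KB chunks, bypassing the client path
ceph tell osd.0 bench 1073741824 4096
# same thing with 4 MB writes, to compare across block sizes
ceph tell osd.0 bench 1073741824 4194304

# 60-second 4 KB write test against a pool through librados, 128 ops in flight
rados bench -p rbd 60 write -b 4096 -t 128

smalliobench ships as a separate benchmark binary in the Ceph test tooling; the exact binary name and packaging vary by release, so check what your distribution installs.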
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Christian,
>>>>>>>>>
>>>>>>>>> Have you tried without RAID6, to have more OSDs? (How many disks do you have behind the RAID6?)
>>>>>>>>>
>>>>>>>>> Also, I know that direct I/O can be quite slow with Ceph, so maybe you can try without --direct=1, and also enable rbd_cache:
>>>>>>>>>
>>>>>>>>> ceph.conf
>>>>>>>>> [client]
>>>>>>>>> rbd cache = true
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>
>>>>>>>>> From: "Christian Balzer" <chibi at gol.com>
>>>>>>>>> To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
>>>>>>>>> Sent: Thursday, 8 May 2014 04:49:16
>>>>>>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>>>>>>>
>>>>>>>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind an Areca 1882 with 4GB of cache.
>>>>>>>>>>>
>>>>>>>>>>> Running this fio:
>>>>>>>>>>>
>>>>>>>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>>>>>>>>>>>
>>>>>>>>>>> results in:
>>>>>>>>>>>
>>>>>>>>>>> 30k IOPS on the journal SSD (as expected)
>>>>>>>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
>>>>>>>>>>> 3200 IOPS from a VM using userspace RBD
>>>>>>>>>>> 2900 IOPS from a host kernelspace mounted RBD
>>>>>>>>>>>
>>>>>>>>>>> When running the fio from the VM RBD, the utilization of the journals is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after some obvious merging).
>>>>>>>>>>> The OSD processes are quite busy, reading well over 200% on atop, but the system is not CPU or otherwise resource starved at that moment.
>>>>>>>>>>>
>>>>>>>>>>> Running multiple instances of this test from several VMs on different hosts changes nothing, as in the aggregated IOPS for the whole cluster will still be around 3200 IOPS.
>>>>>>>>>>>
>>>>>>>>>>> Now clearly RBD has to deal with latency here, but the network is IPoIB with the associated low latency, and the journal SSDs are the (consistently) fastest ones around.
>>>>>>>>>>>
>>>>>>>>>>> I guess what I am wondering about is if this is normal and to be expected, or if not, where all that potential performance got lost.
>>>>>>>>>>>
>>>>>>>>>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>>>>>>>> Yes, but going down to 32 doesn't change things one iota.
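For the kernelspace number above, the test setup would have looked something along the lines of the sketch below; the pool and image names, the image size and the device path are made up for illustration, only the fio options come from the thread.

# create and map a test image (names and size are arbitrary)
rbd create rbd/fio-test --size 4096
rbd map rbd/fio-test        # shows up as /dev/rbd0 (and /dev/rbd/rbd/fio-test)

# the same fio job as above, pointed at the mapped device, here with iodepth 32
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=32

rbd unmap /dev/rbd0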
>>>>>>>>> Also note the multiple instances I mention up there, so that would be 256 IOs at a time, coming from different hosts over different links, and nothing changes.
>>>>>>>>>
>>>>>>>>>> that's about 40ms of latency per op (for userspace RBD), which seems awfully long. You should check what your client-side objecter settings are; it might be limiting you to fewer outstanding ops than that.
>>>>>>>>>>
>>>>>>>>> Googling for client-side objecter gives a few hits on ceph-devel and bugs, and nothing at all as far as configuration options are concerned. Care to enlighten me where one can find those?
>>>>>>>>>
>>>>>>>>> Also note the kernelspace (3.13 if it matters) speed, which is very much in the same (junior league) ballpark.
>>>>>>>>>
>>>>>>>>>> If it's available to you, testing with Firefly or even master would be interesting; there's some performance work that should reduce latencies.
>>>>>>>>>>
>>>>>>>>> Not an option, this is going into production next week.
>>>>>>>>>
>>>>>>>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster certainly doesn't require 40ms/op, so you should probably run a wider array of experiments to try and figure out where it's coming from.
>>>>>>>>>>
>>>>>>>>> I think we can rule out the network, NPtcp gives me:
>>>>>>>>> ---
>>>>>>>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> For comparison, at about 512KB it reaches maximum throughput and still isn't that laggy:
>>>>>>>>> ---
>>>>>>>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> So with the network performing as well as my lengthy experience with IPoIB led me to believe, what else is there to look at?
>>>>>>>>> The storage nodes perform just as expected, as indicated by the local fio tests.
>>>>>>>>>
>>>>>>>>> That pretty much leaves only Ceph/RBD to look at, and I'm not really sure what experiments I should run on that. ^o^
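On the objecter question above: the client-side settings Greg is most likely referring to are the objecter throttles, which go in the [client] section of ceph.conf on the machine running the librbd/librados client. The 40ms figure is simply 128 outstanding ops divided by 3200 IOPS (128 / 3200 s = 0.04 s). The snippet below is a sketch; the values shown are, as far as I recall, the defaults of that era (1024 ops / 100 MB in flight), so they should not be the limit at iodepth 128, but raising them is a cheap experiment.

[client]
# maximum number of in-flight ops the objecter keeps outstanding
objecter inflight ops = 1024
# maximum bytes of in-flight ops (100 MB)
objecter inflight op bytes = 104857600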
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>> -Greg
>>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>>>> chibi at gol.com           Global OnLine Japan/Fusion Communications
>>>>>>>>> http://www.gol.com/
>>>>
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> chibi at gol.com           Global OnLine Japan/Fusion Communications
>>>> http://www.gol.com/
>>>>

_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com