Hi,

Yeah, running with MTU 9000 here, but that test was with sequential writes. Just ran:

rbd -p shared-1 bench-write test --io-size $((32*1024*1024)) --io-pattern rand

The cluster itself showed 700MB/s of writes (3x replicas), but the test managed just 45MB/s. But I think rbd is a little bit broken ;)

Cheers,
Josef

On 14/05/14 15:23, German Anders wrote:
> Hi Josef,
> Thanks a lot for the quick answer.
>
> Yes, 32M and rand writes.
>
> And also, do you get those values, I guess, with an MTU of 9000 or with
> the traditional and beloved MTU 1500?
>
> German Anders
> Field Storage Support Engineer
> Despegar.com - IT Team
>
>> --- Original message ---
>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>> From: Josef Johansson <josef at oderland.se>
>> To: <ceph-users at lists.ceph.com>
>> Date: Wednesday, 14/05/2014 10:10
>>
>> Hi,
>>
>> On 14/05/14 14:45, German Anders wrote:
>>
>> I forgot to mention, of course on a 10GbE network.
>>
>> --- Original message ---
>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>> From: German Anders <ganders at despegar.com>
>> To: Christian Balzer <chibi at gol.com>
>> Cc: <ceph-users at lists.ceph.com>
>> Date: Wednesday, 14/05/2014 09:41
>>
>> Has anyone been able to get a throughput of 600MB/s or more out of RBD
>> for (rw) with a block size of 32768k?
>>
>> Is that 32M then?
>> Sequential or randwrite?
>>
>> I get about those speeds when doing buffered writes (1M block size)
>> from within a VM on 20GbE. The cluster maxes out at about 900MB/s.
>>
>> Cheers,
>> Josef
>>
>> --- Original message ---
>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>> From: Christian Balzer <chibi at gol.com>
>> To: Josef Johansson <josef at oderland.se>
>> Cc: <ceph-users at lists.ceph.com>
>> Date: Wednesday, 14/05/2014 09:33
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>> Hi Christian,
>>
>> I missed this thread, haven't been reading the list that closely the
>> last few weeks.
>>
>> You already know my setup, since we discussed it in an earlier thread.
>> I don't have a fast backing store, but I see the slow IOPS when doing
>> randwrite inside the VM, with rbd cache. Still running dumpling here
>> though.
>>
>> Nods, I do recall that thread.
>>
>> A thought struck me that I could test with a pool that consists of OSDs
>> that have tmpfs-based disks; I think I have a bit more latency than
>> your IPoIB, but I've pushed 100k IOPS with the same network devices
>> before. This would verify whether the problem is with the journal
>> disks. I'll also try to run the journal devices on tmpfs as well, as
>> that would test purely Ceph itself.
>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization and the
>> actual filestore at around 5%) I'd expect Ceph to be the culprit.
>>
>> I'll get back to you with the results, hopefully I'll manage to get
>> them done during this night.
>>
>> Looking forward to that. ^^
>>
>> Christian
>>
>> Cheers,
>> Josef
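A rough sketch of the journal-on-tmpfs test described above, for anyone who wants to reproduce it. This is not from the thread; the OSD id, mount point and sysvinit invocation are assumptions, and a tmpfs journal is volatile, so only do this on a throwaway test cluster:

# size the tmpfs larger than the configured "osd journal size"
mkdir -p /mnt/tmpfs-journal
mount -t tmpfs -o size=6G tmpfs /mnt/tmpfs-journal

# keep the cluster from rebalancing while the OSD is down
ceph osd set noout

# stop the OSD, flush and retire its current journal, then point it at tmpfs
service ceph stop osd.0
ceph-osd -i 0 --flush-journal
mv /var/lib/ceph/osd/ceph-0/journal /var/lib/ceph/osd/ceph-0/journal.old
ln -s /mnt/tmpfs-journal/osd-0-journal /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
service ceph start osd.0

ceph osd unset noout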
>> On 13/05/14 11:03, Christian Balzer wrote:
>>
>> I'm clearly talking to myself, but whatever.
>>
>> For Greg, I've played with all the pertinent journal and filestore
>> options and TCP nodelay, no changes at all.
>>
>> Is there anybody on this ML who's running a Ceph cluster with a fast
>> network and a FAST filestore, so like me with a big HW cache in front
>> of RAID/JBODs, or using SSDs for final storage?
>>
>> If so, what results do you get out of the fio statement below, per OSD?
>> In my case, with 4 OSDs and 3200 IOPS, that's about 800 IOPS per OSD,
>> which is of course vastly faster than the normal individual HDDs could
>> do.
>>
>> So I'm wondering if I'm hitting some inherent limitation of how fast a
>> single OSD (as in the software) can handle IOPS, given that everything
>> else has been ruled out from where I stand.
>>
>> This would also explain why none of the option changes or the use of
>> RBD caching has any measurable effect in the test case below.
>> As in, a slow OSD, aka a single HDD with the journal on the same disk,
>> would clearly benefit from even the small 32MB standard RBD cache,
>> while in my test case the only time the caching becomes noticeable is
>> if I increase the cache size to something larger than the test data
>> size. ^o^
>>
>> On the other hand, if people here regularly get thousands or tens of
>> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>
>> Christian
>>
>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>
>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>
>> Oh, I didn't notice that. I bet you aren't getting the expected
>> throughput on the RAID array with OSD access patterns, and that's
>> applying back pressure on the journal.
>>
>> In the "a picture is worth a thousand words" tradition, I give you this
>> iostat -x output taken during a fio run:
>>
>> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>>           50.82   0.00   19.43    0.17   0.00  29.58
>>
>> Device: rrqm/s wrqm/s  r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sda       0.00  51.50 0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>> sdb       0.00   0.00 0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>> sdc       0.00   5.00 0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>> sdd       0.00   6.50 0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>
>> The %user CPU utilization is pretty much entirely the 2 OSD processes;
>> note the nearly complete absence of iowait.
>>
>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
>> Look at these numbers: the lack of queues, the low wait and service
>> times (this is in ms), plus the overall utilization.
>>
>> The only conclusion I can draw from these numbers and the network
>> results below is that the latency happens within the OSD processes.
>>
>> Regards,
>>
>> Christian
>>
>> When I suggested other tests, I meant with and without Ceph. One
>> particular one is OSD bench. That should be interesting to try at a
>> variety of block sizes. You could also try running RADOS bench and
>> smalliobench at a few different sizes.
>> -Greg
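A hedged sketch of the tests Greg mentions, roughly matching the 4k workload discussed in this thread (the OSD id, pool name and sizes are examples, not from the thread):

# per-OSD backend write test; with no arguments this writes 1GB in 4MB chunks
ceph tell osd.0 bench

# whole-stack test against a pool: 60 seconds of 4k object writes, 32 in flight
rados bench -p rbd 60 write -b 4096 -t 32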
>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
>> wrote:
>>
>> Hi Christian,
>>
>> Have you tried without RAID6, to have more OSDs? (How many disks do you
>> have behind the RAID6?)
>>
>> Also, I know that direct IOs can be quite slow with Ceph, so maybe you
>> can try without --direct=1, and also enable rbd_cache:
>>
>> ceph.conf
>> [client]
>> rbd cache = true
>>
>> ----- Original message -----
>> From: "Christian Balzer" <chibi at gol.com>
>> To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
>> Sent: Thursday, 8 May 2014 04:49:16
>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>
>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>
>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com>
>> wrote:
>>
>> Hello,
>>
>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
>> journals are on (separate) DC 3700s, the actual OSDs are RAID6 behind
>> an Areca 1882 with 4GB of cache.
>>
>> Running this fio:
>>
>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>>
>> results in:
>>
>> 30k IOPS on the journal SSD (as expected)
>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
>> 3200 IOPS from a VM using userspace RBD
>> 2900 IOPS from a host kernelspace mounted RBD
>>
>> When running the fio from the VM RBD, the utilization of the journals
>> is about 20% (2400 IOPS) and the OSDs are bored at 2% (1500 IOPS after
>> some obvious merging). The OSD processes are quite busy, reading well
>> over 200% on atop, but the system is not CPU or otherwise resource
>> starved at that moment.
>>
>> Running multiple instances of this test from several VMs on different
>> hosts changes nothing, as in the aggregated IOPS for the whole cluster
>> will still be around 3200 IOPS.
>>
>> Now clearly RBD has to deal with latency here, but the network is IPoIB
>> with the associated low latency, and the journal SSDs are the
>> (consistently) fastest ones around.
>>
>> I guess what I am wondering about is whether this is normal and to be
>> expected, or if not, where all that potential performance got lost.
>>
>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>
>> Yes, but going down to 32 doesn't change things one iota.
>> Also note the multiple instances I mention up there, so that would be
>> 256 IOs at a time, coming from different hosts over different links,
>> and nothing changes.
>>
>> that's about 40ms of latency per op (for userspace RBD), which seems
>> awfully long. You should check what your client-side objecter settings
>> are; it might be limiting you to fewer outstanding ops than that.
>>
>> Googling for "client-side objecter" gives a few hits on ceph-devel and
>> the bug tracker, and nothing at all as far as configuration options are
>> concerned. Care to enlighten me where one can find those?
>>
>> Also note the kernelspace (3.13 if it matters) speed, which is very
>> much in the same (junior league) ballpark.
>>
>> If it's available to you, testing with Firefly or even master would be
>> interesting; there's some performance work that should reduce
>> latencies.
>>
>> Not an option, this is going into production next week.
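For reference, the "client-side objecter settings" Greg alludes to are presumably the objecter throttles sketched below. The option names exist in Ceph, but the values are only illustrative and I am not certain both knobs are exposed in 0.72, so treat this as a guess rather than a recommendation:

[client]
# reportedly defaults to 1024 outstanding ops per client
objecter inflight ops = 2048
# reportedly defaults to roughly 100MB of outstanding data per client
objecter inflight op bytes = 268435456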
>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
>> certainly doesn't require 40ms/op, so you should probably run a wider
>> array of experiments to try and figure out where it's coming from.
>>
>> I think we can rule out the network; NPtcp gives me:
>> ---
>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>> ---
>>
>> For comparison, at about 512KB it reaches maximum throughput and still
>> isn't that laggy:
>> ---
>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>> ---
>>
>> So with the network performing as well as my lengthy experience with
>> IPoIB led me to believe, what else is there to look at? The storage
>> nodes perform just as expected, as indicated by the local fio tests.
>>
>> That pretty much leaves only Ceph/RBD to look at, and I'm not really
>> sure what experiments I should run on that. ^o^
>>
>> Regards,
>>
>> Christian
>>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> chibi at gol.com         Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
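The NPtcp figures quoted above come from NetPIPE's TCP tester; a minimal run looks roughly like this (the hostname is a placeholder, and the binary/package name may differ by distribution):

# on the receiving storage node
NPtcp

# on the sending node, pointing at the receiver
NPtcp -h storage-node-1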