Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:

> Hi Christian,
>
> I missed this thread, haven't been reading the list that well the last
> weeks.
>
> You already know my setup, since we discussed it in an earlier thread.
> I don't have a fast backing store, but I see the slow IOPS when doing
> randwrite inside the VM, with rbd cache. Still running dumpling here
> though.
>
Nods, I do recall that thread.

> A thought struck me that I could test with a pool that consists of
> OSDs that have tmpfs-based disks. I think I have a bit more latency
> than your IPoIB, but I've pushed 100k IOPS with the same network
> devices before. This would verify whether the problem is with the
> journal disks. I'll also try to run the journal devices in tmpfs as
> well, as that would test purely Ceph itself.
>
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the
actual filestore at around 5%) I'd expect Ceph to be the culprit.

> I'll get back to you with the results, hopefully I'll manage to get
> them done during this night.
>
Looking forward to that. ^^
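
In case it saves you some fiddling, here is a rough sketch of how the
journal of a single OSD could be pointed at tmpfs for such a throw-away
test. The OSD id, mount point and sizes are just placeholders, and this
is obviously only for a scratch cluster you don't mind losing:

---
# RAM-backed mount for the disposable journal (example values only)
mkdir -p /mnt/tmpfs-journal
mount -t tmpfs -o size=2G tmpfs /mnt/tmpfs-journal

# stop the OSD and retire its current journal cleanly
/etc/init.d/ceph stop osd.0
ceph-osd -i 0 --flush-journal

# ceph.conf, in the [osd.0] section:
#   osd journal = /mnt/tmpfs-journal/journal
#   osd journal size = 1024

# create the new journal and bring the OSD back
ceph-osd -i 0 --mkjournal
/etc/init.d/ceph start osd.0
---

A tmpfs-backed filestore for a pure test pool could be prepared along
the same lines, with the OSD data directory living on another tmpfs
mount.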
Christian

> Cheers,
> Josef
>
> On 13/05/14 11:03, Christian Balzer wrote:
> > I'm clearly talking to myself, but whatever.
> >
> > For Greg, I've played with all the pertinent journal and filestore
> > options and TCP nodelay, no changes at all.
> >
> > Is there anybody on this ML who's running a Ceph cluster with a fast
> > network and FAST filestore, so like me with a big HW cache in front
> > of RAIDs/JBODs or using SSDs for final storage?
> >
> > If so, what results do you get out of the fio statement below per
> > OSD? In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per
> > OSD, which is of course vastly faster than the normal individual
> > HDDs could do.
> >
> > So I'm wondering if I'm hitting some inherent limitation of how fast
> > a single OSD (as in the software) can handle IOPS, given that
> > everything else has been ruled out from where I stand.
> >
> > This would also explain why none of the option changes or the use of
> > RBD caching has any measurable effect in the test case below.
> > As in, a slow OSD aka single HDD with journal on the same disk would
> > clearly benefit from even the small 32MB standard RBD cache, while
> > in my test case the only time the caching becomes noticeable is if I
> > increase the cache size to something larger than the test data size.
> > ^o^
> >
> > On the other hand, if people here regularly get thousands or tens of
> > thousands of IOPS per OSD with the appropriate HW, I'm stumped.
> >
> > Christian
> >
> > On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
> >
> >> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >>
> >>> Oh, I didn't notice that. I bet you aren't getting the expected
> >>> throughput on the RAID array with OSD access patterns, and that's
> >>> applying back pressure on the journal.
> >>>
> >> In the "a picture is worth a thousand words" tradition, I give you
> >> this iostat -x output taken during a fio run:
> >>
> >> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>           50.82    0.00   19.43    0.17    0.00   29.58
> >>
> >> Device: rrqm/s wrqm/s  r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> >> sda       0.00  51.50 0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
> >> sdb       0.00   0.00 0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
> >> sdc       0.00   5.00 0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
> >> sdd       0.00   6.50 0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
> >>
> >> The %user CPU utilization is pretty much entirely the 2 OSD
> >> processes; note the nearly complete absence of iowait.
> >>
> >> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
> >> Look at these numbers: the lack of queues, the low wait and service
> >> times (in ms) plus the overall utilization.
> >>
> >> The only conclusion I can draw from these numbers and the network
> >> results below is that the latency happens within the OSD processes.
> >>
> >> Regards,
> >>
> >> Christian
> >>
> >>> When I suggested other tests, I meant with and without Ceph. One
> >>> particular one is OSD bench. That should be interesting to try at
> >>> a variety of block sizes. You could also try running RADOS bench
> >>> and smalliobench at a few different sizes.
> >>> -Greg
> >>>
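
For what it's worth, the bench runs Greg suggests above could look
roughly like the following; the pool name and the sizes are only
examples, so adjust them to taste:

---
# raw OSD bench with the defaults (1 GB written in 4 MB chunks);
# smaller totals/write sizes can be passed as extra arguments
ceph tell osd.0 bench

# RADOS bench against a scratch pool: 60 s of 4 KB writes, 16 in flight
rados bench -p testpool 60 write -b 4096 -t 16
---

Repeating the rados bench run with -b 65536 and -b 4194304 should show
whether per-op overhead or raw bandwidth is the limiting factor.
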
> >>> On Wednesday, May 7, 2014, Alexandre DERUMIER
> >>> <aderumier at odiso.com> wrote:
> >>>
> >>>> Hi Christian,
> >>>>
> >>>> Have you tried without RAID6, to get more OSDs?
> >>>> (How many disks do you have behind the RAID6?)
> >>>>
> >>>> Also, I know that direct IOs can be quite slow with ceph,
> >>>>
> >>>> maybe you can try without --direct=1
> >>>>
> >>>> and also enable rbd_cache
> >>>>
> >>>> ceph.conf
> >>>> [client]
> >>>> rbd cache = true
> >>>>
> >>>> ----- Original Message -----
> >>>>
> >>>> From: "Christian Balzer" <chibi at gol.com>
> >>>> To: "Gregory Farnum" <greg at inktank.com>,
> >>>> ceph-users at lists.ceph.com
> >>>> Sent: Thursday, 8 May 2014 04:49:16
> >>>> Subject: Re: Slow IOPS on RBD compared to journal and
> >>>> backing devices
> >>>>
> >>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> >>>>
> >>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> >>>>> <chibi at gol.com> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> >>>>>> The journals are on (separate) DC 3700s, the actual OSDs are
> >>>>>> RAID6 behind an Areca 1882 with 4GB of cache.
> >>>>>>
> >>>>>> Running this fio:
> >>>>>>
> >>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> >>>>>> --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> >>>>>> --iodepth=128
> >>>>>>
> >>>>>> results in:
> >>>>>>
> >>>>>> 30k IOPS on the journal SSD (as expected)
> >>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no
> >>>>>> surprise there)
> >>>>>> 3200 IOPS from a VM using userspace RBD
> >>>>>> 2900 IOPS from a host kernelspace mounted RBD
> >>>>>>
> >>>>>> When running the fio from the VM RBD the utilization of the
> >>>>>> journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> >>>>>> (1500 IOPS after some obvious merging).
> >>>>>> The OSD processes are quite busy, reading well over 200% on
> >>>>>> atop, but the system is not CPU or otherwise resource starved
> >>>>>> at that moment.
> >>>>>>
> >>>>>> Running multiple instances of this test from several VMs on
> >>>>>> different hosts changes nothing, as in the aggregated IOPS for
> >>>>>> the whole cluster will still be around 3200 IOPS.
> >>>>>>
> >>>>>> Now clearly RBD has to deal with latency here, but the network
> >>>>>> is IPoIB with the associated low latency and the journal SSDs
> >>>>>> are the (consistently) fastest ones around.
> >>>>>>
> >>>>>> I guess what I am wondering about is if this is normal and to
> >>>>>> be expected, or if not, where all that potential performance
> >>>>>> got lost.
> >>>>> Hmm, with 128 IOs at a time (I believe I'm reading that
> >>>>> correctly?)
> >>>> Yes, but going down to 32 doesn't change things one iota.
> >>>> Also note the multiple instances I mention up there, so that
> >>>> would be 256 IOs at a time, coming from different hosts over
> >>>> different links, and nothing changes.
> >>>>
> >>>>> that's about 40ms of latency per op (for userspace RBD), which
> >>>>> seems awfully long. You should check what your client-side
> >>>>> objecter settings are; it might be limiting you to fewer
> >>>>> outstanding ops than that.
> >>>> Googling for client-side objecter gives a few hits on ceph-devel
> >>>> and bugs and nothing at all as far as configuration options are
> >>>> concerned. Care to enlighten me where one can find those?
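
For the archive: the client-side knobs in question appear to be the
objecter throttles, which can be raised in the [client] section of
ceph.conf. The values below are merely an illustration, not a tuning
recommendation:

---
[client]
    # in-flight throttles of the client objecter
    # (defaults: 1024 ops and 100 MB in flight)
    objecter inflight ops = 2048
    objecter inflight op bytes = 268435456
---

With 128 (or even 256) outstanding 4 KB IOs neither default should be
the bottleneck here, which again points at latency inside the OSDs
rather than client-side throttling.
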
> >>>>
> >>>> Also note the kernelspace (3.13 if it matters) speed, which is
> >>>> very much in the same (junior league) ballpark.
> >>>>
> >>>>> If it's available to you, testing with Firefly or even master
> >>>>> would be interesting; there's some performance work that should
> >>>>> reduce latencies.
> >>>>>
> >>>> Not an option, this is going into production next week.
> >>>>
> >>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
> >>>>> certainly doesn't require 40ms/op, so you should probably run a
> >>>>> wider array of experiments to try and figure out where it's
> >>>>> coming from.
> >>>> I think we can rule out the network, NPtcp gives me:
> >>>> ---
> >>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> >>>> ---
> >>>>
> >>>> For comparison, at about 512KB it reaches maximum throughput and
> >>>> still isn't that laggy:
> >>>> ---
> >>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> >>>> ---
> >>>>
> >>>> So with the network performing as well as my lengthy experience
> >>>> with IPoIB led me to believe, what else is there to look at?
> >>>> The storage nodes perform just as expected, indicated by the
> >>>> local fio tests.
> >>>>
> >>>> That pretty much leaves only Ceph/RBD to look at and I'm not
> >>>> really sure what experiments I should run on that. ^o^
> >>>>
> >>>> Regards,
> >>>>
> >>>> Christian
> >>>>
> >>>>> -Greg
> >>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>>>>
> >>>>
> >>>> --
> >>>> Christian Balzer        Network/Systems Engineer
> >>>> chibi at gol.com           Global OnLine Japan/Fusion Communications
> >>>> http://www.gol.com/
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users at lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/