Hello,

On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi Christian,
>
> I'm going to test a full ssd cluster in the coming months,
> I'll send the results to the mailing list.
>
Looking forward to that.

> Have you tried to use 1 osd per physical disk? (without raid6)
>
No, if you look back to the last year December "Sanity check..." thread by
me, it gives the reasons.
In short, highest density (thus a replication of 2 and, to make that safe,
RAID6 underneath) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.

> Maybe there is a bottleneck in the osd daemon,
> and using one osd daemon per disk could help.
>
It might, but at the IOPS I'm seeing anybody using SSDs for file storage
should have screamed out already.
Also, given the CPU usage I'm seeing during that test run, such a setup
would probably require 32+ cores.

Christian

> ----- Original Message -----
>
> From: "Christian Balzer" <chibi at gol.com>
> To: ceph-users at lists.ceph.com
> Sent: Tuesday, 13 May 2014 11:03:47
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>
> I'm clearly talking to myself, but whatever.
>
> For Greg, I've played with all the pertinent journal and filestore
> options and TCP nodelay, no changes at all.
>
> Is there anybody on this ML who's running a Ceph cluster with a fast
> network and FAST filestore, so like me with a big HW cache in front of
> RAID/JBODs, or using SSDs for final storage?
>
> If so, what results do you get out of the fio statement below per OSD?
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
> which is of course vastly faster than the normal individual HDDs could
> do.
>
> So I'm wondering if I'm hitting some inherent limitation of how fast a
> single OSD (as in the software) can handle IOPS, given that everything
> else has been ruled out from where I stand.
>
> This would also explain why none of the option changes or the use of
> RBD caching has any measurable effect in the test case below.
> As in, a slow OSD, aka a single HDD with the journal on the same disk,
> would clearly benefit from even the small 32MB standard RBD cache,
> while in my test case the only time the caching becomes noticeable is
> if I increase the cache size to something larger than the test data
> size. ^o^
>
> On the other hand, if people here regularly get thousands or tens of
> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>
> Christian
>
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >
> > > Oh, I didn't notice that. I bet you aren't getting the expected
> > > throughput on the RAID array with OSD access patterns, and that's
> > > applying back pressure on the journal.
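A side note before the numbers below, since "all the pertinent journal and
filestore options" further up is admittedly vague: what I mean are ceph.conf
overrides of roughly this kind on the storage nodes, plus the RBD cache bits
on the client side. The option names are the standard ones, but the values
here are purely illustrative examples and not a record of what was actually
tested:

---
[osd]
filestore max sync interval = 10
filestore min sync interval = 1
filestore queue max ops = 5000
filestore op threads = 4
journal max write entries = 5000
journal queue max ops = 5000
ms tcp nodelay = true

[client]
rbd cache = true
rbd cache size = 536870912    # larger than the 400MB fio test file
---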
> >
> > In the "a picture is worth a thousand words" tradition, I give you
> > this iostat -x output taken during a fio run:
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           50.82    0.00   19.43    0.17    0.00   29.58
> >
> > Device:  rrqm/s  wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00   51.50   0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
> > sdb        0.00    0.00   0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
> > sdc        0.00    5.00   0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
> > sdd        0.00    6.50   0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
> >
> > The %user CPU utilization is pretty much entirely the 2 OSD processes;
> > note the nearly complete absence of iowait.
> >
> > sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
> > Look at these numbers: the lack of queues, the low wait and service
> > times (these are in ms), plus the overall utilization.
> >
> > The only conclusion I can draw from these numbers and the network
> > results below is that the latency happens within the OSD processes.
> >
> > Regards,
> >
> > Christian
> >
> > > When I suggested other tests, I meant with and without Ceph. One
> > > particular one is OSD bench. That should be interesting to try at a
> > > variety of block sizes. You could also try running RADOS bench and
> > > smalliobench at a few different sizes.
> > > -Greg
> > >
> > > On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> wrote:
> > >
> > > > Hi Christian,
> > > >
> > > > Have you tried without raid6, to have more osds?
> > > > (how many disks do you have behind the raid6?)
> > > >
> > > > Also, I know that direct ios can be quite slow with ceph,
> > > > maybe you can try without --direct=1
> > > >
> > > > and also enable rbd_cache
> > > >
> > > > ceph.conf
> > > > [client]
> > > > rbd cache = true
> > > >
> > > > ----- Original Message -----
> > > >
> > > > From: "Christian Balzer" <chibi at gol.com>
> > > > To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
> > > > Sent: Thursday, 8 May 2014 04:49:16
> > > > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> > > >
> > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> > > >
> > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> > > > > > The journals are on (separate) DC 3700s, the actual OSDs are
> > > > > > RAID6 behind an Areca 1882 with 4GB of cache.
> > > > > >
> > > > > > Running this fio:
> > > > > >
> > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > > > --iodepth=128
> > > > > >
> > > > > > results in:
> > > > > >
> > > > > > 30k IOPS on the journal SSD (as expected)
> > > > > > 110k IOPS on the OSD (it fits neatly into the cache, no
> > > > > > surprise there)
> > > > > > 3200 IOPS from a VM using userspace RBD
> > > > > > 2900 IOPS from a host kernelspace mounted RBD
> > > > > >
> > > > > > When running the fio from the VM RBD the utilization of the
> > > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > > > (1500 IOPS after some obvious merging).
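A quick arithmetic aside on the numbers just quoted, since it comes up again
in Greg's reply further down: at an iodepth of 128 and roughly 3200 IOPS,
each request is outstanding for an average of 128 / 3200 = 0.04 s, i.e.
about 40 ms per op, even though the journals and backing devices shown
above are nearly idle.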
> > > > > > The OSD processes are quite busy, reading well over 200% on
> > > > > > atop, but the system is not CPU or otherwise resource starved
> > > > > > at that moment.
> > > > > >
> > > > > > Running multiple instances of this test from several VMs on
> > > > > > different hosts changes nothing, as in the aggregated IOPS for
> > > > > > the whole cluster will still be around 3200 IOPS.
> > > > > >
> > > > > > Now clearly RBD has to deal with latency here, but the network
> > > > > > is IPoIB with the associated low latency and the journal SSDs
> > > > > > are the (consistently) fastest ones around.
> > > > > >
> > > > > > I guess what I am wondering about is if this is normal and to
> > > > > > be expected, or if not, where all that potential performance
> > > > > > got lost.
> > > > >
> > > > > Hmm, with 128 IOs at a time (I believe I'm reading that
> > > > > correctly?)
> > > >
> > > > Yes, but going down to 32 doesn't change things one iota.
> > > > Also note the multiple instances I mention up there, so that would
> > > > be 256 IOs at a time, coming from different hosts over different
> > > > links, and nothing changes.
> > > >
> > > > > that's about 40ms of latency per op (for userspace RBD), which
> > > > > seems awfully long. You should check what your client-side
> > > > > objecter settings are; it might be limiting you to fewer
> > > > > outstanding ops than that.
> > > >
> > > > Googling for client-side objecter gives a few hits on ceph devel
> > > > and bugs and nothing at all as far as configuration options are
> > > > concerned. Care to enlighten me where one can find those?
> > > >
> > > > Also note the kernelspace (3.13 if it matters) speed, which is
> > > > very much in the same (junior league) ballpark.
> > > >
> > > > > If it's available to you, testing with Firefly or even master
> > > > > would be interesting; there's some performance work that should
> > > > > reduce latencies.
> > > >
> > > > Not an option, this is going into production next week.
> > > >
> > > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > > > > certainly doesn't require 40ms/op, so you should probably run a
> > > > > wider array of experiments to try and figure out where it's
> > > > > coming from.
> > > >
> > > > I think we can rule out the network, NPtcp gives me:
> > > > ---
> > > > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> > > > ---
> > > >
> > > > For comparison, at about 512KB it reaches maximum throughput and
> > > > still isn't that laggy:
> > > > ---
> > > > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> > > > ---
> > > >
> > > > So with the network performing as well as my lengthy experience
> > > > with IPoIB led me to believe, what else is there to look at?
> > > > The storage nodes perform just as expected, as indicated by the
> > > > local fio tests.
> > > >
> > > > That pretty much leaves only Ceph/RBD to look at, and I'm not
> > > > really sure what experiments I should run on that. ^o^
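For anyone who wants to reproduce the additional experiments Greg suggests
further up, this is roughly what I understand them to be; treat the exact
syntax as a sketch from memory rather than gospel, and I'm leaving out
smalliobench since I'd have to look up its options:

---
# raw write benchmark on a single OSD, no RBD involved
ceph tell osd.0 bench

# RADOS-level 4KB writes against the rbd pool, 32 in flight, for 30 seconds
rados -p rbd bench 30 write -b 4096 -t 32
---

As for the client-side objecter settings mentioned above, the knobs seem to
be "objecter inflight ops" and "objecter inflight op bytes" in the [client]
section; with their defaults of 1024 ops and 100MB they shouldn't be the
limiting factor at an iodepth of 128, but I note them here since they are
so hard to find.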
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > > >
> > > > > -Greg
> > > > > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > >
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users at lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >


--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/