Hello,

On Thu, 08 May 2014 11:31:54 +0200 (CEST) Alexandre DERUMIER wrote:

> > The OSD processes are quite busy, reading well over 200% on atop, but
> > the system is not CPU or otherwise resource starved at that moment.
>
> osd use 2 threads by default (could explain the 200%)
>
> maybe can you try to put in ceph.conf
>
> osd op threads = 8
>
Already at 10 (for some weeks now). ^o^
How that setting relates to the actual 220 threads per OSD process is a
mystery for another day.

> (don't know how many cores you have)
>
6. The OSDs get busy (CPU, not IOWAIT), but there still are 1-2 cores
idle at that point.

> ----- Original Message -----
>
> From: "Christian Balzer" <chibi at gol.com>
> To: ceph-users at lists.ceph.com
> Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
> Sent: Thursday, 8 May 2014 08:52:15
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>
> On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote:
>
> > Stupid question: Is your areca 4GB cache shared between ssd journal
> > and osd?
> >
> Not a stupid question.
> I made that mistake about 3 years ago in a DRBD setup, OS and activity
> log SSDs on the same controller as the storage disks.
>
> > or only used by the osds?
> >
> Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
> I keep repeating myself, neither the journal devices nor the OSDs seem
> to be under any particular load or pressure (utilization) according to
> iostat and atop during the tests.
>
> Christian
>
> > ----- Original Message -----
> >
> > From: "Christian Balzer" <chibi at gol.com>
> > To: ceph-users at lists.ceph.com
> > Sent: Thursday, 8 May 2014 08:26:33
> > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> >
> > Hello,
> >
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >
> > > Oh, I didn't notice that. I bet you aren't getting the expected
> > > throughput on the RAID array with OSD access patterns, and that's
> > > applying back pressure on the journal.
> > >
> > I doubt that based on what I see in terms of local performance and
> > actual utilization figures according to iostat and atop during the
> > tests.
> >
> > But if that were to be true, how would one see if that's the case, as
> > in where in the plethora of data from:
> >
> > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
> >
> > is the data I'd be looking for?
> >
> > > When I suggested other tests, I meant with and without Ceph. One
> > > particular one is OSD bench. That should be interesting to try at a
> > > variety of block sizes. You could also try running RADOS bench and
> > > smalliobench at a few different sizes.
> > >
> > I already did the local tests, as in w/o Ceph, see the original mail
> > below.
> >
> > And you might recall me doing rados benches as well in another thread
> > 2 weeks ago or so.
> >
> > In either case, osd benching gives me:
> > ---
> > # time ceph tell osd.0 bench
> > { "bytes_written": 1073741824,
> > "blocksize": 4194304,
> > "bytes_per_sec": "247102026.000000"}
> >
> > real 0m4.483s
> > ---
> > This is quite a bit slower than this particular SSD (200GB DC 3700)
> > should be able to write, but I will let that slide.
> > Note that it is the journal SSD that gets under pressure here (nearly
> > 900% util) while the OSD is bored at around 15%. Which is no surprise,
> > as it can write data at up to 1600MB/s.
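(Side note on my perf dump question above: the journal-related figures
seem to live under the "filestore" section of that dump. A minimal
sketch of how to pull just those out, assuming jq is installed and that
the counter names in this 0.72 build match:
---
# show only the journal/apply counters instead of the whole dump
# (assumption: these names exist under "filestore" in this release)
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump |
  jq '.filestore | {journal_queue_ops, journal_queue_bytes,
                    op_queue_ops, op_queue_bytes,
                    journal_latency, apply_latency, journal_full}'
---
Queue counters sitting at their configured maxima together with a
climbing apply_latency would presumably be the kind of back pressure
Greg suspects.)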
> >
> > at 4k blocks we see:
> > ---
> > # time ceph tell osd.0 bench 1073741824 4096
> > { "bytes_written": 1073741824,
> > "blocksize": 4096,
> > "bytes_per_sec": "9004316.000000"}
> >
> > real 1m59.368s
> > ---
> > Here we get a more balanced picture between journal and storage
> > utilization, hovering around 40-50%.
> > So clearly not overtaxing either component.
> > And yet, this looks like 2100 IOPS to me, if my math is half right.
> >
> > Rados at 4k gives us this:
> > ---
> > Total time run: 30.912786
> > Total writes made: 44490
> > Write size: 4096
> > Bandwidth (MB/sec): 5.622
> >
> > Stddev Bandwidth: 3.31452
> > Max bandwidth (MB/sec): 9.92578
> > Min bandwidth (MB/sec): 0
> > Average Latency: 0.0444653
> > Stddev Latency: 0.121887
> > Max latency: 2.80917
> > Min latency: 0.001958
> > ---
> > So this is even worse, just about 1500 IOPS.
> >
> > Regards,
> >
> > Christian
> >
> > > -Greg
> > >
> > > On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
> > > wrote:
> > >
> > > > Hi Christian,
> > > >
> > > > Have you tried without raid6, to have more osds?
> > > > (how many disks do you have behind the raid6?)
> > > >
> > > > Also, I know that direct ios can be quite slow with ceph,
> > > >
> > > > maybe you can try without --direct=1
> > > >
> > > > and also enable rbd_cache
> > > >
> > > > ceph.conf
> > > > [client]
> > > > rbd cache = true
> > > >
> > > > ----- Original Message -----
> > > >
> > > > From: "Christian Balzer" <chibi at gol.com>
> > > > To: "Gregory Farnum" <greg at inktank.com>,
> > > > ceph-users at lists.ceph.com
> > > > Sent: Thursday, 8 May 2014 04:49:16
> > > > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> > > >
> > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> > > >
> > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> > > > > <chibi at gol.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> > > > > > The journals are on (separate) DC 3700s, the actual OSDs are
> > > > > > RAID6 behind an Areca 1882 with 4GB of cache.
> > > > > >
> > > > > > Running this fio:
> > > > > >
> > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > > > --iodepth=128
> > > > > >
> > > > > > results in:
> > > > > >
> > > > > > 30k IOPS on the journal SSD (as expected)
> > > > > > 110k IOPS on the OSD (it fits neatly into the cache, no
> > > > > > surprise there)
> > > > > > 3200 IOPS from a VM using userspace RBD
> > > > > > 2900 IOPS from a host kernelspace mounted RBD
> > > > > >
> > > > > > When running the fio from the VM RBD the utilization of the
> > > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > > > (1500 IOPS after some obvious merging).
> > > > > > The OSD processes are quite busy, reading well over 200% on
> > > > > > atop, but the system is not CPU or otherwise resource starved
> > > > > > at that moment.
> > > > > >
> > > > > > Running multiple instances of this test from several VMs on
> > > > > > different hosts changes nothing, as in the aggregated IOPS for
> > > > > > the whole cluster will still be around 3200 IOPS.
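(For reference, the host kernelspace number above was presumably obtained
with the same job pointed at the mapped device instead of a VM disk; a
sketch, assuming a mapping at /dev/rbd0:
---
# same 4k randwrite job, aimed at the (hypothetical) kernel RBD mapping
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 \
    --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k \
    --iodepth=128
---
That it lands in the same ballpark as the librbd number is what makes a
QEMU-side bottleneck look unlikely.)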
> > > > > >
> > > > > > Now clearly RBD has to deal with latency here, but the network
> > > > > > is IPoIB with the associated low latency and the journal SSDs
> > > > > > are the (consistently) fastest ones around.
> > > > > >
> > > > > > I guess what I am wondering about is if this is normal and to
> > > > > > be expected or if not where all that potential performance got
> > > > > > lost.
> > > > >
> > > > > Hmm, with 128 IOs at a time (I believe I'm reading that
> > > > > correctly?)
> > > > Yes, but going down to 32 doesn't change things one iota.
> > > > Also note the multiple instances I mention up there, so that would
> > > > be 256 IOs at a time, coming from different hosts over different
> > > > links and nothing changes.
> > > >
> > > > > that's about 40ms of latency per op (for userspace RBD), which
> > > > > seems awfully long. You should check what your client-side
> > > > > objecter settings are; it might be limiting you to fewer
> > > > > outstanding ops than that.
> > > >
> > > > Googling for client-side objecter gives a few hits on ceph devel
> > > > and bugs and nothing at all as far as configuration options are
> > > > concerned. Care to enlighten me where one can find those?
> > > >
> > > > Also note the kernelspace (3.13 if it matters) speed, which is
> > > > very much in the same (junior league) ballpark.
> > > >
> > > > > If
> > > > > it's available to you, testing with Firefly or even master would
> > > > > be interesting; there's some performance work that should
> > > > > reduce latencies.
> > > >
> > > > Not an option, this is going into production next week.
> > > >
> > > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > > > > certainly doesn't require 40ms/op, so you should probably run a
> > > > > wider array of experiments to try and figure out where it's
> > > > > coming from.
> > > >
> > > > I think we can rule out the network, NPtcp gives me:
> > > > ---
> > > > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> > > > ---
> > > >
> > > > For comparison at about 512KB it reaches maximum throughput and
> > > > still isn't that laggy:
> > > > ---
> > > > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> > > > ---
> > > >
> > > > So with the network performing as well as my lengthy experience
> > > > with IPoIB led me to believe, what else is there to look at?
> > > > The storage nodes perform just as expected, indicated by the local
> > > > fio tests.
> > > >
> > > > That pretty much leaves only Ceph/RBD to look at and I'm not
> > > > really sure what experiments I should run on that. ^o^
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > > >
> > > > > -Greg
> > > > > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com         Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users at lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
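P.S.: If the "client-side objecter settings" Greg refers to are the
objecter inflight throttles, a minimal ceph.conf sketch would look like
the following (option names and defaults are my assumption for this
vintage, not confirmed tuning advice):
---
[client]
# assumption: these are the objecter throttles in question; the values
# shown are the documented defaults
objecter inflight ops = 1024
objecter inflight op bytes = 104857600

# what a running client actually uses can be checked via its admin
# socket, if one is configured (path below is only an example):
# ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config show | grep objecter
---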