Slow IOPS on RBD compared to journal and backing devices

> The OSD processes are quite busy, reading well over 200% on atop, but
> the system is not CPU or otherwise resource starved at that moment.

OSDs use 2 op threads by default (which could explain the 200%).

Maybe you can try putting this in ceph.conf:

osd op threads = 8


(I don't know how many cores you have.)
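
For what it's worth, here is a minimal sketch of what I mean (the [osd]
section header and the restart are just my assumption of how you would
apply it; adjust the thread count to your core count):

[osd]
osd op threads = 8

and then restart the OSD daemons so the new setting takes effect.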



----- Original Message ----- 

From: "Christian Balzer" <chibi at gol.com> 
To: ceph-users at lists.ceph.com 
Cc: "Alexandre DERUMIER" <aderumier at odiso.com> 
Sent: Thursday, 8 May 2014 08:52:15 
Subject: Re: Slow IOPS on RBD compared to journal and backing devices 

On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote: 

> Stupid question: Is your Areca 4GB cache shared between the SSD journal and 
> the OSD? 
> 
Not a stupid question. 
I made that mistake about 3 years ago in a DRBD setup, OS and activity log 
SSDs on the same controller as the storage disks. 

> or only used by the OSDs? 
> 
Only used by the OSDs (2 in total, 11x3TB HDD in RAID6). 
I keep repeating myself: neither the journal devices nor the OSDs seem to 
be under any particular load or pressure (utilization) according to iostat 
and atop during the tests. 

Christian 

> 
> 
> ----- Original Message ----- 
> 
> From: "Christian Balzer" <chibi at gol.com> 
> To: ceph-users at lists.ceph.com 
> Sent: Thursday, 8 May 2014 08:26:33 
> Subject: Re: Slow IOPS on RBD compared to journal and backing 
> devices 
> 
> 
> Hello, 
> 
> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 
> 
> > Oh, I didn't notice that. I bet you aren't getting the expected 
> > throughput on the RAID array with OSD access patterns, and that's 
> > applying back pressure on the journal. 
> > 
> I doubt that based on what I see in terms of local performance and 
> actual utilization figures according to iostat and atop during the 
> tests. 
> 
> But if that were true, how would one see that it is the case? That is, where 
> in the plethora of data from: 
> 
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump 
> 
> is the data I'd be looking for? 
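> 
> (For reference, this is roughly how I have been skimming that output; I am 
> assuming the latency counters are the interesting part, and python here is 
> only pretty-printing the JSON: 
> --- 
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | python -m json.tool | grep -i latency 
> --- 
> but which of those counters actually matters is exactly my question.) 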
> 
> > When I suggested other tests, I meant with and without Ceph. One 
> > particular one is OSD bench. That should be interesting to try at a 
> > variety of block sizes. You could also try running RADOS bench and 
> > smalliobench at a few different sizes. 
> > 
> I already did the local tests, as in w/o Ceph, see the original mail 
> below. 
> 
> And you might recall me doing rados benches as well in another thread 2 
> weeks ago or so. 
> 
> In either case, osd benching gives me: 
> --- 
> # time ceph tell osd.0 bench 
> { "bytes_written": 1073741824, 
> "blocksize": 4194304, 
> "bytes_per_sec": "247102026.000000"} 
> 
> 
> real 0m4.483s 
> --- 
> This is quite a bit slower than this particular SSD (200GB DC 3700) 
> should be able to write, but I will let that slide. 
> Note that it is the journal SSD that gets under pressure here (nearly 
> 900% util) while the OSD is bored at around 15%. Which is no surprise, 
> as it can write data at up to 1600MB/s. 
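> (For the record, the 247102026 bytes/s above is roughly 236 MB/s, or about 
> 59 of those 4MB writes per second.) 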
> 
> at 4k blocks we see: 
> --- 
> # time ceph tell osd.0 bench 1073741824 4096 
> { "bytes_written": 1073741824, 
> "blocksize": 4096, 
> "bytes_per_sec": "9004316.000000"} 
> 
> 
> real 1m59.368s 
> --- 
> Here we get a more balanced picture between journal and storage 
> utilization, hovering around 40-50%. 
> So clearly not overtaxing either component. 
> But yet, this looks like 2100 IOPS to me, if my math is half right. 
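> (Spelled out: 9004316 bytes/s divided by 4096 bytes per write is about 
> 2200 writes per second, so the ballpark holds.) 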
> 
> Rados at 4k gives us this: 
> --- 
> Total time run: 30.912786 
> Total writes made: 44490 
> Write size: 4096 
> Bandwidth (MB/sec): 5.622 
> 
> Stddev Bandwidth: 3.31452 
> Max bandwidth (MB/sec): 9.92578 
> Min bandwidth (MB/sec): 0 
> Average Latency: 0.0444653 
> Stddev Latency: 0.121887 
> Max latency: 2.80917 
> Min latency: 0.001958 
> --- 
> So this is even worse, just about 1500 IOPS. 
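> (Again spelled out: 44490 writes in 30.9 seconds is about 1440 writes per 
> second.) 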
> 
> Regards, 
> 
> Christian 
> 
> > -Greg 
> > 
> > On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> 
> > wrote: 
> > 
> > > Hi Christian, 
> > > 
> > > Have you tried without raid6, to have more OSDs? 
> > > (How many disks do you have behind the raid6?) 
> > > 
> > > 
> > > Also, I know that direct IOs can be quite slow with ceph, 
> > > 
> > > so maybe you can try without --direct=1 (see the example below) 
> > > 
> > > and also enable rbd_cache 
> > > 
> > > ceph.conf 
> > > [client] 
> > > rbd cache = true 
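> > > 
> > > The example I mean is just your fio job from further down with 
> > > --direct=1 dropped (I am assuming all the other flags stay the same): 
> > > 
> > > fio --size=400m --ioengine=libaio --invalidate=1 --numjobs=1 
> > > --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128 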
> > > 
> > > 
> > > 
> > > 
> > > ----- Original Message ----- 
> > > 
> > > From: "Christian Balzer" <chibi at gol.com> 
> > > To: "Gregory Farnum" <greg at inktank.com>, 
> > > ceph-users at lists.ceph.com 
> > > Sent: Thursday, 8 May 2014 04:49:16 
> > > Subject: Re: Slow IOPS on RBD compared to journal and 
> > > backing devices 
> > > 
> > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 
> > > 
> > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > > > <chibi at gol.com> 
> > > wrote: 
> > > > > 
> > > > > Hello, 
> > > > > 
> > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. 
> > > > > The journals are on (separate) DC 3700s, the actual OSDs are 
> > > > > RAID6 behind an Areca 1882 with 4GB of cache. 
> > > > > 
> > > > > Running this fio: 
> > > > > 
> > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k 
> > > > > --iodepth=128 
> > > > > 
> > > > > results in: 
> > > > > 
> > > > > 30k IOPS on the journal SSD (as expected) 
> > > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise there) 
> > > > > 3200 IOPS from a VM using userspace RBD 
> > > > > 2900 IOPS from a host kernelspace mounted RBD 
> > > > > 
> > > > > When running the fio from the VM RBD the utilization of the 
> > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2% 
> > > > > (1500 IOPS after some obvious merging). 
> > > > > The OSD processes are quite busy, reading well over 200% on 
> > > > > atop, but the system is not CPU or otherwise resource starved at 
> > > > > that moment. 
> > > > > 
> > > > > Running multiple instances of this test from several VMs on 
> > > > > different hosts changes nothing, as in the aggregated IOPS for 
> > > > > the whole cluster will still be around 3200 IOPS. 
> > > > > 
> > > > > Now clearly RBD has to deal with latency here, but the network 
> > > > > is IPoIB with the associated low latency and the journal SSDs 
> > > > > are the (consistently) fastest ones around. 
> > > > > 
> > > > > I guess what I am wondering about is if this is normal and to be 
> > > > > expected or if not where all that potential performance got 
> > > > > lost. 
> > > > 
> > > > Hmm, with 128 IOs at a time (I believe I'm reading that 
> > > > correctly?) 
> > > Yes, but going down to 32 doesn't change things one iota. 
> > > Also note the multiple instances I mention up there, so that would 
> > > be 256 IOs at a time, coming from different hosts over different 
> > > links and nothing changes. 
> > > 
> > > > that's about 40ms of latency per op (for userspace RBD), which 
> > > > seems awfully long. You should check what your client-side 
> > > > objecter settings are; it might be limiting you to fewer 
> > > > outstanding ops than that. 
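> > > > (Back-of-the-envelope: 128 outstanding ops at ~3200 IOPS works out to 
> > > > about 40 ms per op on average.) 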
> > > 
> > > Googling for client-side objecter gives a few hits on ceph devel and 
> > > bugs and nothing at all as far as configuration options are 
> > > concerned. Care to enlighten me where one can find those? 
> > > 
> > > Also note the kernelspace (3.13 if it matters) speed, which is very 
> > > much in the same (junior league) ballpark. 
> > > 
> > > > If 
> > > > it's available to you, testing with Firefly or even master would 
> > > > be interesting; there's some performance work that should reduce 
> > > > latencies. 
> > > > 
> > > Not an option, this is going into production next week. 
> > > 
> > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster 
> > > > certainly doesn't require 40ms/op, so you should probably run a 
> > > > wider array of experiments to try and figure out where it's coming 
> > > > from. 
> > > 
> > > I think we can rule out the network, NPtcp gives me: 
> > > --- 
> > > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec 
> > > --- 
> > > 
> > > For comparison at about 512KB it reaches maximum throughput and 
> > > still isn't that laggy: 
> > > --- 
> > > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec 
> > > --- 
> > > 
> > > So with the network performing as well as my lengthy experience with 
> > > IPoIB led me to believe, what else is there to look at? 
> > > The storage nodes perform just as expected, indicated by the local 
> > > fio tests. 
> > > 
> > > That pretty much leaves only Ceph/RBD to look at and I'm not really 
> > > sure what experiments I should run on that. ^o^ 
> > > 
> > > Regards, 
> > > 
> > > Christian 
> > > 
> > > > -Greg 
> > > > Software Engineer #42 @ http://inktank.com | http://ceph.com 
> > > > 
> > > 
> > > 
> > > -- 
> > > Christian Balzer Network/Systems Engineer 
> > > chibi at gol.com Global OnLine Japan/Fusion 
> > > Communications http://www.gol.com/ 
> > > _______________________________________________ 
> > > ceph-users mailing list 
> > > ceph-users at lists.ceph.com 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > > 
> > 
> > 
> 
> 


-- 
Christian Balzer Network/Systems Engineer 
chibi at gol.com Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 

