Hello,

On Thu, 08 May 2014 11:31:54 +0200 (CEST) Alexandre DERUMIER wrote:

> > The OSD processes are quite busy, reading well over 200% on atop, but
> > the system is not CPU or otherwise resource starved at that moment.
>
> osd use 2 threads by default (could explain the 200%)
>
> maybe can you try to put in ceph.conf
>
> osd op threads = 8
>
Already at 10 (for some weeks now). ^o^
How that setting relates to the actual 220 threads per OSD process is a
mystery for another day.

> (don't know how many cores you have)
>
6. The OSDs get busy (CPU, not IOWAIT), but there still are 1-2 cores
idle at that point.

> ----- Original Message -----
>
> From: "Christian Balzer" <chibi at gol.com>
> To: ceph-users at lists.ceph.com
> Cc: "Alexandre DERUMIER" <aderumier at odiso.com>
> Sent: Thursday, 8 May 2014 08:52:15
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>
> On Thu, 08 May 2014 08:41:54 +0200 (CEST) Alexandre DERUMIER wrote:
>
> > Stupid question: Is your areca 4GB cache shared between ssd journal
> > and osd?
> >
> Not a stupid question.
> I made that mistake about 3 years ago in a DRBD setup, OS and activity
> log SSDs on the same controller as the storage disks.
>
> > or only used by the osds?
> >
> Only used by the OSDs (2 in total, 11x3TB HDD in RAID6).
> I keep repeating myself, neither the journal devices nor the OSDs seem
> to be under any particular load or pressure (utilization) according to
> iostat and atop during the tests.
>
> Christian
>
> > ----- Original Message -----
> >
> > From: "Christian Balzer" <chibi at gol.com>
> > To: ceph-users at lists.ceph.com
> > Sent: Thursday, 8 May 2014 08:26:33
> > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> >
> > Hello,
> >
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >
> > > Oh, I didn't notice that. I bet you aren't getting the expected
> > > throughput on the RAID array with OSD access patterns, and that's
> > > applying back pressure on the journal.
> > >
> > I doubt that based on what I see in terms of local performance and
> > actual utilization figures according to iostat and atop during the
> > tests.
> >
> > But if that were to be true, how would one see if that's the case, as
> > in where in the plethora of data from:
> >
> > ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
> >
> > is the data I'd be looking for?
> >
> > > When I suggested other tests, I meant with and without Ceph. One
> > > particular one is OSD bench. That should be interesting to try at a
> > > variety of block sizes. You could also try running RADOS bench and
> > > smalliobench at a few different sizes.
> > >
> > I already did the local tests, as in w/o Ceph, see the original mail
> > below.
> >
> > And you might recall me doing rados benches as well in another thread
> > 2 weeks ago or so.
> >
> > In either case, osd benching gives me:
> > ---
> > # time ceph tell osd.0 bench
> > { "bytes_written": 1073741824,
> > "blocksize": 4194304,
> > "bytes_per_sec": "247102026.000000"}
> >
> > real 0m4.483s
> > ---
> > This is quite a bit slower than this particular SSD (200GB DC 3700)
> > should be able to write, but I will let that slide.
> > Note that it is the journal SSD that gets under pressure here (nearly
> > 900% util) while the OSD is bored at around 15%. Which is no surprise,
> > as it can write data at up to 1600MB/s.
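(Side note on my perf dump question above: the journal-related figures
seem to live under the "filestore" section of that dump. A minimal
sketch of how to pull just those out, assuming jq is installed and that
the counter names in this 0.72 build match:
---
# show only the journal/apply counters instead of the whole dump
# (assumption: these names exist under "filestore" in this release)
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump |
  jq '.filestore | {journal_queue_ops, journal_queue_bytes,
                    op_queue_ops, op_queue_bytes,
                    journal_latency, apply_latency, journal_full}'
---
Queue counters sitting at their configured maxima together with a
climbing apply_latency would presumably be the kind of back pressure
Greg suspects.)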
> >
> > at 4k blocks we see:
> > ---
> > # time ceph tell osd.0 bench 1073741824 4096
> > { "bytes_written": 1073741824,
> > "blocksize": 4096,
> > "bytes_per_sec": "9004316.000000"}
> >
> > real 1m59.368s
> > ---
> > Here we get a more balanced picture between journal and storage
> > utilization, hovering around 40-50%.
> > So clearly not overtaxing either component.
> > And yet, this looks like 2100 IOPS to me, if my math is half right.
> >
> > Rados at 4k gives us this:
> > ---
> > Total time run: 30.912786
> > Total writes made: 44490
> > Write size: 4096
> > Bandwidth (MB/sec): 5.622
> >
> > Stddev Bandwidth: 3.31452
> > Max bandwidth (MB/sec): 9.92578
> > Min bandwidth (MB/sec): 0
> > Average Latency: 0.0444653
> > Stddev Latency: 0.121887
> > Max latency: 2.80917
> > Min latency: 0.001958
> > ---
> > So this is even worse, just about 1500 IOPS.
> >
> > Regards,
> >
> > Christian
> >
> > > -Greg
> > >
> > > On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com>
> > > wrote:
> > >
> > > > Hi Christian,
> > > >
> > > > Have you tried without raid6, to have more osds?
> > > > (how many disks do you have behind the raid6?)
> > > >
> > > > Also, I know that direct ios can be quite slow with ceph,
> > > >
> > > > maybe you can try without --direct=1
> > > >
> > > > and also enable rbd_cache
> > > >
> > > > ceph.conf
> > > > [client]
> > > > rbd cache = true
> > > >
> > > > ----- Original Message -----
> > > >
> > > > From: "Christian Balzer" <chibi at gol.com>
> > > > To: "Gregory Farnum" <greg at inktank.com>,
> > > > ceph-users at lists.ceph.com
> > > > Sent: Thursday, 8 May 2014 04:49:16
> > > > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> > > >
> > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> > > >
> > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> > > > > <chibi at gol.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> > > > > > The journals are on (separate) DC 3700s, the actual OSDs are
> > > > > > RAID6 behind an Areca 1882 with 4GB of cache.
> > > > > >
> > > > > > Running this fio:
> > > > > >
> > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > > > --iodepth=128
> > > > > >
> > > > > > results in:
> > > > > >
> > > > > > 30k IOPS on the journal SSD (as expected)
> > > > > > 110k IOPS on the OSD (it fits neatly into the cache, no
> > > > > > surprise there)
> > > > > > 3200 IOPS from a VM using userspace RBD
> > > > > > 2900 IOPS from a host kernelspace mounted RBD
> > > > > >
> > > > > > When running the fio from the VM RBD the utilization of the
> > > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > > > (1500 IOPS after some obvious merging).
> > > > > > The OSD processes are quite busy, reading well over 200% on
> > > > > > atop, but the system is not CPU or otherwise resource starved
> > > > > > at that moment.
> > > > > >
> > > > > > Running multiple instances of this test from several VMs on
> > > > > > different hosts changes nothing, as in the aggregated IOPS for
> > > > > > the whole cluster will still be around 3200 IOPS.
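(For reference, the host kernelspace number above was presumably obtained
with the same job pointed at the mapped device instead of a VM disk; a
sketch, assuming a mapping at /dev/rbd0:
---
# same 4k randwrite job, aimed at the (hypothetical) kernel RBD mapping
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --invalidate=1 \
    --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k \
    --iodepth=128
---
That it lands in the same ballpark as the librbd number is what makes a
QEMU-side bottleneck look unlikely.)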
> > > > > >
> > > > > > Now clearly RBD has to deal with latency here, but the network
> > > > > > is IPoIB with the associated low latency and the journal SSDs
> > > > > > are the (consistently) fastest ones around.
> > > > > >
> > > > > > I guess what I am wondering about is if this is normal and to
> > > > > > be expected or if not where all that potential performance got
> > > > > > lost.
> > > > >
> > > > > Hmm, with 128 IOs at a time (I believe I'm reading that
> > > > > correctly?)
> > > > Yes, but going down to 32 doesn't change things one iota.
> > > > Also note the multiple instances I mention up there, so that would
> > > > be 256 IOs at a time, coming from different hosts over different
> > > > links and nothing changes.
> > > >
> > > > > that's about 40ms of latency per op (for userspace RBD), which
> > > > > seems awfully long. You should check what your client-side
> > > > > objecter settings are; it might be limiting you to fewer
> > > > > outstanding ops than that.
> > > >
> > > > Googling for client-side objecter gives a few hits on ceph devel
> > > > and bugs and nothing at all as far as configuration options are
> > > > concerned. Care to enlighten me where one can find those?
> > > >
> > > > Also note the kernelspace (3.13 if it matters) speed, which is
> > > > very much in the same (junior league) ballpark.
> > > >
> > > > > If
> > > > > it's available to you, testing with Firefly or even master would
> > > > > be interesting; there's some performance work that should
> > > > > reduce latencies.
> > > >
> > > > Not an option, this is going into production next week.
> > > >
> > > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > > > > certainly doesn't require 40ms/op, so you should probably run a
> > > > > wider array of experiments to try and figure out where it's
> > > > > coming from.
> > > >
> > > > I think we can rule out the network, NPtcp gives me:
> > > > ---
> > > > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> > > > ---
> > > >
> > > > For comparison at about 512KB it reaches maximum throughput and
> > > > still isn't that laggy:
> > > > ---
> > > > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> > > > ---
> > > >
> > > > So with the network performing as well as my lengthy experience
> > > > with IPoIB led me to believe, what else is there to look at?
> > > > The storage nodes perform just as expected, indicated by the local
> > > > fio tests.
> > > >
> > > > That pretty much leaves only Ceph/RBD to look at and I'm not
> > > > really sure what experiments I should run on that. ^o^
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > > >
> > > > > -Greg
> > > > > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com         Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users at lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
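P.S.: If the "client-side objecter settings" Greg refers to are the
objecter inflight throttles, a minimal ceph.conf sketch would look like
the following (option names and defaults are my assumption for this
vintage, not confirmed tuning advice):
---
[client]
# assumption: these are the objecter throttles in question; the values
# shown are the documented defaults
objecter inflight ops = 1024
objecter inflight op bytes = 104857600

# what a running client actually uses can be checked via its admin
# socket, if one is configured (path below is only an example):
# ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config show | grep objecter
---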