Slow IOPS on RBD compared to journal and backing devices


 



>>It might, but at the IOPS I'm seeing anybody using SSD for file storage 
>>should have screamed out already. 
>>Also given the CPU usage I'm seeing during that test run such a setup 
>>would probably require 32+ cores. 

Just found this:

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

page12:

" Note: As of Ceph Dumpling release (10/2013), a per-OSD read performance is approximately 4,000 IOPS and a per node limit of around 
35,000 IOPS when doing reads directly from pagecache. This appears to indicate that Ceph can make good use of spinning disks for data 
storage and may benefit from SSD backed OSDs, though may also be limited on high performance SSDs."


Maybe Inktank could comment on the 4,000 IOPS per OSD figure?

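For reference, the per-OSD and per-op figures being debated in this thread follow from simple arithmetic (Little's law: average latency ≈ queue depth / IOPS). A quick sketch using the numbers quoted in the messages below (4 OSDs, iodepth 128, ~3200 aggregate IOPS); the helper names are my own, not from any Ceph tool:

```python
# Back-of-the-envelope math for the numbers discussed in this thread.
# Little's law: concurrency = throughput * latency, so with a fixed,
# saturated queue the average per-op latency is queue_depth / IOPS.

def per_osd_iops(total_iops, num_osds):
    """Aggregate IOPS spread evenly across all OSDs."""
    return total_iops / num_osds

def implied_latency_ms(queue_depth, total_iops):
    """Average per-op latency (ms) implied by a saturated queue."""
    return queue_depth / total_iops * 1000.0

print(per_osd_iops(3200, 4))          # -> 800.0 IOPS per OSD
print(implied_latency_ms(128, 3200))  # -> 40.0 ms per op (Greg's estimate below)
```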

----- Original Message ----- 

From: "Christian Balzer" <chibi at gol.com> 
To: ceph-users at lists.ceph.com 
Cc: "Alexandre DERUMIER" <aderumier at odiso.com> 
Sent: Tuesday, 13 May 2014 11:51:37 
Subject: Re: Slow IOPS on RBD compared to journal and backing devices 


Hello, 

On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: 

> Hi Christian, 
> 
> I'm going to test a full SSD cluster in the coming months, 
> I'll send the results to the mailing list. 
> 
Looking forward to that. 

> 
> Have you tried using 1 OSD per physical disk? (without RAID6) 
> 
No, if you look back to last year's December "Sanity check..." thread 
by me, it gives the reasons. 
In short: highest density (thus a replication factor of 2, made safe by 
the underlying RAID6) and operational maintainability (it is a remote 
data center, so replacing broken disks is a pain). 

That cluster is fast enough for my purposes and that fio test isn't a 
typical load for it when it goes into production. 
But for designing a general purpose or high performance Ceph cluster in 
the future I'd really love to have this mystery solved. 

> Maybe there is a bottleneck in the OSD daemon, 
> and using one OSD daemon per disk could help. 
> 
It might, but at the IOPS I'm seeing anybody using SSD for file storage 
should have screamed out already. 
Also given the CPU usage I'm seeing during that test run such a setup 
would probably require 32+ cores. 

Christian 

> 
> 
> 
> ----- Original Message ----- 
> 
> From: "Christian Balzer" <chibi at gol.com> 
> To: ceph-users at lists.ceph.com 
> Sent: Tuesday, 13 May 2014 11:03:47 
> Subject: Re: Slow IOPS on RBD compared to journal and backing 
> devices 
> 
> 
> I'm clearly talking to myself, but whatever. 
> 
> For Greg, I've played with all the pertinent journal and filestore 
> options and TCP nodelay, no changes at all. 
> 
> Is there anybody on this ML who's running a Ceph cluster with a fast 
> network and FAST filestore, so like me with a big HW cache in front of a 
> RAID/JBODs or using SSDs for final storage? 
> 
> If so, what results do you get out of the fio statement below per OSD? 
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, 
> which is of course vastly faster than the normal individual HDDs could 
> do. 
> 
> So I'm wondering if I'm hitting some inherent limitation of how fast a 
> single OSD (as in the software) can handle IOPS, given that everything 
> else has been ruled out from where I stand. 
> 
> This would also explain why none of the option changes or the use of 
> RBD caching has any measurable effect in the test case below. 
> As in, a slow OSD aka single HDD with journal on the same disk would 
> clearly benefit from even the small 32MB standard RBD cache, while in my 
> test case the only time the caching becomes noticeable is if I increase 
> the cache size to something larger than the test data size. ^o^ 
> 
> On the other hand, if people here regularly get thousands or tens of 
> thousands of IOPS per OSD with the appropriate HW, I'm stumped. 
> 
> Christian 
> 
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: 
> 
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 
> > 
> > > Oh, I didn't notice that. I bet you aren't getting the expected 
> > > throughput on the RAID array with OSD access patterns, and that's 
> > > applying back pressure on the journal. 
> > > 
> > 
> > In the a "picture" being worth a thousand words tradition, I give you 
> > this iostat -x output taken during a fio run: 
> > 
> > avg-cpu:  %user  %nice %system %iowait  %steal  %idle 
> >           50.82   0.00   19.43    0.17    0.00  29.58 
> > 
> > Device:  rrqm/s wrqm/s  r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util 
> > sda        0.00  51.50 0.00 1633.50   0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40 
> > sdb        0.00   0.00 0.00 1240.50   0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00 
> > sdc        0.00   5.00 0.00 2468.50   0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00 
> > sdd        0.00   6.50 0.00 1913.00   0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60 
> > 
> > The %user CPU utilization is pretty much entirely the 2 OSD processes; 
> > note the nearly complete absence of iowait. 
> > 
> > sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs. 
> > Look at these numbers: the lack of queues, the low wait and service 
> > times (these are in ms), plus the overall utilization. 
> > 
> > The only conclusion I can draw from these numbers and the network 
> > results below is that the latency happens within the OSD processes. 
> > 
> > Regards, 
> > 
> > Christian 
> > > When I suggested other tests, I meant with and without Ceph. One 
> > > particular one is OSD bench. That should be interesting to try at a 
> > > variety of block sizes. You could also try running RADOS bench and 
> > > smalliobench at a few different sizes. 
> > > -Greg 
> > > 
> > > On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> 
> > > wrote: 
> > > 
> > > > Hi Christian, 
> > > > 
> > > > Have you tried without RAID6, to have more OSDs? 
> > > > (how many disks do you have behind the RAID6?) 
> > > > 
> > > > 
> > > > Also, I know that direct I/O can be quite slow with Ceph, 
> > > > 
> > > > so maybe you can try without --direct=1 
> > > > 
> > > > and also enable rbd_cache 
> > > > 
> > > > ceph.conf 
> > > > [client] 
> > > > rbd cache = true 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message ----- 
> > > > 
> > > > From: "Christian Balzer" <chibi at gol.com> 
> > > > To: "Gregory Farnum" <greg at inktank.com>, 
> > > > ceph-users at lists.ceph.com 
> > > > Sent: Thursday, 8 May 2014 04:49:16 
> > > > Subject: Re: Slow IOPS on RBD compared to journal and 
> > > > backing devices 
> > > > 
> > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 
> > > > 
> > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > > > > <chibi at gol.com> 
> > > > wrote: 
> > > > > > 
> > > > > > Hello, 
> > > > > > 
> > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. 
> > > > > > The journals are on (separate) DC 3700s, the actual OSDs are 
> > > > > > RAID6 behind an Areca 1882 with 4GB of cache. 
> > > > > > 
> > > > > > Running this fio: 
> > > > > > 
> > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k 
> > > > > > --iodepth=128 
> > > > > > 
> > > > > > results in: 
> > > > > > 
> > > > > > 30k IOPS on the journal SSD (as expected) 
> > > > > > 110k IOPS on the OSD (it fits neatly into the cache, no 
> > > > > > surprise there) 3200 IOPS from a VM using userspace RBD 
> > > > > > 2900 IOPS from a host kernelspace mounted RBD 
> > > > > > 
> > > > > > When running the fio from the VM RBD the utilization of the 
> > > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2% 
> > > > > > (1500 IOPS after some obvious merging). 
> > > > > > The OSD processes are quite busy, reading well over 200% on 
> > > > > > atop, but the system is not CPU or otherwise resource starved 
> > > > > > at that moment. 
> > > > > > 
> > > > > > Running multiple instances of this test from several VMs on 
> > > > > > different hosts changes nothing, as in the aggregated IOPS for 
> > > > > > the whole cluster will still be around 3200 IOPS. 
> > > > > > 
> > > > > > Now clearly RBD has to deal with latency here, but the network 
> > > > > > is IPoIB with the associated low latency and the journal SSDs 
> > > > > > are the (consistently) fastest ones around. 
> > > > > > 
> > > > > > I guess what I am wondering about is if this is normal and to 
> > > > > > be expected or if not where all that potential performance got 
> > > > > > lost. 
> > > > > 
> > > > > Hmm, with 128 IOs at a time (I believe I'm reading that 
> > > > > correctly?) 
> > > > Yes, but going down to 32 doesn't change things one iota. 
> > > > Also note the multiple instances I mention up there, so that would 
> > > > be 256 IOs at a time, coming from different hosts over different 
> > > > links and nothing changes. 
> > > > 
> > > > > that's about 40ms of latency per op (for userspace RBD), which 
> > > > > seems awfully long. You should check what your client-side 
> > > > > objecter settings are; it might be limiting you to fewer 
> > > > > outstanding ops than that. 
> > > > 
> > > > Googling for client-side objecter gives a few hits on ceph devel 
> > > > and bugs and nothing at all as far as configuration options are 
> > > > concerned. Care to enlighten me where one can find those? 
> > > > 
> > > > Also note the kernelspace (3.13 if it matters) speed, which is 
> > > > very much in the same (junior league) ballpark. 
> > > > 
> > > > > If 
> > > > > it's available to you, testing with Firefly or even master would 
> > > > > be interesting; there's some performance work that should 
> > > > > reduce latencies. 
> > > > > 
> > > > Not an option, this is going into production next week. 
> > > > 
> > > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster 
> > > > > certainly doesn't require 40ms/op, so you should probably run a 
> > > > > wider array of experiments to try and figure out where it's 
> > > > > coming from. 
> > > > 
> > > > I think we can rule out the network, NPtcp gives me: 
> > > > --- 
> > > > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec 
> > > > --- 
> > > > 
> > > > For comparison at about 512KB it reaches maximum throughput and 
> > > > still isn't that laggy: 
> > > > --- 
> > > > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec 
> > > > --- 
> > > > 
> > > > So with the network performing as well as my lengthy experience 
> > > > with IPoIB led me to believe, what else is there to look at? 
> > > > The storage nodes perform just as expected, indicated by the local 
> > > > fio tests. 
> > > > 
> > > > That pretty much leaves only Ceph/RBD to look at and I'm not 
> > > > really sure what experiments I should run on that. ^o^ 
> > > > 
> > > > Regards, 
> > > > 
> > > > Christian 
> > > > 
> > > > > -Greg 
> > > > > Software Engineer #42 @ http://inktank.com | http://ceph.com 
> > > > > 
> > > > 
> > > > 
> > > > -- 
> > > > Christian Balzer Network/Systems Engineer 
> > > > chibi at gol.com Global OnLine Japan/Fusion 
> > > > Communications http://www.gol.com/ 
> > > > _______________________________________________ 
> > > > ceph-users mailing list 
> > > > ceph-users at lists.ceph.com 
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 


-- 
Christian Balzer Network/Systems Engineer 
chibi at gol.com Global OnLine Japan/Fusion Communications 
http://www.gol.com/ 

