Hello,

On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi Christian,
>
> I'm going to test a full ssd cluster in the coming months,
> I'll send the results to the mailing list.
>
Looking forward to that.

> Have you tried to use 1 osd per physical disk? (without raid6)
>
No, if you look back to the last year December "Sanity check..." thread by
me, it gives the reasons.
In short, highest density (thus a replication of 2 and, to make that safe,
RAID6 underneath) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).
That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production.
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.

> Maybe there is a bottleneck in the osd daemon,
> and using one osd daemon per disk could help.
>
It might, but at the IOPS I'm seeing anybody using SSDs for file storage
should have screamed out already.
Also, given the CPU usage I'm seeing during that test run, such a setup
would probably require 32+ cores.

Christian

> ----- Original Message -----
>
> From: "Christian Balzer" <chibi at gol.com>
> To: ceph-users at lists.ceph.com
> Sent: Tuesday, 13 May 2014 11:03:47
> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>
> I'm clearly talking to myself, but whatever.
>
> For Greg, I've played with all the pertinent journal and filestore
> options and TCP nodelay, no changes at all.
>
> Is there anybody on this ML who's running a Ceph cluster with a fast
> network and FAST filestore, so like me with a big HW cache in front of
> RAID/JBODs, or using SSDs for final storage?
>
> If so, what results do you get out of the fio statement below per OSD?
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
> which is of course vastly faster than the normal individual HDDs could
> do.
>
> So I'm wondering if I'm hitting some inherent limitation of how fast a
> single OSD (as in the software) can handle IOPS, given that everything
> else has been ruled out from where I stand.
>
> This would also explain why none of the option changes or the use of
> RBD caching has any measurable effect in the test case below.
> As in, a slow OSD, aka a single HDD with the journal on the same disk,
> would clearly benefit from even the small 32MB standard RBD cache,
> while in my test case the only time the caching becomes noticeable is
> if I increase the cache size to something larger than the test data
> size. ^o^
>
> On the other hand, if people here regularly get thousands or tens of
> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>
> Christian
>
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >
> > > Oh, I didn't notice that. I bet you aren't getting the expected
> > > throughput on the RAID array with OSD access patterns, and that's
> > > applying back pressure on the journal.
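A side note before the numbers below, since "all the pertinent journal and
filestore options" further up is admittedly vague: what I mean are ceph.conf
overrides of roughly this kind on the storage nodes, plus the RBD cache bits
on the client side. The option names are the standard ones, but the values
here are purely illustrative examples and not a record of what was actually
tested:

---
[osd]
filestore max sync interval = 10
filestore min sync interval = 1
filestore queue max ops = 5000
filestore op threads = 4
journal max write entries = 5000
journal queue max ops = 5000
ms tcp nodelay = true

[client]
rbd cache = true
rbd cache size = 536870912    # larger than the 400MB fio test file
---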
> >
> > In the "a picture is worth a thousand words" tradition, I give you
> > this iostat -x output taken during a fio run:
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           50.82    0.00   19.43    0.17    0.00   29.58
> >
> > Device:  rrqm/s  wrqm/s    r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00   51.50   0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
> > sdb        0.00    0.00   0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
> > sdc        0.00    5.00   0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
> > sdd        0.00    6.50   0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
> >
> > The %user CPU utilization is pretty much entirely the 2 OSD processes;
> > note the nearly complete absence of iowait.
> >
> > sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
> > Look at these numbers: the lack of queues, the low wait and service
> > times (these are in ms), plus the overall utilization.
> >
> > The only conclusion I can draw from these numbers and the network
> > results below is that the latency happens within the OSD processes.
> >
> > Regards,
> >
> > Christian
> >
> > > When I suggested other tests, I meant with and without Ceph. One
> > > particular one is OSD bench. That should be interesting to try at a
> > > variety of block sizes. You could also try running RADOS bench and
> > > smalliobench at a few different sizes.
> > > -Greg
> > >
> > > On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier at odiso.com> wrote:
> > >
> > > > Hi Christian,
> > > >
> > > > Have you tried without raid6, to have more osds?
> > > > (how many disks do you have behind the raid6?)
> > > >
> > > > Also, I know that direct ios can be quite slow with ceph,
> > > > maybe you can try without --direct=1
> > > >
> > > > and also enable rbd_cache
> > > >
> > > > ceph.conf
> > > > [client]
> > > > rbd cache = true
> > > >
> > > > ----- Original Message -----
> > > >
> > > > From: "Christian Balzer" <chibi at gol.com>
> > > > To: "Gregory Farnum" <greg at inktank.com>, ceph-users at lists.ceph.com
> > > > Sent: Thursday, 8 May 2014 04:49:16
> > > > Subject: Re: Slow IOPS on RBD compared to journal and backing devices
> > > >
> > > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> > > >
> > > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi at gol.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each.
> > > > > > The journals are on (separate) DC 3700s, the actual OSDs are
> > > > > > RAID6 behind an Areca 1882 with 4GB of cache.
> > > > > >
> > > > > > Running this fio:
> > > > > >
> > > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > > > --iodepth=128
> > > > > >
> > > > > > results in:
> > > > > >
> > > > > > 30k IOPS on the journal SSD (as expected)
> > > > > > 110k IOPS on the OSD (it fits neatly into the cache, no
> > > > > > surprise there)
> > > > > > 3200 IOPS from a VM using userspace RBD
> > > > > > 2900 IOPS from a host kernelspace mounted RBD
> > > > > >
> > > > > > When running the fio from the VM RBD the utilization of the
> > > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > > > (1500 IOPS after some obvious merging).
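A quick arithmetic aside on the numbers just quoted, since it comes up again
in Greg's reply further down: at an iodepth of 128 and roughly 3200 IOPS,
each request is outstanding for an average of 128 / 3200 = 0.04 s, i.e.
about 40 ms per op, even though the journals and backing devices shown
above are nearly idle.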
> > > > > > The OSD processes are quite busy, reading well over 200% on
> > > > > > atop, but the system is not CPU or otherwise resource starved
> > > > > > at that moment.
> > > > > >
> > > > > > Running multiple instances of this test from several VMs on
> > > > > > different hosts changes nothing, as in the aggregated IOPS for
> > > > > > the whole cluster will still be around 3200 IOPS.
> > > > > >
> > > > > > Now clearly RBD has to deal with latency here, but the network
> > > > > > is IPoIB with the associated low latency and the journal SSDs
> > > > > > are the (consistently) fastest ones around.
> > > > > >
> > > > > > I guess what I am wondering about is if this is normal and to
> > > > > > be expected, or if not, where all that potential performance
> > > > > > got lost.
> > > > >
> > > > > Hmm, with 128 IOs at a time (I believe I'm reading that
> > > > > correctly?)
> > > >
> > > > Yes, but going down to 32 doesn't change things one iota.
> > > > Also note the multiple instances I mention up there, so that would
> > > > be 256 IOs at a time, coming from different hosts over different
> > > > links, and nothing changes.
> > > >
> > > > > that's about 40ms of latency per op (for userspace RBD), which
> > > > > seems awfully long. You should check what your client-side
> > > > > objecter settings are; it might be limiting you to fewer
> > > > > outstanding ops than that.
> > > >
> > > > Googling for client-side objecter gives a few hits on ceph devel
> > > > and bugs and nothing at all as far as configuration options are
> > > > concerned. Care to enlighten me where one can find those?
> > > >
> > > > Also note the kernelspace (3.13 if it matters) speed, which is
> > > > very much in the same (junior league) ballpark.
> > > >
> > > > > If it's available to you, testing with Firefly or even master
> > > > > would be interesting; there's some performance work that should
> > > > > reduce latencies.
> > > >
> > > > Not an option, this is going into production next week.
> > > >
> > > > > But a well-tuned (or even default-tuned, I thought) Ceph cluster
> > > > > certainly doesn't require 40ms/op, so you should probably run a
> > > > > wider array of experiments to try and figure out where it's
> > > > > coming from.
> > > >
> > > > I think we can rule out the network, NPtcp gives me:
> > > > ---
> > > > 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
> > > > ---
> > > >
> > > > For comparison, at about 512KB it reaches maximum throughput and
> > > > still isn't that laggy:
> > > > ---
> > > > 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
> > > > ---
> > > >
> > > > So with the network performing as well as my lengthy experience
> > > > with IPoIB led me to believe, what else is there to look at?
> > > > The storage nodes perform just as expected, as indicated by the
> > > > local fio tests.
> > > >
> > > > That pretty much leaves only Ceph/RBD to look at, and I'm not
> > > > really sure what experiments I should run on that. ^o^
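For anyone who wants to reproduce the additional experiments Greg suggests
further up, this is roughly what I understand them to be; treat the exact
syntax as a sketch from memory rather than gospel, and I'm leaving out
smalliobench since I'd have to look up its options:

---
# raw write benchmark on a single OSD, no RBD involved
ceph tell osd.0 bench

# RADOS-level 4KB writes against the rbd pool, 32 in flight, for 30 seconds
rados -p rbd bench 30 write -b 4096 -t 32
---

As for the client-side objecter settings mentioned above, the knobs seem to
be "objecter inflight ops" and "objecter inflight op bytes" in the [client]
section; with their defaults of 1024 ops and 100MB they shouldn't be the
limiting factor at an iodepth of 128, but I note them here since they are
so hard to find.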
> > > >
> > > > Regards,
> > > >
> > > > Christian
> > > >
> > > > > -Greg
> > > > > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > > >
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi at gol.com          Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> > > >
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users at lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >


--
Christian Balzer        Network/Systems Engineer
chibi at gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/