[Single OSD performance on SSD] Can't go over 3, 2K IOPS

aderumier@xxxxxxxxx (Alexandre DERUMIER) · Tue, 02 Sep 2014 09:38:58 +0200 (CEST)

Hi Sebastien,

>>I got 6340 IOPS on a single OSD SSD. (journal and data on the same partition). 

Wouldn't it be better to have 2 partitions, 1 for the journal and 1 for the data?

(I'm thinking about filesystem write syncs)

----- Original Message -----

From: "Sebastien Han" <sebastien.han at enovance.com>
To: "Somnath Roy" <Somnath.Roy at sandisk.com>
Cc: ceph-users at lists.ceph.com
Sent: Tuesday, 2 September 2014 02:19:16
Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS

Mark and all, Ceph IOPS performance has definitely improved with Giant. 
With this version: ceph version 0.84-940-g3215c52 (3215c520e1306f50d0094b5646636c02456c9df4) on Debian 7.6 with Kernel 3.14-0. 

I got 6340 IOPS on a single OSD SSD. (journal and data on the same partition). 
So basically twice the amount of IOPS that I was getting with Firefly. 

Rand reads 4k went from 12431 to 10201, so I'm a bit disappointed here.
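
To put a number on the two changes above, here is a minimal sketch that uses only the figures quoted in this message. The helper name percent_change is made up for illustration, and the Firefly write baseline is only inferred from "twice the amount", not measured here.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# 4k random-read IOPS quoted above (earlier result vs. the Giant dev build).
rand_read_change = percent_change(12431, 10201)

# Assumption: "twice the amount of IOPS" taken literally, so the Firefly
# write figure is inferred as half of 6340 rather than read from a log.
implied_firefly_write_iops = 6340 / 2

print(f"4k random reads: {rand_read_change:+.1f}%")
print(f"implied Firefly write baseline: ~{implied_firefly_write_iops:.0f} IOPS")
```

The inferred baseline lands close to the figure in the subject line, if "3, 2K" is read as roughly 3.2K.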

The SSD is still under-utilised: 

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util 
sdp1 0.00 540.37 0.00 5902.30 0.00 47.14 16.36 0.87 0.15 0.00 0.15 0.07 40.15 
sdp2 0.00 0.00 0.00 4454.67 0.00 49.16 22.60 0.31 0.07 0.00 0.07 0.07 30.61 
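
As a sanity check on the iostat figures above, the sketch below parses the two rows and recomputes throughput from the request rate and avgrq-sz. It assumes avgrq-sz is in 512-byte sectors and that iostat's "MB" means 2^20 bytes, which is how sysstat reports them as far as I know; the function names are illustrative only.

```python
SECTOR_BYTES = 512  # assumption: avgrq-sz is reported in 512-byte sectors
MIB = 2 ** 20       # assumption: iostat's "MB" is 2**20 bytes

HEADER = ("Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util")
ROWS = [
    "sdp1 0.00 540.37 0.00 5902.30 0.00 47.14 16.36 0.87 0.15 0.00 0.15 0.07 40.15",
    "sdp2 0.00 0.00 0.00 4454.67 0.00 49.16 22.60 0.31 0.07 0.00 0.07 0.07 30.61",
]


def parse_row(header, row):
    """Map one iostat -x data row onto the column names from the header."""
    names = header.split()[1:]  # drop the "Device:" label
    device, *values = row.split()
    stats = dict(zip(names, (float(v) for v in values)))
    stats["device"] = device
    return stats


def implied_mib_per_s(stats):
    """Throughput implied by request rate times average request size."""
    requests_per_s = stats["r/s"] + stats["w/s"]
    return requests_per_s * stats["avgrq-sz"] * SECTOR_BYTES / MIB


parsed = [parse_row(HEADER, row) for row in ROWS]
for s in parsed:
    print(f'{s["device"]}: reported {s["wMB/s"]:.2f} MB/s, '
          f'implied {implied_mib_per_s(s):.2f} MB/s, util {s["%util"]:.1f}%')
```

Both rows are write-only, so the implied value can be compared directly with wMB/s; the %util column is what shows the device still has headroom.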

Thanks a ton for all your comments and assistance guys :). 

One last question for Sage (or others who might know): what's the status of the S2FS implementation? (Or maybe we are waiting for S2FS to provide atomic transactions?)
I tried to run the OSD on f2fs; however, ceph-osd mkfs got stuck on an xattr test:

fremovexattr(10, "user.test at 5848273") = 0 

On 01 Sep 2014, at 11:13, Sebastien Han <sebastien.han at enovance.com> wrote: 

> Mark, thanks a lot for experimenting with this for me.
> I'm gonna try master soon and will tell you how much I can get.
>
> It's interesting to see that using 2 SSDs brings more performance, even though both SSDs are under-utilized?
> They should be able to sustain both loads at the same time (journal and osd data). 
> 
> On 01 Sep 2014, at 09:51, Somnath Roy <Somnath.Roy at sandisk.com> wrote: 
> 
>> As I said, the 107K figure is with I/Os served from memory, not hitting the disk.
>> 
>> From: Jian Zhang [mailto:amberzhang86 at gmail.com] 
>> Sent: Sunday, August 31, 2014 8:54 PM 
>> To: Somnath Roy 
>> Cc: Haomai Wang; ceph-users at lists.ceph.com 
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS 
>> 
>> Somnath, 
>> on the small workload performance, 107k is higher than the theoretical IOPS of 520, any idea why? 
>> 
>> 
>> 
>>>> Single client is ~14K iops, but scaling as number of clients increases. 10 clients ~107K iops. ~25 cpu cores are used. 
>> 
>> 
>> 2014-09-01 11:52 GMT+08:00 Jian Zhang <amberzhang86 at gmail.com>: 
>> Somnath, 
>> on the small workload performance, 
>> 
>> 
>> 
>> 2014-08-29 14:37 GMT+08:00 Somnath Roy <Somnath.Roy at sandisk.com>: 
>> 
>> Thanks Haomai ! 
>> 
>> Here is some of the data from my setup. 
>> 
>> 
>> 
>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 
>> 
>> Set up: 
>> 
>> -------- 
>> 
>> 
>> 
>> 32 core cpu with HT enabled, 128 GB RAM, one SSD (both journal and data) -> one OSD. 5 client m/c with 12 core cpu and each running two instances of ceph_smalliobench (10 clients total). Network is 10GbE. 
>> 
>> 
>> 
>> Workload: 
>> 
>> ------------- 
>> 
>> 
>> 
>> Small workload: 20K objects of 4K size, and io_size is also 4K RR (random read). The intent is to serve the I/Os from memory so that it can uncover the performance problems within a single OSD.
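
A quick back-of-the-envelope check of why this workload is served from memory: the sketch below computes the working set from the object count and size given above and compares it with the 128 GB of RAM in the setup. It assumes "4K" means 4096 bytes and "GB" means 2^30 bytes; neither is stated explicitly in the thread.

```python
OBJECTS = 20_000
OBJECT_BYTES = 4 * 1024      # assumption: "4K" is 4096 bytes
RAM_BYTES = 128 * 2 ** 30    # assumption: "128 GB" is 128 * 2**30 bytes

working_set_bytes = OBJECTS * OBJECT_BYTES
working_set_mib = working_set_bytes / 2 ** 20
ram_fraction = working_set_bytes / RAM_BYTES

print(f"working set: {working_set_mib:.3f} MiB")
print(f"share of RAM: {ram_fraction:.4%}")
```

With a working set this small, repeated random reads should hit cached data, which fits the stated intent of isolating bottlenecks inside the OSD rather than in the SSD.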
>> 
>> 
>> 
>> Results from Firefly: 
>> 
>> -------------------------- 
>> 
>> 
>> 
>> Single client throughput is ~14K iops, but as the number of clients increases the aggregated throughput does not increase. 10 clients ~15K iops. ~9-10 cpu cores are used.
>> 
>> 
>> 
>> Result with latest master: 
>> 
>> ------------------------------ 
>> 
>> 
>> 
>> Single client is ~14K iops, but scaling as number of clients increases. 10 clients ~107K iops. ~25 cpu cores are used. 
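
The difference between the two results is easier to see as scaling efficiency (aggregate IOPS divided by what 10 perfectly scaling clients would deliver) and IOPS per core. This is only a sketch over the rounded figures quoted above; the 9.5-core value for Firefly is an assumed midpoint of "~9-10", and the function name is hypothetical.

```python
def scaling_efficiency(single_client_iops, aggregate_iops, clients):
    """Fraction of ideal linear scaling actually achieved."""
    return aggregate_iops / (single_client_iops * clients)


CLIENTS = 10
runs = {
    # cores for Firefly: assumed midpoint of the quoted "~9-10" range
    "firefly": {"single": 14_000, "aggregate": 15_000, "cores": 9.5},
    "master": {"single": 14_000, "aggregate": 107_000, "cores": 25},
}

results = {}
for name, r in runs.items():
    eff = scaling_efficiency(r["single"], r["aggregate"], CLIENTS)
    per_core = r["aggregate"] / r["cores"]
    results[name] = (eff, per_core)
    print(f"{name}: {eff:.0%} of linear scaling, ~{per_core:.0f} IOPS per core")
```

On these numbers master both scales much closer to linearly and gets more IOPS out of each core it uses.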
>> 
>> 
>> 
>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
>> 
>> 
>> 
>> 
>> 
>> More realistic workload: 
>> 
>> ----------------------------- 
>> 
>> Let's see how it performs when > 90% of the I/Os are served from disks.
>> 
>> Setup: 
>> 
>> ------- 
>> 
>> 40 cpu core server as a cluster node (single node cluster) with 64 GB RAM. 8 SSDs -> 8 OSDs. One similar node for monitor and rgw. Another node for client running fio/vdbench. 4 rbds are configured with the "noshare" option. 40 GbE network.
>> 
>> 
>> 
>> Workload: 
>> 
>> ------------ 
>> 
>> 
>> 
>> 8 SSDs are populated, so 8 * 800GB = ~6.4 TB of data. io_size = 4K RR.
>> 
>> 
>> 
>> Results from Firefly: 
>> 
>> ------------------------ 
>> 
>> 
>> 
>> Aggregated output while 4 rbd clients stress the cluster in parallel is ~20-25K IOPS; cpu cores used ~8-10 (maybe less, I can't remember precisely).
>> 
>> 
>> 
>> Results from latest master: 
>> 
>> -------------------------------- 
>> 
>> 
>> 
>> Aggregated output while 4 rbd clients stress the cluster in parallel is ~120K IOPS; cpu is 7% idle, i.e. ~37-38 cpu cores are used.
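
For the disk-bound case, the same figures can be broken down per OSD and per busy core. A rough sketch, assuming the 7% idle applies to all 40 cores and taking the midpoint of the quoted 20-25K range for Firefly; the variable names are illustrative.

```python
TOTAL_CORES = 40
IDLE_FRACTION = 0.07          # "cpu is 7% idle"
OSDS = 8
MASTER_IOPS = 120_000
# Assumption: midpoint of the quoted "~20-25K IOPS" for Firefly.
FIREFLY_IOPS = (20_000 + 25_000) / 2

busy_cores = TOTAL_CORES * (1 - IDLE_FRACTION)
iops_per_osd = MASTER_IOPS / OSDS
iops_per_busy_core = MASTER_IOPS / busy_cores
speedup = MASTER_IOPS / FIREFLY_IOPS

print(f"busy cores: ~{busy_cores:.1f}")
print(f"IOPS per OSD: ~{iops_per_osd:.0f}")
print(f"IOPS per busy core: ~{iops_per_busy_core:.0f}")
print(f"master vs Firefly: ~{speedup:.1f}x")
```

The busy-core estimate lines up with the "~37-38 cpu cores" quoted above, which suggests the node is close to CPU-bound at this throughput.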
>> 
>> 
>> 
>> Hope this helps. 
>> 
>> 
>> 
>> Thanks & Regards 
>> 
>> Somnath 
>> 
>> 
>> 
>> -----Original Message----- 
>> From: Haomai Wang [mailto:haomaiwang at gmail.com] 
>> Sent: Thursday, August 28, 2014 8:01 PM 
>> To: Somnath Roy 
>> Cc: Andrey Korolyov; ceph-users at lists.ceph.com 
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS 
>> 
>> 
>> Hi Roy, 
>> 
>> 
>> 
>> I have already looked through your merged code for "fdcache" and "optimizing for lfn_find/lfn_open". Could you share some performance improvement data for it? I fully agree with your direction; do you have any updates on it?
>> 
>> 
>> 
>> As for the messenger level, I have some very early work on it (https://github.com/yuyuyu101/ceph/tree/msg-event). It contains a new messenger implementation which supports different event mechanisms.
>> 
>> It looks like it will take at least one more week to make it work.
>> 
>> 
>> 
>> On Fri, Aug 29, 2014 at 5:48 AM, Somnath Roy <Somnath.Roy at sandisk.com> wrote: 
>> 
>>> Yes, from what I saw, the messenger-level bottleneck is still huge!
>> 
>>> Hopefully the RDMA messenger will resolve that, and the performance gain will be significant for reads (on SSDs). For writes we need to uncover the OSD bottlenecks first to take advantage of the improved upstream.
>> 
>>> In my experience, until you remove the very last bottleneck the performance improvement will not be visible, and that can be confusing because you might think that the upstream improvement you made is not valid (which is not the case).
>> 
>>> 
>> 
>>> Thanks & Regards 
>> 
>>> Somnath 
>> 
>>> -----Original Message----- 
>> 
>>> From: Andrey Korolyov [mailto:andrey at xdel.ru] 
>> 
>>> Sent: Thursday, August 28, 2014 12:57 PM 
>> 
>>> To: Somnath Roy 
>> 
>>> Cc: David Moreau Simard; Mark Nelson; ceph-users at lists.ceph.com 
>> 
>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>> 
>>> 
>> 
>>> On Thu, Aug 28, 2014 at 10:48 PM, Somnath Roy <Somnath.Roy at sandisk.com> wrote: 
>> 
>>>> Nope, this will not be back ported to Firefly I guess. 
>> 
>>>> 
>> 
>>>> Thanks & Regards 
>> 
>>>> Somnath 
>> 
>>>> 
>> 
>>> 
>> 
>>> Thanks for sharing this. The first thing that came to mind when I looked at this thread was your patches :)
>> 
>>> 
>> 
>>> If Giant incorporates them, both the RDMA support and those patches should give a huge performance boost for RDMA-enabled Ceph backnets.
>> 
>>> 
>> 
>> 
>>> _______________________________________________ 
>> 
>>> ceph-users mailing list 
>> 
>>> ceph-users at lists.ceph.com 
>> 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Best Regards, 
>> 
>> 
>> 
>> Wheat 
>> 
>> 
> 
> 
> Cheers.
> ----
> Sébastien Han
> Cloud Architect
>
> "Always give 100%. Unless you're giving blood."
>
> Phone: +33 (0)1 49 70 99 72
> Mail: sebastien.han at enovance.com
> Address : 11 bis, rue Roquépine - 75008 Paris
> Web : www.enovance.com - Twitter : @enovance
> 
