[Single OSD performance on SSD] Can't go over 3, 2K IOPS

It would be nice if you could post the results :)
Yup, gitbuilder packages are available for Debian 7.6 wheezy.
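If it helps, a sources.list line along these lines should do the trick (quoting from memory, so treat the exact path as an assumption and double-check it against gitbuilder.ceph.com):

deb http://gitbuilder.ceph.com/ceph-deb-wheezy-x86_64-basic/ref/master wheezy main

Then a simple apt-get update && apt-get install ceph should pull in the master packages.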


On 02 Sep 2014, at 17:55, Alexandre DERUMIER <aderumier at odiso.com> wrote:

> I'm going to install a small 3-node SSD test cluster next week.
> 
> I have some Intel S3500 and Crucial M550 drives.
> I'll try to bench them with Firefly and master.
> 
> Is a Debian wheezy gitbuilder repository available? (I'm a bit lazy to compile all the packages.)
> 
> 
> ----- Original Message -----
> 
> From: "Sebastien Han" <sebastien.han at enovance.com>
> To: "Alexandre DERUMIER" <aderumier at odiso.com>
> Cc: ceph-users at lists.ceph.com, "Cédric Lemarchand" <c.lemarchand at yipikai.org>
> Sent: Tuesday, 2 September 2014 15:25:05
> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
> 
> Well, the last time I ran two processes in parallel I got half of the total amount available, so ~1.7K IOPS per client.
> 
> On 02 Sep 2014, at 15:19, Alexandre DERUMIER <aderumier at odiso.com> wrote:
> 
>> 
>> Do you get the same results if you launch 2 fio benchmarks in parallel on 2 different RBD volumes?
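>> For example, something along these lines (just a sketch using fio's rbd engine; the pool and image names here are made up):
>> 
>> fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testvol1 --rw=randwrite --bs=4k --iodepth=32 --direct=1 --runtime=60 --name=bench-vol1 &
>> fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testvol2 --rw=randwrite --bs=4k --iodepth=32 --direct=1 --runtime=60 --name=bench-vol2 &
>> wait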
>> 
>> 
>> ----- Original Message -----
>> 
>> From: "Sebastien Han" <sebastien.han at enovance.com>
>> To: "Cédric Lemarchand" <c.lemarchand at yipikai.org>
>> Cc: "Alexandre DERUMIER" <aderumier at odiso.com>, ceph-users at lists.ceph.com
>> Sent: Tuesday, 2 September 2014 13:59:13
>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>> 
>> @Dan, oops, my bad, I forgot to use these settings. I'll try again and see how much I can get on the read performance side.
>> @Mark, thanks again, and yes, I believe that due to some hardware variance we get different results. I won't say the deviation is negligible, but the results are close enough to say that we are hitting the same limitations (at the Ceph level).
>> @Cédric, yes I did, and what fio was showing was consistent with the iostat output; the same goes for disk utilisation.
>> 
>> 
>> On 02 Sep 2014, at 12:44, Cédric Lemarchand <c.lemarchand at yipikai.org> wrote:
>> 
>>> Hi Sebastien,
>>> 
>>>> On 2 Sep 2014, at 10:41, Sebastien Han <sebastien.han at enovance.com> wrote:
>>>> 
>>>> Hey,
>>>> 
>>>> Well, I ran a fio job that simulates (more or less) what Ceph is doing (journal writes with dsync and O_DIRECT), and the SSD gave me 29K IOPS too.
>>>> I could do this, but to me it definitely looks like a major waste since we don't even get a third of the SSD's performance.
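>>>> A job along these lines reproduces it (only a sketch: the device path is an example, and fio's --sync=1 gives O_SYNC rather than strictly O_DSYNC):
>>>> 
>>>> fio --filename=/dev/sdp --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-sim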
>>> 
>>> Did you have a look at whether the raw SSD IOPS (using iostat -x, for example) show the same results during the fio bench?
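>>> For example (the device name is only a guess, adjust it to the disk backing the OSD):
>>> 
>>> iostat -xm sdp 1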
>>> 
>>> Cheers
>>> 
>>>> 
>>>>> On 02 Sep 2014, at 09:38, Alexandre DERUMIER <aderumier at odiso.com> wrote:
>>>>> 
>>>>> Hi Sebastien,
>>>>> 
>>>>>>> I got 6340 IOPS on a single OSD SSD. (journal and data on the same partition).
>>>>> 
>>>>> Shouldn't it be better to have 2 partitions, 1 for the journal and 1 for the data?
>>>>> 
>>>>> (I'm thinking about filesystem write syncs)
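>>>>> Something like this in ceph.conf, for instance (the paths are only an example, not something I have tested here):
>>>>> 
>>>>> [osd.0]
>>>>>     osd data = /var/lib/ceph/osd/ceph-0    # filesystem on /dev/sdp1
>>>>>     osd journal = /dev/sdp2                # raw journal partition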
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>> 
>>>>> From: "Sebastien Han" <sebastien.han at enovance.com>
>>>>> To: "Somnath Roy" <Somnath.Roy at sandisk.com>
>>>>> Cc: ceph-users at lists.ceph.com
>>>>> Sent: Tuesday, 2 September 2014 02:19:16
>>>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>>>>> 
>>>>> Mark and all, Ceph IOPS performance has definitely improved with Giant.
>>>>> With this version: ceph version 0.84-940-g3215c52 (3215c520e1306f50d0094b5646636c02456c9df4) on Debian 7.6 with Kernel 3.14-0.
>>>>> 
>>>>> I got 6340 IOPS on a single OSD SSD. (journal and data on the same partition).
>>>>> So basically twice the amount of IOPS that I was getting with Firefly.
>>>>> 
>>>>> Random 4K reads went from 12431 to 10201 IOPS, so I'm a bit disappointed here.
>>>>> 
>>>>> The SSD is still under-utilised:
>>>>> 
>>>>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>>> sdp1 0.00 540.37 0.00 5902.30 0.00 47.14 16.36 0.87 0.15 0.00 0.15 0.07 40.15
>>>>> sdp2 0.00 0.00 0.00 4454.67 0.00 49.16 22.60 0.31 0.07 0.00 0.07 0.07 30.61
>>>>> 
>>>>> Thanks a ton for all your comments and assistance guys :).
>>>>> 
>>>>> One last question for Sage (or others that might know): what's the status of the F2FS implementation? (Or maybe we are waiting for F2FS to provide atomic transactions?)
>>>>> I tried to run the OSD on f2fs, however ceph-osd mkfs got stuck on an xattr test:
>>>>> 
>>>>> fremovexattr(10, "user.test at 5848273") = 0
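>>>>> (For context, reproducing this is basically: mkfs.f2fs the partition, mount it as the OSD data dir, then run ceph-osd --mkfs. The paths and OSD id below are just an example.)
>>>>> 
>>>>> mkfs.f2fs /dev/sdp1
>>>>> mount -t f2fs /dev/sdp1 /var/lib/ceph/osd/ceph-0
>>>>> ceph-osd -i 0 --mkfs --mkkey    # the step that got stuck on the xattr test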
>>>>> 
>>>>>> On 01 Sep 2014, at 11:13, Sebastien Han <sebastien.han at enovance.com> wrote:
>>>>>> 
>>>>>> Mark, thanks a lot for experimenting with this for me.
>>>>>> I'm gonna try master soon and will tell you how much I can get.
>>>>>> 
>>>>>> It's interesting to see that using 2 SSDs brings more performance, even though both SSDs are under-utilized.
>>>>>> They should be able to sustain both loads at the same time (journal and osd data).
>>>>>> 
>>>>>>> On 01 Sep 2014, at 09:51, Somnath Roy <Somnath.Roy at sandisk.com> wrote:
>>>>>>> 
>>>>>>> As I said, 107K is with the IOs being served from memory, not hitting the disk.
>>>>>>> 
>>>>>>> From: Jian Zhang [mailto:amberzhang86 at gmail.com]
>>>>>>> Sent: Sunday, August 31, 2014 8:54 PM
>>>>>>> To: Somnath Roy
>>>>>>> Cc: Haomai Wang; ceph-users at lists.ceph.com
>>>>>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>>>>>>> 
>>>>>>> Somnath,
>>>>>>> on the small workload performance, 107k is higher than the theoretical IOPS of 520, any idea why?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> Single client is ~14K iops, but scaling as number of clients increases. 10 clients ~107K iops. ~25 cpu cores are used.
>>>>>>> 
>>>>>>> 
>>>>>>> 2014-09-01 11:52 GMT+08:00 Jian Zhang <amberzhang86 at gmail.com>:
>>>>>>> Somnath,
>>>>>>> on the small workload performance,
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 2014-08-29 14:37 GMT+08:00 Somnath Roy <Somnath.Roy at sandisk.com>:
>>>>>>> 
>>>>>>> Thanks Haomai !
>>>>>>> 
>>>>>>> Here is some of the data from my setup.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>>> 
>>>>>>> Set up:
>>>>>>> 
>>>>>>> --------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 32-core CPU with HT enabled, 128 GB RAM, one SSD (both journal and data) -> one OSD. 5 client machines with 12-core CPUs, each running two instances of ceph_smalliobench (10 clients total). Network is 10GbE.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Workload:
>>>>>>> 
>>>>>>> -------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Small workload: 20K objects of 4K size, and io_size is also 4K RR. The intent is to serve the IOs from memory so that it can uncover the performance problems within a single OSD.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Results from Firefly:
>>>>>>> 
>>>>>>> --------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Single client throughput is ~14K IOPS, but as the number of clients increases the aggregated throughput does not increase. 10 clients: ~15K IOPS. ~9-10 CPU cores are used.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Result with latest master:
>>>>>>> 
>>>>>>> ------------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Single client is ~14K IOPS, and it scales as the number of clients increases. 10 clients: ~107K IOPS. ~25 CPU cores are used.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> More realistic workload:
>>>>>>> 
>>>>>>> -----------------------------
>>>>>>> 
>>>>>>> Let's see how it performs when > 90% of the IOs are served from disks.
>>>>>>> 
>>>>>>> Setup:
>>>>>>> 
>>>>>>> -------
>>>>>>> 
>>>>>>> 40-CPU-core server as a cluster node (single-node cluster) with 64 GB RAM. 8 SSDs -> 8 OSDs. One similar node for the monitor and RGW. Another node for the client running fio/vdbench. 4 RBDs are configured with the 'noshare' option. 40GbE network.
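>>>>>>> The 'noshare' part refers to the kernel rbd map option, i.e. each image is mapped roughly like this so it gets its own client instance (the image name here is a placeholder):
>>>>>>> 
>>>>>>> rbd map rbd/image1 -o noshare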
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Workload:
>>>>>>> 
>>>>>>> ------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 8 SSDs are populated, so 8 * 800 GB = ~6.4 TB of data. io_size = 4K RR.
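>>>>>>> Each client job is in the spirit of the following (a sketch only; the mapped device path and queue depth are assumptions):
>>>>>>> 
>>>>>>> fio --filename=/dev/rbd0 --direct=1 --rw=randread --bs=4k --iodepth=32 --runtime=300 --group_reporting --name=rbd0-4k-randread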
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Results from Firefly:
>>>>>>> 
>>>>>>> ------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Aggregated output with 4 RBD clients stressing the cluster in parallel is ~20-25K IOPS; CPU cores used: ~8-10 (maybe less, I can't remember precisely).
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Results from latest master:
>>>>>>> 
>>>>>>> --------------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Aggregated output with 4 RBD clients stressing the cluster in parallel is ~120K IOPS; the CPU is 7% idle, i.e. ~37-38 CPU cores used.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hope this helps.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks & Regards
>>>>>>> 
>>>>>>> Somnath
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Haomai Wang [mailto:haomaiwang at gmail.com]
>>>>>>> Sent: Thursday, August 28, 2014 8:01 PM
>>>>>>> To: Somnath Roy
>>>>>>> Cc: Andrey Korolyov; ceph-users at lists.ceph.com
>>>>>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Roy,
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I already scanned your merged code for "fdcache" and "optimizing lfn_find/lfn_open"; could you give some performance improvement data for it? I fully agree with your direction, do you have any update on it?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> As for the messenger level, I have some very early work on it (https://github.com/yuyuyu101/ceph/tree/msg-event); it contains a new messenger implementation which supports different event mechanisms.
>>>>>>> 
>>>>>>> It looks like it will take at least one more week to make it work.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Fri, Aug 29, 2014 at 5:48 AM, Somnath Roy <Somnath.Roy at sandisk.com> wrote:
>>>>>>>> 
>>>>>>>> Yes, from what I saw, the messenger-level bottleneck is still huge!
>>>>>>> 
>>>>>>>> Hopefully the RDMA messenger will resolve that, and the performance gain will be significant for reads (on SSDs). For writes, we need to uncover the OSD bottlenecks first to take advantage of the improved upstream.
>>>>>>> 
>>>>>>>> What I experienced is that until you remove the very last bottleneck, the performance improvement will not be visible, and that could be confusing because you might think that the upstream improvement you made is not valid (which is not the case).
>>>>>>> 
>>>>>>> 
>>>>>>>> Thanks & Regards
>>>>>>> 
>>>>>>>> Somnath
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>> 
>>>>>>>> From: Andrey Korolyov [mailto:andrey at xdel.ru]
>>>>>>> 
>>>>>>>> Sent: Thursday, August 28, 2014 12:57 PM
>>>>>>> 
>>>>>>>> To: Somnath Roy
>>>>>>> 
>>>>>>>> Cc: David Moreau Simard; Mark Nelson; ceph-users at lists.ceph.com
>>>>>>> 
>>>>>>>> Subject: Re: [Single OSD performance on SSD] Can't go over 3, 2K IOPS
>>>>>>> 
>>>>>>> 
>>>>>>>> On Thu, Aug 28, 2014 at 10:48 PM, Somnath Roy <Somnath.Roy at sandisk.com> wrote:
>>>>>>> 
>>>>>>>>> Nope, this will not be backported to Firefly, I guess.
>>>>>>> 
>>>>>>> 
>>>>>>>>> Thanks & Regards
>>>>>>> 
>>>>>>>>> Somnath
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Thanks for sharing this, the first thing that came to mind when I looked at this thread was your patches :)
>>>>>>> 
>>>>>>> 
>>>>>>>> If Giant incorporates them, both the RDMA support and those patches should give a huge performance boost for RDMA-enabled Ceph back-end networks.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Wheat
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
> 
> 


Cheers.
----
Sébastien Han
Cloud Architect

"Always give 100%. Unless you're giving blood."

Phone: +33 (0)1 49 70 99 72
Mail: sebastien.han at enovance.com
Address : 11 bis, rue Roquépine - 75008 Paris
Web : www.enovance.com - Twitter : @enovance 
