Since I am running with a single OSD, the rbd image size I am creating is close to the OSD disk size after partitioning for db/wal.
Yes, if you have limited CPU, the benefit will not be visible, as the CPU will be saturated.

Thanks & Regards
Somnath

-----Original Message-----
From: Jianjian Huo [mailto:samuel.huo@xxxxxxxxx]
Sent: Friday, January 27, 2017 12:24 PM
To: Somnath Roy
Cc: LIU, Fei; Mark Nelson (mnelson@xxxxxxxxxx); Sage Weil (sweil@xxxxxxxxxx); ceph-devel
Subject: Re: BlueStore: Multi sharded kv pull request

Hi Somnath,

On Mon, Jan 23, 2017 at 12:14 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
> James,
> The 2X performance gain over filestore is for a 600G volume, but for a 1G volume Bluestore + rocks is almost similar to filestore. For an even bigger volume (per osd) I think performance will go below filestore.
> The expectation is that bluestore + ZS performance should not go down with bigger volume size. But even if we are seeing some drop (need to see why?), it is much less than the drop with bluestore + rocks.
> Yes, because of its B+ tree based architecture, ZS doesn't need to do compaction and pays the cost up front. This is why with a smaller volume rocks should outperform ZS, but with bigger metadata (bigger volume) ZS should give much higher throughput. It looks like the crossover point is between 600G and 1TB image size / OSD.

You are talking about the bluestore OSD size, not the rbd image size, right?
For the 600GB case, bluestore rocks/ZS cpu utilization is 2.5x~3x that of filestore. This will probably hold back the 2x 4K RW performance gain when cpu power is limited, as with multiple SSDs per server.

Jianjian

>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: LIU, Fei [mailto:james.liu@xxxxxxxxxxxxxxx]
> Sent: Sunday, January 22, 2017 11:32 PM
> To: Somnath Roy; Mark Nelson (mnelson@xxxxxxxxxx); Sage Weil (sweil@xxxxxxxxxx)
> Cc: ceph-devel
> Subject: Re: BlueStore: Multi sharded kv pull request
>
> Hi Somnath,
> The data is pretty promising.
> In particular, we saw an almost 2x performance gain in 4K RW compared to filestore. Even compared with RocksDB, ZS has a big performance gain as well. That's a big achievement over the last several months. Can I assume ZS will have an advantage in read performance compared to RocksDB because of the B+ tree used in ZS? Do you guys have any performance data like 16K RW/RR for reference?
>
> Thanks,
> James
>
> This email and its attachments contain confidential information from Alibaba Group, which is intended only for the person or entity whose address is listed above. Any use of information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this email in error, please notify the sender by phone or email immediately and delete it.
>
> On 1/23/17, 12:40 PM, "Somnath Roy" <Somnath.Roy@xxxxxxxxxxx> wrote:
>
> Hi James,
> Sorry if I was not clear earlier. The 3rd sheet in the xls is with bluestore + rocks (not ZS), and its sole purpose is to demonstrate the benefit of multi kv sync threads.
> The test I ran here was without preconditioning the image, unlike sheets 1 and 2. This is why we are seeing the drop, and it is mostly because of rocks compaction.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: LIU, Fei [mailto:james.liu@xxxxxxxxxxxxxxx]
> Sent: Sunday, January 22, 2017 7:04 PM
> To: Somnath Roy; Mark Nelson (mnelson@xxxxxxxxxx); Sage Weil (sweil@xxxxxxxxxx)
> Cc: ceph-devel
> Subject: Re: BlueStore: Multi sharded kv pull request
>
> Hi Somnath,
> Thanks for sharing the data. We can see IOPS drop a lot, from 40000 to ~15000, as time goes by. Is there any particular reason for ZetaScale to drop that much over that time period?
>
> Regards,
> James
>
> On 1/21/17, 7:32 AM, "Somnath Roy" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of Somnath.Roy@xxxxxxxxxxx> wrote:
>
> Hi Mark/Sage,
> Please find the different comparison data in the following document.
>
> https://drive.google.com/file/d/0B7W-S0z_ymMJUXVmOUhINU01c3c/view?usp=sharing
>
> Please download the doc (and open it as xls), as Google is not able to show the graphs properly.
>
> Setup:
> ------
> Single osd on a 700G NVMe drive, and single osd on 2 700G NVMe drives (LVM'ed)
> 48 core server, 40G link
> Test is only for 4K RW from fio.
>
> 1. The first sheet shows iops and cpu utilization for Bluestore + rocks, Bluestore + ZS and filestore.
> This is with small shards and with the hack we are using for preconditioning.
> Bluestore + rocks with 16K min_alloc and ZS with 4K min_alloc.
>
> We can see that Bluestore with rocks and with ZS behave almost similarly for a 600G image, and both are ~2X higher than filestore. ZS cpu utilization and WA (data not there in the xls) are higher than rocks.
>
> 2. Next, I created 3 LVM volumes (data/db/wal) out of 2 NVMe drives and created an image of 1TB. See in the next sheet how bluestore + rocks performance came down. I didn't have time for the filestore data, but the expectation is that it will remain similar to the previous sheet.
> Now, ZS is running with 16K min_alloc size here, with the prototype shim implementation I was talking about in the standup.
> This implementation is not fully crash safe, but the expectation is that when we are done implementing the write ahead log in ZS it should produce similar throughput.
> This is giving ~90% benefit over rocks, and ZS with 4K min_alloc (like the previous sheet) is giving ~50% benefit (not plotted here). Cpu util is similar to rocks.
>
> 3. This sheet is to demonstrate the benefit of single kv sync vs multi kv sync with rocks. With ZS we *need* multi kv, but with rocks, as you can see, we are gaining (~20%) only during peak performance.
> Later the db is getting in the way. I think peak performance is limited today by the osd upstream; if in future we can optimize that and allow more traffic to come into Bluestore (objectstore), the benefit of multiple sharded kv sync will be greater. Here is the pull request for this:
>
> https://github.com/ceph/ceph/pull/13037
>
> Thanks & Regards
> Somnath
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
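
For anyone who wants to sanity-check the percentages quoted in this thread, here is a minimal Python sketch of the arithmetic. The helper names `relative_change` and `speedup_from_benefit` are made up for illustration, and the inputs are only the approximate figures mentioned above (40000 and ~15000 IOPS, "~90% benefit"). Nothing here is taken from the spreadsheet itself.

```python
def relative_change(baseline: float, measured: float) -> float:
    """Return the change of `measured` relative to `baseline`, in percent.

    Negative values mean a drop, positive values mean a gain.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured - baseline) / baseline * 100.0


def speedup_from_benefit(benefit_percent: float) -> float:
    """Convert a '~N% benefit' figure into a throughput multiplier."""
    return 1.0 + benefit_percent / 100.0


if __name__ == "__main__":
    # Approximate IOPS quoted in the thread: 40000 at peak, ~15000 later.
    print(f"change over time: {relative_change(40000, 15000):.1f}%")
    # '~90% benefit over rocks' read as a throughput multiplier.
    print(f"multiplier: {speedup_from_benefit(90):.2f}x")
```

Read this way, the drop from 40000 to ~15000 IOPS is a loss of a bit more than 60%, and a "~90% benefit" corresponds to roughly 1.9x the baseline throughput.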