Re: BlueStore: Multi sharded kv pull request


 



Hi Somnath,
   Thanks for sharing the data. We can see IOPS drop significantly, from 40000 to ~15000, as time goes by. Any particular reason for ZetaScale to drop that much over that period?

   Regards,
   James

This email and its attachments contain confidential information from Alibaba Group, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this email in error, please notify the sender by phone or email immediately and delete it.

On 1/21/17, 7:32 AM, "Somnath Roy" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of Somnath.Roy@xxxxxxxxxxx> wrote:

    Hi Mark/Sage,
    Please find the different comparison data in the following document.
    
    https://drive.google.com/file/d/0B7W-S0z_ymMJUXVmOUhINU01c3c/view?usp=sharing
    
    Please download the doc (and open it in Excel), as Google Drive is not able to render the graphs properly.
    
    Setup:
    ------
    Single OSD on a 700G NVMe drive, and a single OSD on two 700G NVMe drives (combined with LVM).
    48-core server, 40G link.
    The test is 4K RW (random write) only, driven by fio.
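    
    For reference, a 4K random-write fio job against an RBD image on this setup looks roughly like the sketch below; the client/pool/image names, queue depth, job count and runtime are placeholders, not the exact job file used here:
    
        [4k-randwrite]
        ioengine=rbd
        clientname=admin
        pool=rbd
        rbdname=testimage
        rw=randwrite
        bs=4k
        iodepth=64
        numjobs=8
        time_based=1
        runtime=600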
    
    1. The first sheet shows IOPS and CPU utilization for BlueStore + RocksDB, BlueStore + ZS, and FileStore.
    This is with small shards and with the hack we are using for preconditioning.
    BlueStore + RocksDB is running with 16K min_alloc and ZS with 4K min_alloc (sketched in ceph.conf terms after this item).
    
    We can see BlueStore with RocksDB and with ZS behaving almost identically for a 600G image, and both are ~2X higher than FileStore. ZS CPU utilization and write amplification (WA; data not in the xls) are higher than with RocksDB.
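    
    In ceph.conf terms, the allocation-unit difference between the two runs amounts to something like the sketch below (value shown for the BlueStore + RocksDB run; the ZS run used 4096):
    
        [osd]
        # 16K allocation unit for BlueStore + RocksDB; the ZS run used 4K (4096)
        bluestore_min_alloc_size = 16384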
    
    2. Next, I created 3 LVM volumes (data/db/wal) out of the 2 NVMe drives and created an image of 1TB (a rough sketch of the LVM layout follows below). See in the next sheet how BlueStore + RocksDB performance came down. I didn't have time for the FileStore data, but the expectation is that it will remain similar to the previous sheet. ZS is running with a 16K min_alloc size here, with the prototype shim implementation I was talking about in the standup.
    This implementation is not yet fully crash safe, but the expectation is that once we are done implementing the write-ahead log in ZS it should produce similar throughput.
    This is giving ~90% benefit over RocksDB, and ZS with 4K min_alloc (like the previous sheet) is giving ~50% benefit (not plotted here). CPU utilization is similar to RocksDB.
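    
    For reference, carving the three volumes out of the two drives looks roughly like the sketch below; the volume-group name, sizes and device names are placeholders, not the actual layout used here:
    
        # Pool both NVMe drives into one volume group, then carve out data/db/wal
        pvcreate /dev/nvme0n1 /dev/nvme1n1
        vgcreate ceph-nvme /dev/nvme0n1 /dev/nvme1n1
        lvcreate -n osd-data -L 1200G ceph-nvme
        lvcreate -n osd-db   -L 100G  ceph-nvme
        lvcreate -n osd-wal  -L 10G   ceph-nvme
        
        # ceph.conf then points BlueStore at the three volumes, e.g.:
        #   bluestore_block_path     = /dev/ceph-nvme/osd-data
        #   bluestore_block_db_path  = /dev/ceph-nvme/osd-db
        #   bluestore_block_wal_path = /dev/ceph-nvme/osd-wal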
    
    
    3. This sheet demonstrates the benefit of a single kv sync thread vs. multiple sharded kv sync threads with RocksDB. With ZS we *need* multiple kv syncs, but with RocksDB, as you can see, we are gaining (~20%) only during the peak-performance period.
    Later the db gets in the way. I think peak performance is limited today by the OSD code upstream of the ObjectStore; if in the future we can optimize that and allow more traffic to reach BlueStore, the benefit of multiple sharded kv syncs will be greater. Here is the pull request for this (a conceptual sketch follows after the link):
    
    https://github.com/ceph/ceph/pull/13037
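    
    For intuition only, the sharding idea amounts to something like the C++ sketch below. This is an illustrative stand-in, not the code in the pull request; KVTransaction is a hypothetical placeholder for the backend (RocksDB/ZS) transaction handle, and the real synchronous commit call would go where the comment marks it:
    
        // Illustrative sketch only -- not the code from the pull request.
        // Idea: instead of one global kv sync thread, shard transactions across
        // N sync threads, each draining and committing its own queue.
        #include <condition_variable>
        #include <cstddef>
        #include <deque>
        #include <functional>
        #include <mutex>
        #include <thread>
        #include <vector>
        
        struct KVTransaction {              // hypothetical backend txn handle
          std::function<void()> on_commit;  // e.g. completion/ack callback
        };
        
        class ShardedKVSync {
          struct Shard {
            std::mutex lock;
            std::condition_variable cond;
            std::deque<KVTransaction> queue;
            bool stop = false;
            std::thread thread;
          };
          std::vector<Shard> shards_;
        
          static void sync_loop(Shard* s) {
            for (;;) {
              std::deque<KVTransaction> batch;
              {
                std::unique_lock<std::mutex> l(s->lock);
                s->cond.wait(l, [s] { return s->stop || !s->queue.empty(); });
                if (s->stop && s->queue.empty())
                  return;
                batch.swap(s->queue);
              }
              // One synchronous flush of the whole batch would go here (e.g. the
              // backend's submit_transaction_sync); each shard syncs independently
              // instead of serializing behind a single kv sync thread.
              for (auto& txn : batch)
                if (txn.on_commit)
                  txn.on_commit();
            }
          }
        
        public:
          explicit ShardedKVSync(size_t nshards) : shards_(nshards) {
            for (auto& s : shards_) {
              Shard* sp = &s;
              s.thread = std::thread([sp] { sync_loop(sp); });
            }
          }
          ~ShardedKVSync() {
            for (auto& s : shards_) {
              { std::lock_guard<std::mutex> l(s.lock); s.stop = true; }
              s.cond.notify_one();
              s.thread.join();
            }
          }
          // Shard by, e.g., object hash so per-object ordering is preserved.
          void queue(size_t shard_key, KVTransaction txn) {
            Shard& s = shards_[shard_key % shards_.size()];
            {
              std::lock_guard<std::mutex> l(s.lock);
              s.queue.push_back(std::move(txn));
            }
            s.cond.notify_one();
          }
        };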
    
    Thanks & Regards
    Somnath
    
    ________________________________
    
    PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
    
    





