Re: Possible improvements for a slow write speed (excluding independent SSD journals)

Mark Nelson <mnelson@xxxxxxxxxx> · Mon, 20 Apr 2015 11:19:16 -0500

How are you measuring the 300MB/s and 184MB/s?  I.e., is it per drive,
or the client throughput?  Also, what controller do you have?  We've
seen some controllers from certain manufacturers start to top out at
around 1-2GB/s with write cache enabled.

Mark

On 04/20/2015 11:15 AM, Barclay Jameson wrote:
I have an SSD pool for testing (only 8 drives). When I use 1 SSD for
the journal and 1 SSD for the data, I get > 300 MB/s write. When I
change all 8 disks to house the journal, I get < 184 MB/s write.

On Mon, Apr 20, 2015 at 10:16 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

    The big question is how fast these drives can do O_DSYNC writes.
    The basic gist of this is that for every write to the journal, an
    ATA_CMD_FLUSH call is made to ensure that the device (or potentially
    the controller) knows that this data really needs to be stored safely
    before the flush is acknowledged.  How this gets handled is really
    important.

    1) If devices have limited or no power loss protection, they need to
    flush the contents of any caches to non-volatile memory.  How
    quickly this can happen depends on a lot of factors, but even on
    SSDs it may be slow enough to limit performance greatly relative to
    how quickly writes can proceed if uninterrupted.

    * It's very important to note that some devices that lack power loss
    protection may simply *ignore* ATA_CMD_FLUSH and return immediately
    so as to appear fast, even though this means that data may become
    corrupt.  Be very careful putting journals on devices that do this!

    ** Some devices that have claimed to have power loss protection
    don't actually have capacitors big enough to flush data from cache.
    This has led to huge amounts of confusion and you have to be very
    careful.  For a specific example see the section titled "The Truth
    About Micron's Power-Loss Protection" here:
    http://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder

    2) Devices that feature proper power loss protection such that
    caches can be flushed in the event of power failure can safely
    ignore ATA_CMD_FLUSH and return immediately when ATA_CMD_FLUSH is
    called.  This greatly improves the performance of ceph journal
    writes and usually allows the journal to perform at or near the
    theoretical sequential write performance of the device.

    3) Some controllers may be able to intercept these calls and return
    immediately on ATA_CMD_FLUSH if they have an on-board BBU that
    functions in the same way as PLP on the drives would.  Unfortunately
    on many controllers this is tied to enabling writeback cache and
    running the drives in some kind of RAID mode (single-disk RAID0 LUNs
    are often used for Ceph OSDs with this kind of setup).  In some
    cases the controller itself can become a bottleneck with SSDs so
    it's important to test this out and make sure it works well in practice.

    Regarding the 840 EVO, based on user reports it sounds like it does
    not have PLP and does flush data on ATA_CMD_FLUSH, resulting in
    quite a bit slower performance when doing O_DSYNC writes.
    Unfortunately we don't have any in the lab we can test, but this is
    likely why you are seeing slower write performance on them when
    journals are placed on the SSD.
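
One rough way to check how a given drive handles the O_DSYNC writes
described above is to time small synchronous writes to a scratch file
on it. The Python sketch below is only an illustration, not how Ceph
itself writes its journal: the function name dsync_write_mb_s, the
4 KiB block size and the block count are assumptions chosen for the
example, and it assumes a Linux host where os.O_DSYNC is available.

```python
import os
import time


def dsync_write_mb_s(path, block_size=4096, count=256):
    """Write `count` blocks of `block_size` bytes with O_DSYNC; return MB/s.

    Hypothetical helper for illustration only. `path` should point at a
    scratch file on a filesystem backed by the drive under test.
    """
    # O_DSYNC asks the kernel to return from each write() only after the
    # data has been reported as stored by the device.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    block = b"\0" * block_size
    written = 0
    try:
        start = time.perf_counter()
        for _ in range(count):
            written += os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    # Guard against a zero interval on very fast (e.g. RAM-backed) targets.
    return written / max(elapsed, 1e-9) / 1e6
```

Running it against a file on the SSD in question, and again with
os.O_DSYNC removed from the flags, gives a feel for how much the flush
handling costs on that device. Numbers taken through a filesystem are
only indicative.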

    Mark

    On 04/20/2015 09:48 AM, J-P Methot wrote:

        My journals are on-disk, each disk being a SSD. The reason I
        didn't go with dedicated drives for journals is that when
        designing the setup, I was told that having dedicated journal
        SSDs on a full-SSD setup would not give me performance increases.

        So that makes the journal disk to data disk ratio 1:1.

        The replication size is 3, yes. The pools are replicated.

        On 4/20/2015 10:43 AM, Barclay Jameson wrote:

            Are your journals on separate disks? What is your ratio of
            journal disks to data disks? Are you doing replication
            size 3?

            On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot
            <jpmethot@xxxxxxxxxx> wrote:

                 Hi,

                 This is similar to another thread running right now, but
                 since our current setup is completely different from the
                 one described in the other thread, I thought it may be
                 better to start a new one.

                 We are running Ceph Firefly 0.80.8 (soon to be upgraded to
                 0.80.9). We have 6 OSD hosts with 16 OSDs each (so a total
                 of 96 OSDs). Each OSD is a Samsung SSD 840 EVO on which I
                 can reach write speeds of roughly 400 MB/sec, plugged in
                 as JBOD on a controller that can theoretically transfer at
                 6 Gb/sec. All of that is linked to OpenStack compute nodes
                 on two bonded 10 Gbps links (so a max transfer rate of
                 20 Gbps).

                 When I run rados bench from the compute nodes, I reach the
                 network cap in read speed. However, write speeds are vastly
                 inferior, reaching about 920 MB/sec. If I have 4 compute
                 nodes running the write benchmark at the same time, I can
                 see the number plummet to 350 MB/sec. For our planned
                 usage, we find it to be rather slow, considering we will
                 run a high number of virtual machines in there.
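
For a sense of scale, the figures quoted above can be turned into
rough ceilings. The Python sketch below is a back-of-the-envelope
calculation only; the function names are hypothetical, and it assumes
the usual FileStore behaviour of writing each object once to the
journal and once to the data partition, so that a co-located journal
doubles the writes landing on each SSD.

```python
def network_cap_mb_s(links, gbps_per_link):
    """Aggregate link capacity in MB/s (decimal units, no protocol overhead)."""
    return links * gbps_per_link * 1000 / 8


def filestore_write_ceiling_mb_s(osds, per_osd_mb_s, replicas,
                                 journal_on_data_disk=True):
    """Idealised client write ceiling for a replicated pool.

    Assumption: every client byte is written `replicas` times, and each
    replica is written twice (journal + data) when the journal shares
    the data disk.
    """
    writes_per_replica = 2 if journal_on_data_disk else 1
    return osds * per_osd_mb_s / (replicas * writes_per_replica)


# Figures from the message: two bonded 10 Gbps links, 96 OSDs at
# roughly 400 MB/s each, replication size 3, journals on the data SSDs.
print(network_cap_mb_s(2, 10))                    # 2500.0
print(filestore_write_ceiling_mb_s(96, 400, 3))   # 6400.0
```

Both ceilings sit well above the roughly 920 MB/sec observed, which
suggests the 400 MB/sec per-drive figure does not carry over to the
journal's write pattern. This is an idealised estimate and ignores
CPU, controller and latency effects.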

                 Of course, the first thing to do would be to move the
                 journals to faster drives. However, these are SSDs we're
                 talking about. We don't really have access to faster
                 drives. I must find a way to get better write speeds.
                 Thus, I am looking for suggestions as to how to make it
                 faster.

                 I have also thought of options myself, like:
                 - Upgrading to the latest stable Hammer version (would that
                   really give me a big performance increase?)
                 - CRUSH map modifications? (This is a long shot, but I'm
                   still using the default CRUSH map; maybe there's a change
                   there I could make to improve performance.)

                 Any suggestions as to anything else I can tweak would be
                 strongly appreciated.

                 For reference, here's part of my ceph.conf:

                 [global]
                 auth_service_required = cephx
                 filestore_xattr_use_omap = true
                 auth_client_required = cephx
                 auth_cluster_required = cephx
                 osd pool default size = 3

                 osd pg bits = 12
                 osd pgp bits = 12
                 osd pool default pg num = 800
                 osd pool default pgp num = 800

                 [client]
                 rbd cache = true
                 rbd cache writethrough until flush = true

                 [osd]
                 filestore_fd_cache_size = 1000000
                 filestore_omap_header_cache_size = 1000000
                 filestore_fd_cache_random = true
                 filestore_queue_max_ops = 5000
                 journal_queue_max_ops = 1000000
                 max_open_files = 1000000
                 osd journal size = 10000

                 --
                 ======================
                 Jean-Philippe Méthot
                 Administrateur système / System administrator
                 GloboTech Communications
                 Phone: 1-514-907-0050
                 Toll Free: 1-(888)-GTCOMM1
                 Fax: 1-(514)-907-0750
                 jpmethot@xxxxxxxxxx
                 http://www.gtcomm.net

                 _______________________________________________
                 ceph-users mailing list
                 ceph-users@xxxxxxxxxxxxxx
                 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
