Re: Possible improvements for a slow write speed (excluding independent SSD journals)

Mark Nelson <mnelson@xxxxxxxxxx> · Mon, 20 Apr 2015 10:16:29 -0500

The big question is how fast these drives can do O_DSYNC writes.  The 
basic gist of this is that for every write to the journal, an 
ATA_CMD_FLUSH call is made to ensure that the device (or potentially the 
controller) knows that this data really needs to be stored safely before 
the flush is acknowledged.  How this gets handled is really important.

1) If devices have limited or no power loss protection, they need to 
flush the contents of any caches to non-volatile memory.  How quickly 
this can happen depends on a lot of factors, but even on SSDs may be 
slow enough to limit performance greatly relative to how quickly writes 
can proceed if uninterrupted.

* It's very important to note that some devices that lack power loss 
protection may simply *ignore* ATA_CMD_FLUSH and return immediately so 
as to appear fast, even though this means that data may become corrupt.  
Be very careful putting journals on devices that do this!

** Some devices that have claimed to have power loss protection don't 
actually have capacitors big enough to flush data from cache.  This has 
led to huge amounts of confusion and you have to be very careful.  For 
a specific example see the section titled "The Truth About Micron's 
Power-Loss Protection" here: 
http://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder

2) Devices that feature proper power loss protection such that caches 
can be flushed in the event of power failure can safely ignore 
ATA_CMD_FLUSH and return immediately when ATA_CMD_FLUSH is called.  This 
greatly improves the performance of ceph journal writes and usually 
allows the journal to perform at or near the theoretical sequential 
write performance of the device.

3) Some controllers may be able to intercept these calls and return 
immediately on ATA_CMD_FLUSH if they have an on-board BBU that functions 
in the same way as PLP on the drives would.  Unfortunately on many 
controllers this is tied to enabling writeback cache and running the 
drives in some kind of RAID mode (single-disk RAID0 LUNs are often used 
for Ceph OSDs with this kind of setup).  In some cases the controller 
itself can become a bottleneck with SSDs so it's important to test this 
out and make sure it works well in practice.

Regarding the 840 EVO, user reports suggest that it does not have PLP 
and does flush data on ATA_CMD_FLUSH, resulting in considerably slower 
performance when doing O_DSYNC writes.  Unfortunately we don't have any 
in the lab to test, but this is likely why you are seeing slower write 
performance when journals are placed on these SSDs.
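
One rough way to see how a given drive handles O_DSYNC writes is to 
time small synchronous writes against ordinary buffered writes on a 
filesystem that lives on that drive.  The sketch below is only an 
illustration: the function names, the 4 KB block size and the probe 
file name are assumptions made for this example, and because it goes 
through the filesystem rather than the raw device it only approximates 
what the Ceph journal sees.

```python
import os
import tempfile
import time


def timed_writes(path, block_size=4096, count=200, dsync=False):
    """Write `count` blocks of `block_size` bytes; return elapsed seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync:
        # With O_DSYNC each write() returns only once the data is reported
        # as durable, which is the behaviour discussed above.
        flags |= os.O_DSYNC
    block = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory, block_size=4096, count=200):
    """Return buffered and O_DSYNC throughput (MB/s) for `directory`."""
    path = os.path.join(directory, "dsync_probe.bin")  # hypothetical name
    try:
        buffered = timed_writes(path, block_size, count, dsync=False)
        synced = timed_writes(path, block_size, count, dsync=True)
    finally:
        if os.path.exists(path):
            os.remove(path)
    megabytes = block_size * count / 1e6
    return {
        "buffered_mb_s": megabytes / buffered,
        "dsync_mb_s": megabytes / synced,
    }


if __name__ == "__main__":
    # The default temp directory may be tmpfs; point this at a directory
    # on the SSD under test to get a meaningful number.
    with tempfile.TemporaryDirectory() as scratch:
        print(compare(scratch))
```

A drive that honors the flush without PLP will typically show a much 
lower O_DSYNC figure than its buffered figure; a drive that reports 
both as nearly identical either has PLP or may be ignoring the flush.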

Mark

On 04/20/2015 09:48 AM, J-P Methot wrote:
My journals are on-disk, each disk being a SSD. The reason I didn't go
with dedicated drives for journals is that when designing the setup, I
was told that having dedicated journal SSDs on a full-SSD setup would
not give me performance increases.

So that makes the journal disk to data disk ratio 1:1.

The replication size is 3, yes. The pools are replicated.

On 4/20/2015 10:43 AM, Barclay Jameson wrote:
Are your journals on separate disks? What is your ratio of journal
disks to data disks? Are you doing replication size 3 ?

On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot <jpmethot@xxxxxxxxxx> wrote:

    Hi,

    This is similar to another thread running right now, but since our
    current setup is completely different from the one described in
    the other thread, I thought it may be better to start a new one.

    We are running Ceph Firefly 0.80.8 (soon to be upgraded to
    0.80.9). We have 6 OSD hosts with 16 OSDs each (so a total of 96
    OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach
    write speeds of roughly 400 MB/sec, plugged in as JBOD on a
    controller that can theoretically transfer at 6 Gb/sec. All of that
    is linked to OpenStack compute nodes on two bonded 10 Gbps links
    (so a max transfer rate of 20 Gbps).

    When I run rados bench from the compute nodes, I reach the network
    cap in read speed. However, write speeds are vastly inferior,
    reaching about 920 MB/sec. If I have 4 compute nodes running the
    write benchmark at the same time, I can see the number plummet to
    350 MB/sec. For our planned usage, we find it to be rather slow,
    considering we will run a high number of virtual machines in there.
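
As a rough sanity check on the numbers above, the following sketch 
estimates an upper bound on client write throughput.  It rests on two 
assumptions that are not stated in the message: that with the journal 
on the same SSD as the data every byte is written twice per OSD, and 
that the 400 MB/sec figure holds under that mixed load.  The variable 
names are made up for this example.

```python
# Figures taken from the message above.
osd_count = 6 * 16            # 6 hosts with 16 OSDs each
ssd_write_mb_s = 400          # per-SSD sequential write speed
replication_size = 3          # "osd pool default size = 3"
bonded_link_gbps = 20         # two bonded 10 Gbps links

# Assumption: a journal colocated with the data means each byte is
# written twice on the same device (once to the journal, once to data).
journal_write_factor = 2

raw_mb_s = osd_count * ssd_write_mb_s
client_ceiling_mb_s = raw_mb_s / (replication_size * journal_write_factor)
network_ceiling_mb_s = bonded_link_gbps * 1000 / 8

print(f"raw aggregate SSD write:   {raw_mb_s} MB/s")
print(f"client-visible ceiling:    {client_ceiling_mb_s:.0f} MB/s")
print(f"network ceiling per node:  {network_ceiling_mb_s:.0f} MB/s")
```

Under these assumptions the disks would allow about 6400 MB/s and the 
network about 2500 MB/s per compute node, so the observed 920 MB/sec 
sits well below both ceilings.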

    Of course, the first thing to do would be to move the journals
    to faster drives. However, these are SSDs we're talking about. We
    don't really have access to faster drives. I must find a way to
    get better write speeds. Thus, I am looking for suggestions as to
    how to make it faster.

    I have also thought of options myself like:
    -Upgrading to the latest stable hammer version (would that really
    give me a big performance increase?)
    -Crush map modifications? (this is a long shot, but I'm still
    using the default crush map, maybe there's a change there I could
    make to improve performance)

    Any suggestions as to anything else I can tweak would be strongly
    appreciated.

    For reference, here's part of my ceph.conf:

    [global]
    auth_service_required = cephx
    filestore_xattr_use_omap = true
    auth_client_required = cephx
    auth_cluster_required = cephx
    osd pool default size = 3

    osd pg bits = 12
    osd pgp bits = 12
    osd pool default pg num = 800
    osd pool default pgp num = 800

    [client]
    rbd cache = true
    rbd cache writethrough until flush = true

    [osd]
    filestore_fd_cache_size = 1000000
    filestore_omap_header_cache_size = 1000000
    filestore_fd_cache_random = true
    filestore_queue_max_ops = 5000
    journal_queue_max_ops = 1000000
    max_open_files = 1000000
    osd journal size = 10000

    --
    ======================
    Jean-Philippe Méthot
    Administrateur système / System administrator
    GloboTech Communications
    Phone: 1-514-907-0050
    Toll Free: 1-(888)-GTCOMM1
    Fax: 1-(514)-907-0750
    jpmethot@xxxxxxxxxx
    http://www.gtcomm.net

    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
