Re: Recommendation for decent write latency performance from HDDs

On 12/04/2020 20:35, huxiaoyu@xxxxxxxxxxxx wrote:


That said, with a recent kernel such as a 4.19 stable release and a decent enterprise SSD such as the Intel D4510/4610, I do not need to worry about data safety with dm-writecache.

Thanks a lot.

Samuel


The patch recently went into 5.4 and was backported to 4.18+, so you need to check that your kernel version has it.

This guarantees that a FUA (force unit access)/sync write generated by the OSD will end up on media before a successful return. Enterprise drives, and any drive that supports PLP (power loss protection), guarantee that stored data will not be corrupted by a power failure.
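
To picture what "on media before a successful return" means, here is a minimal Python sketch of a sync write from userspace: O_DSYNC asks the kernel not to complete the write until the data is durable, which is the durability contract the FUA handling preserves for the OSD's sync writes. The scratch file path is just an illustration, not anything Ceph itself uses.

    import os

    # Scratch file standing in for a block on a dm-writecache backed volume.
    path = "/tmp/dsync-demo.bin"

    # O_DSYNC: each write() returns only after the data reaches stable media
    # (via FUA or an explicit cache flush underneath).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        os.write(fd, b"\0" * 4096)  # durable once this call returns
    finally:
        os.close(fd)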

If you wish, you can install PetaSAN to test cache performance yourself.

/Maged


------------------------------------------------------------------------
huxiaoyu@xxxxxxxxxxxx

    *From:* Maged Mokhtar <mailto:mmokhtar@xxxxxxxxxxx>
    *Date:* 2020-04-12 20:03
    *To:* huxiaoyu@xxxxxxxxxxxx <mailto:huxiaoyu@xxxxxxxxxxxx>; Reed
    Dier <mailto:reed.dier@xxxxxxxxxxx>; jesper <mailto:jesper@xxxxxxxx>
    *CC:* ceph-users <mailto:ceph-users@xxxxxxx>
    *Subject:* Re: Re: Recommendation for decent write latency performance from HDDs


    On 12/04/2020 18:10, huxiaoyu@xxxxxxxxxxxx wrote:
    Dear Maged Mokhtar,

    It is very interesting to know that your experiments show that
    dm-writecache is better than the other alternatives. I have two
    questions:

    Yes, much better.


    1. Can one cache device serve multiple HDDs? I know bcache can do
    this, which is convenient. I don't know whether dm-writecache has
    such a feature.

    It works on a partition, so you can split your cache disk into
    several partitions to support multiple OSDs; in our UI we allow
    from 1 to 8 partitions.
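
    As an illustration only (the device name and partition count are
    assumptions, and PetaSAN does this through its UI), splitting one
    SSD into equal cache partitions could look like this in Python:

        import subprocess

        # Hypothetical NVMe device to be split into one cache partition
        # per OSD. WARNING: this wipes the existing partition table.
        dev = "/dev/nvme0n1"
        num_osds = 4

        subprocess.run(["parted", "-s", dev, "mklabel", "gpt"], check=True)
        step = 100 // num_osds
        for i in range(num_osds):
            start, end = i * step, (i + 1) * step
            subprocess.run(
                ["parted", "-s", dev, "mkpart", f"osd-cache-{i + 1}",
                 f"{start}%", f"{end}%"],
                check=True,
            )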

    2. Did you test whether write-back to disks from dm-writecache is
    power-safe or not? As far as I know, bcache does not guarantee
    power-safe writebacks, so I have to turn off the HDD write cache
    (otherwise data loss may occur).

    Get a recent kernel and ensure it has the FUA patch mentioned; this
    will correctly handle sync writes, else you may lose data. You also
    need a recent LVM tool set that supports dm-writecache. You also
    need to use an SSD with PLP support (enterprise models and some
    consumer models); some cheaper SSDs without PLP support can lose
    existing stored data on power loss, since their write cycle
    involves a read/erase/write of a block, so a power loss can erase
    already stored data on such consumer devices. We also have another
    patch (see our source) that adds mirroring of metadata in
    dm-writecache to handle this, but that is not needed for decent
    drives.
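
    A rough prerequisite check along those lines, sketched in Python
    (the exact minimum versions are assumptions drawn from this thread,
    so verify them against your distribution):

        import platform
        import subprocess

        # Kernel: needs the backported FUA fix mentioned in this thread.
        print("kernel:", platform.release())

        # Is the dm-writecache module available?
        mod = subprocess.run(["modinfo", "dm_writecache"],
                             capture_output=True, text=True)
        print("dm-writecache module present:", mod.returncode == 0)

        # LVM tool set: --type writecache needs a recent lvm2 (2.03.x or so).
        lvm = subprocess.run(["lvm", "version"],
                             capture_output=True, text=True)
        print(lvm.stdout)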

    Best regards,

    Samuel




    ------------------------------------------------------------------------
    huxiaoyu@xxxxxxxxxxxx

        *From:* Maged Mokhtar <mailto:mmokhtar@xxxxxxxxxxx>
        *Date:* 2020-04-12 16:45
        *To:* Reed Dier <mailto:reed.dier@xxxxxxxxxxx>; jesper
        <mailto:jesper@xxxxxxxx>
        *CC:* ceph-users <mailto:ceph-users@xxxxxxx>
        *Subject:* Re: Recommendation for decent write latency performance from HDDs
        On 10/04/2020 23:17, Reed Dier wrote:
        > Going to resurrect this thread to provide another option:
        >
        > LVM-cache, ie putting a cache device in-front of the bluestore-LVM LV.
        >
        > I only mention this because I noticed it in the SUSE documentation
        > for SES6 (based on Nautilus) here:
        > https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
        In the PetaSAN project we support dm-writecache and it works very
        well. We have done tests with other cache devices such as bcache
        and dm-cache, and dm-writecache is much better. It is mainly a
        write cache; reads are served from the cache device if the data is
        present, but reads are not promoted from the slow device. Typically
        with HDD clusters, write latency is the issue; reads are helped by
        the OSD cache and, in the case of replicated pools, are much faster
        anyway.
        You need a recent kernel; we have an upstreamed patch:
        https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/md/dm-writecache.c?h=v4.19.114&id=10b9bf59bab1018940e8949c6861d1a7fb0393a1
        In addition, depending on your distribution, you may need an
        updated LVM tool set.
        /Maged
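
        As a rough sketch of how such a cache is typically attached with
        the LVM tool set (all names here are hypothetical, the OSD must be
        stopped first, and this is not the PetaSAN tooling itself):

            import subprocess

            def run(*cmd):
                print("+", " ".join(cmd))
                subprocess.run(cmd, check=True)

            # "vg-osd0" holds the HDD-backed OSD LV "osd-block-0";
            # /dev/nvme0n1p1 is one partition of the PLP-protected SSD.
            run("pvcreate", "/dev/nvme0n1p1")
            run("vgextend", "vg-osd0", "/dev/nvme0n1p1")
            run("lvcreate", "-n", "cache0", "-l", "100%PVS",
                "vg-osd0", "/dev/nvme0n1p1")
            run("lvconvert", "-y", "--type", "writecache",
                "--cachevol", "cache0", "vg-osd0/osd-block-0")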
        >
        >>   * If you plan to use a fast drive as an LVM cache for multiple
        >>     OSDs, be aware that all OSD operations (including replication)
        >>     will go through the caching device. All reads will be queried
        >>     from the caching device, and are only served from the slow
        >>     device in case of a cache miss. Writes are always applied to
        >>     the caching device first, and are flushed to the slow device
        >>     at a later time ('writeback' is the default caching mode).
        >>
        >>   * When deciding whether to utilize an LVM cache, verify whether
        >>     the fast drive can serve as a front for multiple OSDs while
        >>     still providing an acceptable amount of IOPS. You can test it
        >>     by measuring the maximum amount of IOPS that the fast device
        >>     can serve, and then dividing the result by the number of OSDs
        >>     behind the fast device. If the result is lower or close to
        >>     the maximum amount of IOPS that the OSD can provide without
        >>     the cache, LVM cache is probably not suited for this setup.
        >>
        >>   * The interaction of the LVM cache device with OSDs is
        >>     important. Writes are periodically flushed from the caching
        >>     device to the slow device. If the incoming traffic is
        >>     sustained and significant, the caching device will struggle
        >>     to keep up with incoming requests as well as the flushing
        >>     process, resulting in performance drop. Unless the fast
        >>     device can provide much more IOPS with better latency than
        >>     the slow device, do not use LVM cache with a sustained high
        >>     volume workload. Traffic in a burst pattern is more suited
        >>     for LVM cache as it gives the cache time to flush its dirty
        >>     data without interfering with client traffic. For a sustained
        >>     low traffic workload, it is difficult to guess in advance
        >>     whether using LVM cache will improve performance. The best
        >>     test is to benchmark and compare the LVM cache setup against
        >>     the WAL/DB setup. Moreover, as small writes are heavy on the
        >>     WAL partition, it is suggested to use the fast device for the
        >>     DB and/or WAL instead of an LVM cache.
        >>
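
        The sizing heuristic quoted above is simple arithmetic; a small
        Python sketch with made-up example numbers (not measurements):

            # Measured maximum IOPS of the fast (cache) device.
            fast_device_iops = 200_000
            # Number of OSDs sharing that device.
            osds_behind_cache = 8
            # IOPS one HDD OSD delivers without any cache.
            hdd_osd_iops = 300

            per_osd_share = fast_device_iops / osds_behind_cache
            print(f"IOPS per OSD from the cache device: {per_osd_share:.0f}")
            if per_osd_share <= hdd_osd_iops:
                print("LVM cache is probably not suited for this setup.")
            else:
                print("The cache device leaves headroom over a bare HDD OSD.")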
        >
        > So it sounds like you could partition your NVMe for either
        > LVM-cache, DB/WAL, or both?
        >
        > Just figured this sounded a bit more akin to what you were looking
        > for in your original post and figured I would share.
        >
        > I don't use this, but figured I would share it.
        >
        > Reed
        >
        >> On Apr 4, 2020, at 9:12 AM, jesper@xxxxxxxx <mailto:jesper@xxxxxxxx> wrote:
        >>
        >> Hi.
        >>
        >> We have a need for "bulk" storage - but with decent write
        >> latencies. Normally we would do this with a DAS with a RAID5 and
        >> a 2 GB battery-backed write cache in front - as cheap as
        >> possible, but still getting the scalability features of Ceph.
        >>
        >> In our "first" Ceph cluster we did the same - just stuffed BBWC
        >> into the OSD nodes and we're fine - but now we're onto the next
        >> one, and systems like
        >> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
        >> do not support a RAID controller like that - but are branded as
        >> "Ceph Storage Solutions".
        >>
        >> It does however support 4 NVMe slots in the front - so some level
        >> of "tiering" using the NVMe drives seems to be what is
        >> "suggested" - but what do people do? What is recommended? I see
        >> multiple options:
        >>
        >> Ceph tiering at the "pool" layer:
        >> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
        >> and rumors that it is "deprecated":
        >> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
        >>
        >> Pro: Abstract layer
        >> Con: Deprecated? Lots of warnings?
        >>
        >> Offloading the block.db onto NVMe / SSD:
        >> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
        >>
        >> Pro: Easy to deal with - seems heavily supported.
        >> Con: As far as I can tell, this will only benefit the metadata of
        >> the OSD, not the actual data. Thus a data commit to the OSD will
        >> still be dominated by the write latency of the underlying - very
        >> slow - HDD.
        >>
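        For reference, the block.db offload above is usually deployed with
        ceph-volume roughly as follows (the device paths are hypothetical):

            import subprocess

            # HDD for the data, NVMe partition for the RocksDB metadata
            # (block.db).
            subprocess.run(
                ["ceph-volume", "lvm", "create", "--bluestore",
                 "--data", "/dev/sdb",
                 "--block.db", "/dev/nvme0n1p1"],
                check=True,
            )
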
        >> Bcache:
        >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
        >>
        >> Pro: Closest to the BBWC mentioned above - but with way, way
        >> larger cache sizes.
        >> Con: It is hard to tell whether I would end up being the only one
        >> on the planet using this solution.
        >>
        >> Eat it - writes will be as slow as hitting dead rust; anything
        >> that cannot live with that needs to be entirely on SSD/NVMe.
        >>
        >> Other?
        >>
        >> Thanks for your input.
        >>
        >> Jesper

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



