That said, with a recent kernel such as the 4.19 stable release and
a decent enterprise SSD such as the Intel D4510/4610, I do not need
to worry about data safety with dm-writecache.
thanks a lot.
samuel
------------------------------------------------------------------------
huxiaoyu@xxxxxxxxxxxx
*From:* Maged Mokhtar <mailto:mmokhtar@xxxxxxxxxxx>
*Date:* 2020-04-12 20:03
*To:* huxiaoyu@xxxxxxxxxxxx <mailto:huxiaoyu@xxxxxxxxxxxx>;
Reed Dier <mailto:reed.dier@xxxxxxxxxxx>; jesper
<mailto:jesper@xxxxxxxx>
*CC:* ceph-users <mailto:ceph-users@xxxxxxx>
*Subject:* Re: Re: Recommendation for decent
write latency performance from HDDs
On 12/04/2020 18:10, huxiaoyu@xxxxxxxxxxxx wrote:
Dear Maged Mokhtar,
It is very interesting to know that your experiments show
dm-writecache performing better than the other alternatives. I
have two questions:
Yes, much better.
1. Can one cache device serve multiple HDDs? I know bcache
can do this, which is convenient. I don't know whether
dm-writecache has such a feature.
It works on a partition, so you can split your cache disk into
several partitions to support multiple OSDs; in our UI we
allow from 1 to 8 partitions.
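For illustration only (a sketch, not the PetaSAN tooling; the device,
VG and LV names below are made up), one partition of the cache SSD can
be attached as a writecache to the data LV of one OSD using a recent LVM:

  # /dev/nvme0n1p1 = one partition of the cache SSD,
  # /dev/sdb = the HDD behind one OSD (example names)
  pvcreate /dev/sdb /dev/nvme0n1p1
  vgcreate vg_osd0 /dev/sdb /dev/nvme0n1p1
  lvcreate -n osd0 -l 100%PVS vg_osd0 /dev/sdb                # data LV on the HDD
  lvcreate -n osd0_cache -l 100%PVS vg_osd0 /dev/nvme0n1p1    # cache LV on the SSD partition
  lvconvert --type writecache --cachevol osd0_cache vg_osd0/osd0
  # repeat with the next SSD partition (nvme0n1p2, ...) for the next OSD

The cached LV (vg_osd0/osd0) would then be the block device the OSD is
created on.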
2. Did you test whether write-back to disks from
dm-writecache is power-safe? As far as I know, bcache
does not guarantee power-safe writebacks, so I have to turn
off the HDD write cache (otherwise data loss may occur).
Get a recent kernel and ensure it has the FUA patch
mentioned; this will correctly handle sync writes, otherwise
you may lose data. You also need a recent LVM tool set that
supports dm-writecache, and you need to use an SSD with PLP
(power-loss protection) support (enterprise models and some
consumer models). Some cheaper SSDs without PLP can lose
already stored data on power loss, since their write cycle
involves a read/erase/write of a block, so a power loss can
erase previously stored data on such consumer devices. We also
have another patch (see our source) that adds mirroring of
metadata to dm-writecache to handle this, but that is not
needed for decent drives.
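As an aside, a minimal sketch of the checks mentioned in this exchange
(device names are examples; how PLP is reported varies by vendor, so the
datasheet is the authoritative source):

  hdparm -W 0 /dev/sdb        # turn off the HDD's volatile write cache
  smartctl -a /dev/nvme0n1    # inspect the SSD; confirm PLP against the vendor datasheet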
best regards,
samuel
------------------------------------------------------------------------
huxiaoyu@xxxxxxxxxxxx
*From:* Maged Mokhtar <mailto:mmokhtar@xxxxxxxxxxx>
*Date:* 2020-04-12 16:45
*To:* Reed Dier <mailto:reed.dier@xxxxxxxxxxx>; jesper
<mailto:jesper@xxxxxxxx>
*CC:* ceph-users <mailto:ceph-users@xxxxxxx>
*Subject:* Re: Recommendation for decent
write latency performance from HDDs
On 10/04/2020 23:17, Reed Dier wrote:
> Going to resurrect this thread to provide another option:
>
> LVM-cache, i.e. putting a cache device in front of the bluestore-LVM LV.
>
> I only mention this because I noticed it in the SUSE documentation
> for SES6 (based on Nautilus) here:
> https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
In the PetaSAN project, we support dm-writecache and it
works very well. We have done tests with other cache devices
such as bcache and dm-cache, and dm-writecache is much better.
It is mainly a write cache; reads are served from the cache
device if the data is present, but it does not promote reads
from the slow device. Typically with HDD clusters, write
latency is the issue; reads are helped by the OSD cache and,
in the case of replicated pools, are much faster anyway.
You need a recent kernel; we have an upstreamed patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/md/dm-writecache.c?h=v4.19.114&id=10b9bf59bab1018940e8949c6861d1a7fb0393a1
In addition, depending on your distribution, you may need an
updated LVM tool set.
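A quick way to sanity-check both requirements (a sketch; exact minimum
versions depend on your distribution's backports):

  uname -r                        # kernel new enough to include the FUA fix above
  modinfo dm-writecache           # dm-writecache module available?
  lvm version                     # LVM 2.03+ is generally needed for writecache
  lvm segtypes | grep writecache  # if supported, 'writecache' should be listed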
/Maged
>
>> * If you plan to use a fast drive as an LVM cache for multiple
>>   OSDs, be aware that all OSD operations (including replication)
>>   will go through the caching device. All reads will be queried
>>   from the caching device, and are only served from the slow
>>   device in case of a cache miss. Writes are always applied to
>>   the caching device first, and are flushed to the slow device at
>>   a later time ('writeback' is the default caching mode).
>> * When deciding whether to utilize an LVM cache, verify whether
>>   the fast drive can serve as a front for multiple OSDs while
>>   still providing an acceptable amount of IOPS. You can test it
>>   by measuring the maximum amount of IOPS that the fast device
>>   can serve, and then dividing the result by the number of OSDs
>>   behind the fast device. If the result is lower or close to the
>>   maximum amount of IOPS that the OSD can provide without the
>>   cache, LVM cache is probably not suited for this setup.
>>
>> * The interaction of the LVM cache device with OSDs is important.
>>   Writes are periodically flushed from the caching device to the
>>   slow device. If the incoming traffic is sustained and
>>   significant, the caching device will struggle to keep up with
>>   incoming requests as well as the flushing process, resulting in
>>   performance drop. Unless the fast device can provide much more
>>   IOPS with better latency than the slow device, do not use LVM
>>   cache with a sustained high volume workload. Traffic in a burst
>>   pattern is more suited for LVM cache as it gives the cache time
>>   to flush its dirty data without interfering with client
>>   traffic. For a sustained low traffic workload, it is difficult
>>   to guess in advance whether using LVM cache will improve
>>   performance. The best test is to benchmark and compare the LVM
>>   cache setup against the WAL/DB setup. Moreover, as small writes
>>   are heavy on the WAL partition, it is suggested to use the fast
>>   device for the DB and/or WAL instead of an LVM cache.
>>
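For the sizing test described in the quoted SUSE text, a rough sketch
(example device name and fio job parameters; note that this writes to the
raw device and destroys any data on it):

  fio --name=nvme-randwrite --filename=/dev/nvme0n1 --ioengine=libaio \
      --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
      --runtime=60 --time_based --group_reporting
  # divide the reported IOPS by the number of OSDs sharing the device and
  # compare with what a bare HDD OSD can do (on the order of 100-200 IOPS)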
>
> So it sounds like you could partition your NVMe for either
> LVM-cache, DB/WAL, or both?
>
> Just figured this sounded a bit more akin to what you were looking
> for in your original post.
>
> I don't use this myself, but figured I would share it.
>
> Reed
>
>> On Apr 4, 2020, at 9:12 AM, jesper@xxxxxxxx <mailto:jesper@xxxxxxxx>
>> wrote:
>>
>> Hi.
>>
>> We have a need for "bulk" storage, but with decent write latencies.
>> Normally we would do this with a DAS with a RAID 5 and a 2 GB
>> battery-backed write cache in front - as cheap as possible, but
>> still getting the scalability features of Ceph.
>>
>> In our "first" Ceph cluster we did the same - just stuffed BBWC
>> into the OSD nodes and we were fine - but now we're onto the next
>> one, and systems like:
>> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
>> do not support a RAID controller like that, yet are branded as
>> "Ceph Storage Solutions".
>>
>> It does, however, support 4 NVMe slots in the front - so some level
>> of "tiering" using the NVMe drives seems to be what is "suggested" -
>> but what do people do? What is recommended? I see multiple options:
>>
>> Ceph tiering at the "pool" layer:
>> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>> And rumors that it is "deprecated":
>> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
>>
>> Pro: Abstraction layer
>> Con: Deprecated? Lots of warnings?
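For reference, the pool-layer approach looks roughly like this (a sketch
with made-up pool names; a CRUSH rule placing the hot pool on the NVMe
OSDs would also be needed, and given the deprecation warnings above this
is probably not the option to pick):

  ceph osd pool create hot-pool 128            # pool intended for the NVMe OSDs
  ceph osd tier add bulk-pool hot-pool         # attach it in front of the HDD pool
  ceph osd tier cache-mode hot-pool writeback
  ceph osd tier set-overlay bulk-pool hot-pool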
>>
>> Offloading block.db onto NVMe / SSD:
>> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>>
>> Pro: Easy to deal with - seems heavily supported.
>> Con: As far as I can tell, this will only benefit the metadata of
>> the OSD, not the actual data. Thus a data commit to the OSD will
>> still be dominated by the write latency of the underlying - very
>> slow - HDD.
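A sketch of what that offload looks like when creating an OSD (example
device names; block.wal shares the DB device unless given separately):

  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1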
>>
>> Bcache:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
>>
>> Pro: Closest to the BBWC mentioned above, but with way, way larger
>> cache sizes.
>> Con: It is hard to tell whether I would end up being the only one
>> on the planet using this solution.
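For completeness, a bcache setup sketch (example device names; <cset-uuid>
is the cache-set UUID printed by make-bcache or shown by bcache-super-show):

  make-bcache -C /dev/nvme0n1p1                 # create the cache set on the SSD partition
  make-bcache -B /dev/sdb /dev/sdc              # format the backing HDDs
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach     # attach each bcacheN to the cache set
  echo writeback > /sys/block/bcache0/bcache/cache_mode   # writeback is what helps write latency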
>>
>> Eat it - writes will be as slow as hitting dead rust; anything that
>> cannot live with that needs to be entirely on SSD/NVMe.
>>
>> Other?
>>
>> Thanks for your input.
>>
>> Jesper
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx