Re: Recommendation for decent write latency performance from HDDs

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Sun, 12 Apr 2020 20:03:49 +0200

On 12/04/2020 18:10, huxiaoyu@xxxxxxxxxxxx wrote:
Dear Maged Mokhtar,

It is very interesting that your experiments show dm-writecache
to be better than the other alternatives. I have two
questions:

Yes, much better than bcache and dm-cache in our tests.

1. Can one cache device serve multiple HDDs? I know bcache can do
this, which is convenient. I don't know whether dm-writecache has such a
feature.

It works on a partition, so you can split your cache disk into several
partitions to support multiple OSDs. In our UI we allow from 1 to 8 partitions.
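
To make the sizing concrete, here is a minimal sketch of splitting one cache disk into per-OSD partitions. It only does the arithmetic and does not touch any device. The helper name `split_cache_device`, the equal split, and the 1 MiB alignment are assumptions for illustration; only the 1 to 8 range comes from the UI limit mentioned above.

```python
from typing import List

MIB = 1024 * 1024


def split_cache_device(device_bytes: int, osd_count: int,
                       align_bytes: int = MIB) -> List[int]:
    """Return one equal, aligned partition size (in bytes) per OSD.

    Hypothetical helper: the equal split and the 1 MiB alignment are
    assumptions for illustration, not dm-writecache requirements.
    """
    # The 1-8 range mirrors the limit mentioned for the UI.
    if not 1 <= osd_count <= 8:
        raise ValueError("osd_count must be between 1 and 8")
    # Give each OSD an equal share, rounded down to the alignment boundary.
    per_partition = (device_bytes // osd_count) // align_bytes * align_bytes
    if per_partition == 0:
        raise ValueError("device too small for the requested partition count")
    return [per_partition] * osd_count


if __name__ == "__main__":
    # Example: a 400 GB cache SSD shared by 4 HDD-backed OSDs.
    for index, size in enumerate(split_cache_device(400 * 10**9, 4), start=1):
        print(f"partition {index}: {size // MIB} MiB")
```

Each resulting partition would then back the write cache of exactly one OSD, so the per-OSD cache size shrinks as you put more OSDs behind the same disk.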

2. Did you test whether write-back to disks from dm-writecache is
power-safe? As far as I know, bcache does not guarantee power-safe
writebacks, so I have to turn off the HDD write cache (otherwise data
loss may occur).

Get a recent kernel and ensure it has the FUA patch mentioned in my
earlier message (quoted below); with it, sync writes are handled
correctly, otherwise you may lose data. You also need a recent LVM tool
set that supports dm-writecache. You also need to use an SSD with PLP
(power-loss protection) support (enterprise models and some consumer
models). Some cheaper SSDs without PLP can lose existing stored data on
power loss: their write cycle involves a read/erase/write of a block, so
a power loss can erase already stored data on such consumer devices. We
also have another patch (see our source) that adds mirroring of metadata
to dm-writecache to handle this, but that is not needed for decent drives.

best regards,

samuel

------------------------------------------------------------------------
huxiaoyu@xxxxxxxxxxxx

    *From:* Maged Mokhtar <mailto:mmokhtar@xxxxxxxxxxx>
    *Date:* 2020-04-12 16:45
    *To:* Reed Dier <mailto:reed.dier@xxxxxxxxxxx>; jesper
    <mailto:jesper@xxxxxxxx>
    *CC:* ceph-users <mailto:ceph-users@xxxxxxx>
    *Subject:* Re: Recommendation for decent write latency
    performance from HDDs
    On 10/04/2020 23:17, Reed Dier wrote:
    > Going to resurrect this thread to provide another option:
    >
    > LVM-cache, i.e. putting a cache device in front of the
    > bluestore-LVM LV.
    >
    > I only mention this because I noticed it in the SUSE documentation
    > for SES6 (based on Nautilus) here:
    > https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
    In the PetaSAN project we support dm-writecache and it works very
    well. We have tested other cache devices such as bcache and dm-cache,
    and dm-writecache is much better. It is mainly a write cache: reads
    are served from the cache device if the data is present there, but
    reads from the slow device are not promoted into the cache. Typically
    with HDD clusters write latency is the issue; reads are helped by the
    OSD cache and, in the case of replicated pools, are much faster
    anyway.
    You need a recent kernel that includes our upstreamed patch:
    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/md/dm-writecache.c?h=v4.19.114&id=10b9bf59bab1018940e8949c6861d1a7fb0393a1
    In addition, depending on your distribution, you may need an updated
    LVM tool set.
    /Maged
    >
    >>   * If you plan to use a fast drive as an LVM cache for multiple
    >>     OSDs, be aware that all OSD operations (including replication)
    >>     will go through the caching device. All reads will be queried
    >>     from the caching device, and are only served from the slow
    >>     device in case of a cache miss. Writes are always applied to the
    >>     caching device first, and are flushed to the slow device at a
    >>     later time ('writeback' is the default caching mode).
    >>   * When deciding whether to utilize an LVM cache, verify whether
    >>     the fast drive can serve as a front for multiple OSDs while still
    >>     providing an acceptable amount of IOPS. You can test it by
    >>     measuring the maximum amount of IOPS that the fast device can
    >>     serve, and then dividing the result by the number of OSDs behind
    >>     the fast device. If the result is lower or close to the maximum
    >>     amount of IOPS that the OSD can provide without the cache, LVM
    >>     cache is probably not suited for this setup.
    >>
    >>   * The interaction of the LVM cache device with OSDs is
    >>     important. Writes are periodically flushed from the caching
    >>     device to the slow device. If the incoming traffic is sustained
    >>     and significant, the caching device will struggle to keep up with
    >>     incoming requests as well as the flushing process, resulting in
    >>     performance drop. Unless the fast device can provide much more
    >>     IOPS with better latency than the slow device, do not use LVM
    >>     cache with a sustained high volume workload. Traffic in a burst
    >>     pattern is more suited for LVM cache as it gives the cache time
    >>     to flush its dirty data without interfering with client traffic.
    >>     For a sustained low traffic workload, it is difficult to guess in
    >>     advance whether using LVM cache will improve performance. The
    >>     best test is to benchmark and compare the LVM cache setup against
    >>     the WAL/DB setup. Moreover, as small writes are heavy on the WAL
    >>     partition, it is suggested to use the fast device for the DB
    >>     and/or WAL instead of an LVM cache.
    >>
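
The rule of thumb in the second bullet of the SUSE text quoted above can be written down as a small check. This is only a sketch of that arithmetic: the function names and the `margin` parameter are hypothetical, and the default of 2.0 is an arbitrary stand-in for "lower or close to", which the quoted text does not quantify. The IOPS figures in the example are made up; real values have to come from your own benchmarks.

```python
def per_osd_cache_iops(fast_device_iops: float, osd_count: int) -> float:
    """Share of the fast device's measured maximum IOPS available per OSD."""
    if osd_count < 1:
        raise ValueError("osd_count must be at least 1")
    return fast_device_iops / osd_count


def lvm_cache_looks_suitable(fast_device_iops: float, osd_count: int,
                             osd_iops_without_cache: float,
                             margin: float = 2.0) -> bool:
    """Return False when the per-OSD share is lower than or close to the
    IOPS an OSD reaches without the cache.

    'Close to' is modelled as: less than `margin` times the uncached
    figure. `margin` is an assumption, not part of the quoted guidance.
    """
    if margin < 1.0:
        raise ValueError("margin must be at least 1.0")
    share = per_osd_cache_iops(fast_device_iops, osd_count)
    return share >= margin * osd_iops_without_cache


if __name__ == "__main__":
    # Illustrative numbers only: one fast device in front of 8 HDD OSDs.
    print(per_osd_cache_iops(80_000, 8))
    print(lvm_cache_looks_suitable(80_000, 8, 200))
```

A passing check only says the fast device is not an obvious bottleneck; as the third bullet notes, sustained write traffic still has to be benchmarked against a plain WAL/DB setup.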
    >
    > So it sounds like you could partition your NVMe for either
    > LVM-cache, DB/WAL, or both?
    >
    > Just figured this sounded a bit more akin to what you were looking
    > for in your original post and figured I would share.
    >
    > I don't use this, but figured I would share it.
    >
    > Reed
    >
    >> On Apr 4, 2020, at 9:12 AM, jesper@xxxxxxxx <mailto:jesper@xxxxxxxx>
    >> wrote:
    >>
    >> Hi.
    >>
    >> We have a need for "bulk" storage - but with decent write latencies.
    >> Normally we would do this with a DAS with a RAID5 with a 2GB battery
    >> backed write cache in front - as cheap as possible but still getting
    >> the scalability features of Ceph.
    >>
    >> In our "first" Ceph cluster we did the same - just stuffed BBWC
    >> into the OSD nodes and we're fine - but now we're onto the next one,
    >> and systems like:
    >> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
    >> do not support a RAID controller like that - but are branded as for
    >> "Ceph Storage Solutions".
    >>
    >> It does however support 4 NVMe slots in the front - so some level of
    >> "tiering" using the NVMe drives should be what is "suggested" - but
    >> what do people do? What is recommended? I see multiple options:
    >>
    >> Ceph tiering at the "pool layer":
    >> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
    >> And rumors that it is "deprecated":
    >> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
    >>
    >> Pro: Abstract layer
    >> Con: Deprecated? - Lots of warnings?
    >>
    >> Offloading the block.db on NVMe / SSD:
    >> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
    >>
    >> Pro: Easy to deal with - seems heavily supported.
    >> Con: As far as I can tell, this will only benefit the metadata of
    >> the OSD, not actual data. Thus a data commit to the OSD will still be
    >> dominated by the write latency of the underlying, very slow HDD.
    >>
    >> Bcache:
    >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
    >>
    >> Pro: Closest to the BBWC mentioned above - but with way larger cache
    >> sizes.
    >> Con: It is hard to see if I end up being the only one on the planet
    >> using this solution.
    >>
    >> Eat it - writes will be as slow as hitting dead rust - anything that
    >> cannot live with that needs to be entirely on SSD/NVMe.
    >>
    >> Other?
    >>
    >> Thanks for your input.
    >>
    >> Jesper
    >> _______________________________________________
    >> ceph-users mailing list -- ceph-users@xxxxxxx
    >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
    >
    >