On 10/04/2020 23:17, Reed Dier wrote:
Going to resurrect this thread to provide another option:
LVM-cache, i.e. putting a cache device in front of the bluestore-LVM LV.
I only mention this because I noticed it in the SUSE documentation for
SES6 (based on Nautilus) here:
https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
In the PetaSAN project we support dm-writecache and it works very well. We
ran tests against other cache devices such as bcache and dm-cache, and
dm-writecache came out much better. It is mainly a write cache: reads are
served from the cache device if the data is present, but reads from the slow
device are not promoted into the cache. Typically with HDD clusters write
latency is the issue; reads are helped by the OSD cache and, in the case of
replicated pools, are much faster anyway.
You need a recent kernel; we have an upstreamed patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/md/dm-writecache.c?h=v4.19.114&id=10b9bf59bab1018940e8949c6861d1a7fb0393a1
Also, depending on your distribution, you may need an updated LVM toolset.
/Maged
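
(Not from PetaSAN itself - just a minimal sketch of what attaching a
dm-writecache volume with the LVM toolset can look like. The VG/LV names and
device paths are hypothetical; it assumes lvm2 >= 2.03 with writecache support
and that the OSD is stopped before its LV is converted.)

#!/usr/bin/env python3
"""Minimal sketch: attach an NVMe-backed dm-writecache LV in front of an
HDD-backed OSD LV via LVM. All names and devices below are placeholders."""
import subprocess

VG = "ceph-vg"            # hypothetical volume group spanning the HDD and NVMe PVs
ORIGIN_LV = "osd-block"   # hypothetical slow (HDD) LV backing the OSD
CACHE_LV = "osd-wcache"   # fast LV to carve out of the NVMe PV
NVME_PV = "/dev/nvme0n1"  # hypothetical fast physical volume


def run(cmd):
    """Echo and execute an LVM command, aborting on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Create the fast LV on the NVMe physical volume.
run(["lvcreate", "-n", CACHE_LV, "-L", "100G", VG, NVME_PV])

# 2. Attach it as a dm-writecache in front of the OSD LV (OSD stopped first).
#    Requires an lvm2 toolset that knows --type writecache (>= 2.03).
run(["lvconvert", "--yes", "--type", "writecache",
     "--cachevol", CACHE_LV, f"{VG}/{ORIGIN_LV}"])
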
* If you plan to use a fast drive as an LVM cache for multiple
OSDs, be aware that all OSD operations (including replication)
will go through the caching device. All reads will be queried
from the caching device, and are only served from the slow device
in case of a cache miss. Writes are always applied to the caching
device first, and are flushed to the slow device at a later time
('writeback' is the default caching mode).
* When deciding whether to utilize an LVM cache, verify whether the
fast drive can serve as a front for multiple OSDs while still
providing an acceptable number of IOPS. You can test this by
measuring the maximum number of IOPS that the fast device can
serve, and then dividing the result by the number of OSDs behind
the fast device. If the result is lower than or close to the maximum
number of IOPS that the OSD can provide without the cache, LVM
cache is probably not suited for this setup (see the sketch after
this list).
* The interaction of the LVM cache device with OSDs is
important. Writes are periodically flushed from the caching
device to the slow device. If the incoming traffic is sustained
and significant, the caching device will struggle to keep up with
incoming requests as well as the flushing process, resulting in a
performance drop. Unless the fast device can provide much more
IOPS with better latency than the slow device, do not use LVM
cache with a sustained high volume workload. Traffic in a burst
pattern is more suited for LVM cache as it gives the cache time
to flush its dirty data without interfering with client traffic.
For a sustained low traffic workload, it is difficult to guess in
advance whether using LVM cache will improve performance. The
best test is to benchmark and compare the LVM cache setup against
the WAL/DB setup. Moreover, as small writes are heavy on the WAL
partition, it is suggested to use the fast device for the DB
and/or WAL instead of an LVM cache.
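
To make the sizing check in the second bullet concrete, here is a small
sketch; the figures are placeholders, not benchmarks:

# Sizing check from the second bullet above: divide the fast device's measured
# IOPS by the number of OSDs behind it and compare against what a single OSD
# achieves without a cache.
def lvm_cache_worthwhile(fast_dev_iops: float, num_osds: int,
                         osd_iops_without_cache: float) -> bool:
    """True if the fast device leaves clear per-OSD headroom over a bare OSD."""
    per_osd_share = fast_dev_iops / num_osds
    # "lower than or close to" the bare-OSD figure means the cache likely won't help
    return per_osd_share > osd_iops_without_cache


# Example with made-up figures: a 200k IOPS NVMe in front of 12 HDD OSDs that
# each manage ~300 IOPS on their own -> ~16.6k IOPS per OSD, plenty of headroom.
print(lvm_cache_worthwhile(200_000, 12, 300))  # True
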
So it sounds like you could partition your NVMe for either LVM-cache,
DB/WAL, or both?
Just figured this sounded a bit more akin to what you were looking for
in your original post. I don't use this myself, but figured I would share it.
Reed
On Apr 4, 2020, at 9:12 AM, jesper@xxxxxxxx
wrote:
Hi.
We have a need for "bulk" storage - but with decent write latencies.
Normally we would do this with a DAS running RAID 5 with a 2 GB battery-backed
write cache in front - as cheap as possible, but still getting the
scalability features of Ceph.
In our "first" Ceph cluster we did the same - just stuffed BBWC
into the OSD nodes and we're fine - but now we're onto the next one, and
systems like:
https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
do not support a RAID controller like that - but are branded as "Ceph
Storage Solutions".
They do, however, support 4 NVMe slots in the front - so some level of
"tiering" using the NVMe drives seems to be what is "suggested" - but what
do people do? What is recommended? I see multiple options:
Ceph tiering at the "pool" layer:
https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
And rumors that it is "deprecated":
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
Pro: Abstract layer
Con: Deprecated? - Lots of warnings?
Offloading the block.db on NVMe / SSD:
https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
Pro: Easy to deal with - seems heavily supported.
Con: As far as I can tell, this will only benefit the metadata of the
OSD - not the actual data. Thus a data commit to the OSD will still be
dominated by the write latency of the underlying - very slow - HDD.
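
(For reference - a minimal sketch of this variant, not from the thread:
BlueStore data on the HDD, block.db on an NVMe partition via ceph-volume.
The device paths below are hypothetical, not a recommendation.)

#!/usr/bin/env python3
"""Sketch of the block.db offload: BlueStore data on an HDD, RocksDB (block.db)
on an NVMe partition. Device paths are placeholders."""
import subprocess

HDD = "/dev/sdb"            # hypothetical slow data device
NVME_DB = "/dev/nvme0n1p1"  # hypothetical NVMe partition reserved for block.db

# ceph-volume puts the data on the HDD and block.db on the NVMe partition;
# the WAL follows the DB unless it is given its own --block.wal device.
subprocess.run(["ceph-volume", "lvm", "create",
                "--bluestore",
                "--data", HDD,
                "--block.db", NVME_DB],
               check=True)
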
Bcache:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
Pro: Closest to the BBWC mentioned above - but with way, way larger cache
sizes.
Con: It is hard to tell whether I would end up being the only one on the
planet using this solution.
Eat it - writes will be as slow as hitting dead rust - anything that
cannot live with that needs to be entirely on SSD/NVMe.
Other?
Thanks for your input.
Jesper
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx