My main problem with LVM cache was always the unpredictable performance.
It's *very* hard to benchmark properly even in a synthetic setup, and even
harder to guess anything about a real-world workload. And testing out both
configurations for a real-world setup is often not feasible, especially as
usage patterns change over the lifetime of a cluster.

Does anyone have any real-world experience with LVM cache?

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Fri, Apr 10, 2020 at 11:19 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
>
> Going to resurrect this thread to provide another option:
>
> LVM cache, i.e. putting a cache device in front of the BlueStore LVM LV.
>
> I only mention this because I noticed it in the SUSE documentation for SES6
> (based on Nautilus) here:
> https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
>
> If you plan to use a fast drive as an LVM cache for multiple OSDs, be aware
> that all OSD operations (including replication) will go through the caching
> device. All reads will be queried from the caching device, and are only
> served from the slow device in case of a cache miss. Writes are always
> applied to the caching device first, and are flushed to the slow device at
> a later time ('writeback' is the default caching mode).
>
> When deciding whether to utilize an LVM cache, verify whether the fast
> drive can serve as a front for multiple OSDs while still providing an
> acceptable amount of IOPS. You can test it by measuring the maximum amount
> of IOPS that the fast device can serve, and then dividing the result by the
> number of OSDs behind the fast device. If the result is lower than or close
> to the maximum amount of IOPS that the OSD can provide without the cache,
> LVM cache is probably not suited for this setup.
>
> The interaction of the LVM cache device with OSDs is important. Writes are
> periodically flushed from the caching device to the slow device. If the
> incoming traffic is sustained and significant, the caching device will
> struggle to keep up with incoming requests as well as the flushing process,
> resulting in a performance drop. Unless the fast device can provide much
> more IOPS with better latency than the slow device, do not use LVM cache
> with a sustained high-volume workload. Traffic in a burst pattern is more
> suited for LVM cache, as it gives the cache time to flush its dirty data
> without interfering with client traffic. For a sustained low-traffic
> workload, it is difficult to guess in advance whether using LVM cache will
> improve performance. The best test is to benchmark and compare the LVM
> cache setup against the WAL/DB setup. Moreover, as small writes are heavy
> on the WAL partition, it is suggested to use the fast device for the DB
> and/or WAL instead of an LVM cache.
>
> So it sounds like you could partition your NVMe for either LVM cache,
> DB/WAL, or both?
>
> Just figured this sounded a bit more akin to what you were looking for in
> your original post and thought I would share. I don't use this myself, but
> figured it was worth mentioning.
>
> Reed
>
> On Apr 4, 2020, at 9:12 AM, jesper@xxxxxxxx wrote:
>
> Hi.
>
> We have a need for "bulk" storage - but with decent write latencies.
> Normally we would do this with a DAS with a RAID5 and a 2GB battery-backed
> write cache in front - as cheap as possible while still getting the
> scalability features of Ceph.
>
> In our "first" Ceph cluster we did the same - just stuffed BBWC into the
> OSD nodes and we're fine - but now we're onto the next one, and a system
> like this:
> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
> does not support a RAID controller like that - but is branded for "Ceph
> Storage Solutions".
>
> It does, however, support 4 NVMe slots in the front - so some level of
> "tiering" using the NVMe drives seems to be what is "suggested" - but what
> do people do? What is recommended? I see multiple options:
>
> Ceph tiering at the pool layer:
> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> And rumors that it is deprecated:
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
>
> Pro: Abstract layer.
> Con: Deprecated? Lots of warnings?
>
> Offloading the block.db onto NVMe/SSD:
> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>
> Pro: Easy to deal with - seems heavily supported.
> Con: As far as I can tell, this will only benefit the metadata of the OSD,
> not the actual data. Thus a data commit to the OSD will still be dominated
> by the write latency of the underlying, very slow HDD.
>
> Bcache:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
>
> Pro: Closest to the BBWC mentioned above - but with way, way larger cache
> sizes.
> Con: It is hard to tell whether I would end up being the only one on the
> planet using this solution.
>
> Eat it - writes will be as slow as hitting dead rust - and anything that
> cannot live with that needs to be entirely on SSD/NVMe.
>
> Other?
>
> Thanks for your input.
>
> Jesper
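
For reference, here is a minimal back-of-the-envelope sketch of the sanity
check the SUSE documentation quoted above describes: measure the fast
device's IOPS, divide by the number of OSDs behind it, and compare the
per-OSD share with what a bare HDD OSD already delivers. All numbers and the
1.5x "close to" margin below are placeholder assumptions, not measurements;
plug in values from your own runs (e.g. 4k random-write fio results) against
the actual NVMe and HDDs.

    # Rough sketch of the SES6 rule of thumb; placeholder numbers only.

    def per_osd_share(iops_fast: float, num_osds: int) -> float:
        """IOPS of the fast device divided by the number of OSDs it fronts."""
        return iops_fast / num_osds

    def lvmcache_looks_useful(iops_fast: float, num_osds: int,
                              iops_hdd_osd: float, margin: float = 1.5) -> bool:
        """If the per-OSD share is lower than or close to what an uncached
        HDD OSD already provides, LVM cache is probably not worth it.
        The 1.5x margin is an arbitrary threshold; pick your own."""
        return per_osd_share(iops_fast, num_osds) > margin * iops_hdd_osd

    if __name__ == "__main__":
        iops_fast = 200_000   # assumed NVMe 4k random-write IOPS
        num_osds = 12         # HDD OSDs sitting behind this one NVMe
        iops_hdd_osd = 300    # assumed IOPS of one HDD OSD without a cache

        share = per_osd_share(iops_fast, num_osds)
        print(f"per-OSD share of the fast device: {share:.0f} IOPS")
        if lvmcache_looks_useful(iops_fast, num_osds, iops_hdd_osd):
            print("ratio looks favourable; LVM cache may be worth benchmarking")
        else:
            print("share too close to a bare HDD OSD; LVM cache probably not worth it")

Of course this only captures the steady-state IOPS ratio; it says nothing
about the flushing behaviour under sustained writes that the quoted text
warns about, so benchmarking both the LVM cache and the WAL/DB setup on real
hardware remains the only real test.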