On 10/04/2020 23:17, Reed Dier wrote:
Going to resurrect this thread to provide another option:
LVM-cache, i.e. putting a cache device in front of the bluestore-LVM LV.
I only mention this because I noticed it in the SUSE documentation for
SES6 (based on Nautilus) here:
https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
In the PetaSAN project we support dm-writecache and it works very well. We
ran tests against other cache devices such as bcache and dm-cache, and
dm-writecache came out much better. It is mainly a write cache: reads are
served from the cache device if the data is present, but reads from the slow
device are not promoted into the cache. Typically with HDD clusters write
latency is the issue; reads are helped by the OSD cache and, in the case of
replicated pools, are much faster anyway.
You need a recent kernel; we have an upstreamed patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/md/dm-writecache.c?h=v4.19.114&id=10b9bf59bab1018940e8949c6861d1a7fb0393a1
Also, depending on your distribution, you may need an updated LVM toolset.
/Maged
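
(Not from PetaSAN itself - just a minimal sketch of what attaching a
dm-writecache volume with the LVM toolset can look like. The VG/LV names and
device paths are hypothetical; it assumes lvm2 >= 2.03 with writecache support
and that the OSD is stopped before its LV is converted.)

#!/usr/bin/env python3
"""Minimal sketch: attach an NVMe-backed dm-writecache LV in front of an
HDD-backed OSD LV via LVM. All names and devices below are placeholders."""
import subprocess

VG = "ceph-vg"            # hypothetical volume group spanning the HDD and NVMe PVs
ORIGIN_LV = "osd-block"   # hypothetical slow (HDD) LV backing the OSD
CACHE_LV = "osd-wcache"   # fast LV to carve out of the NVMe PV
NVME_PV = "/dev/nvme0n1"  # hypothetical fast physical volume


def run(cmd):
    """Echo and execute an LVM command, aborting on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Create the fast LV on the NVMe physical volume.
run(["lvcreate", "-n", CACHE_LV, "-L", "100G", VG, NVME_PV])

# 2. Attach it as a dm-writecache in front of the OSD LV (OSD stopped first).
#    Requires an lvm2 toolset that knows --type writecache (>= 2.03).
run(["lvconvert", "--yes", "--type", "writecache",
     "--cachevol", CACHE_LV, f"{VG}/{ORIGIN_LV}"])
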
* If you plan to use a fast drive as an LVM cache for multiple
OSDs, be aware that all OSD operations (including replication)
will go through the caching device. All reads will be queried
from the caching device, and are only served from the slow device
in case of a cache miss. Writes are always applied to the caching
device first, and are flushed to the slow device at a later time
('writeback' is the default caching mode).
* When deciding whether to utilize an LVM cache, verify whether the
fast drive can serve as a front for multiple OSDs while still
providing an acceptable number of IOPS. You can test this by
measuring the maximum number of IOPS that the fast device can
serve, and then dividing the result by the number of OSDs behind
the fast device. If the result is lower than or close to the maximum
number of IOPS that the OSD can provide without the cache, LVM
cache is probably not suited for this setup (see the sketch after
this list).
* The interaction of the LVM cache device with OSDs is
important. Writes are periodically flushed from the caching
device to the slow device. If the incoming traffic is sustained
and significant, the caching device will struggle to keep up with
incoming requests as well as the flushing process, resulting in a
performance drop. Unless the fast device can provide much more
IOPS with better latency than the slow device, do not use LVM
cache with a sustained high volume workload. Traffic in a burst
pattern is more suited for LVM cache as it gives the cache time
to flush its dirty data without interfering with client traffic.
For a sustained low traffic workload, it is difficult to guess in
advance whether using LVM cache will improve performance. The
best test is to benchmark and compare the LVM cache setup against
the WAL/DB setup. Moreover, as small writes are heavy on the WAL
partition, it is suggested to use the fast device for the DB
and/or WAL instead of an LVM cache.
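
To make the sizing check in the second bullet concrete, here is a small
sketch; the figures are placeholders, not benchmarks:

# Sizing check from the second bullet above: divide the fast device's measured
# IOPS by the number of OSDs behind it and compare against what a single OSD
# achieves without a cache.
def lvm_cache_worthwhile(fast_dev_iops: float, num_osds: int,
                         osd_iops_without_cache: float) -> bool:
    """True if the fast device leaves clear per-OSD headroom over a bare OSD."""
    per_osd_share = fast_dev_iops / num_osds
    # "lower than or close to" the bare-OSD figure means the cache likely won't help
    return per_osd_share > osd_iops_without_cache


# Example with made-up figures: a 200k IOPS NVMe in front of 12 HDD OSDs that
# each manage ~300 IOPS on their own -> ~16.6k IOPS per OSD, plenty of headroom.
print(lvm_cache_worthwhile(200_000, 12, 300))  # True
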
So it sounds like you could partition your NVMe for either LVM-cache,
DB/WAL, or both?
Just figured this sounded a bit more akin to what you were looking for
in your original post. I don't use this myself, but figured I would share it.
Reed
On Apr 4, 2020, at 9:12 AM, jesper@xxxxxxxx
wrote:
Hi.
We have a need for "bulk" storage - but with decent write latencies.
Normally we would do this with a DAS running RAID 5 with a 2 GB battery-backed
write cache in front - as cheap as possible, but still getting the
scalability features of Ceph.
In our "first" Ceph cluster we did the same - just stuffed BBWC
into the OSD nodes and we're fine - but now we're onto the next one, and
systems like:
https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
do not support a RAID controller like that - but are branded as "Ceph
Storage Solutions".
They do, however, support 4 NVMe slots in the front - so some level of
"tiering" using the NVMe drives seems to be what is "suggested" - but what
do people do? What is recommended? I see multiple options:
Ceph tiering at the "pool" layer:
https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
And rumors that it is "deprecated":
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
Pro: Abstract layer
Con: Deprecated? - Lots of warnings?
Offloading the block.db on NVMe / SSD:
https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
Pro: Easy to deal with - seems heavily supported.
Con: As far as I can tell, this will only benefit the metadata of the
OSD - not the actual data. Thus a data commit to the OSD will still be
dominated by the write latency of the underlying - very slow - HDD.
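
(For reference - a minimal sketch of this variant, not from the thread:
BlueStore data on the HDD, block.db on an NVMe partition via ceph-volume.
The device paths below are hypothetical, not a recommendation.)

#!/usr/bin/env python3
"""Sketch of the block.db offload: BlueStore data on an HDD, RocksDB (block.db)
on an NVMe partition. Device paths are placeholders."""
import subprocess

HDD = "/dev/sdb"            # hypothetical slow data device
NVME_DB = "/dev/nvme0n1p1"  # hypothetical NVMe partition reserved for block.db

# ceph-volume puts the data on the HDD and block.db on the NVMe partition;
# the WAL follows the DB unless it is given its own --block.wal device.
subprocess.run(["ceph-volume", "lvm", "create",
                "--bluestore",
                "--data", HDD,
                "--block.db", NVME_DB],
               check=True)
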
Bcache:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
Pro: Closest to the BBWC mentioned above - but with way, way larger cache
sizes.
Con: It is hard to tell whether I would end up being the only one on the
planet using this solution.
Eat it - writes will be as slow as hitting dead rust - anything that
cannot live with that needs to be entirely on SSD/NVMe.
Other?
Thanks for your input.
Jesper
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx