On 19/04/2024 11:02, Niklaus Hofer wrote:
Dear all
We have an HDD Ceph cluster that could do with some more IOPS. One
solution we are considering is installing NVMe SSDs into the storage
nodes and using them as WAL and/or DB devices for the BlueStore OSDs.
However, we have some questions about this and are looking for some
guidance and advice.
The first one is about the expected benefits. Before we undertake the
effort involved in the transition, we are wondering whether it is even
worth it. How much of a performance boost can one expect when adding
NVMe SSDs as WAL devices to an HDD cluster? And how much faster than
that does it get with the DB also being on SSD? Are there
rule-of-thumb numbers for that? Or maybe someone has done benchmarks in
the past?
The second question is of a more practical nature. Are there any
best practices on how to implement this? I was thinking we won't do
one SSD per HDD - surely an NVMe SSD is plenty fast to handle the
traffic from multiple OSDs. But what is a good ratio? Do I use one
NVMe SSD per 4 HDDs? Per 6, or even 8? Also, how should I chop up the
SSD: with partitions or with LVM? Last but not least, if I have one
SSD handle the WAL and DB for multiple OSDs, losing that SSD means losing
multiple OSDs. How do people deal with this risk? Is it generally
deemed acceptable, or is this something people tend to mitigate, and if
so, how? Do I run multiple SSDs in RAID?
I do realize that for some of these, there might not be the one
perfect answer that fits all use cases. I am looking for best
practices and in general just trying to avoid any obvious mistakes.
Any advice is much appreciated.
Sincerely
Niklaus Hofer
Hi Niklaus,
I would recommend always having the external WAL/DB on flash when using HDDs.
The impact depends on the workload, but roughly you should see 2x better
performance for mixed workloads; the gain will be higher if you have an
IOPS-intensive load.
A client write operation requires a metadata read (if not cached) +
the data write to the HDD + a metadata write + a PG log write. HDDs are
terrible at IOPS (100 to 200 IOPS), so moving the non-data operations to a
faster device makes a lot of sense.
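To put rough numbers on that (a back-of-envelope sketch, assuming ~150
IOPS per HDD and that each of those four operations costs the HDD one
I/O): with everything on the HDD you get on the order of 150 / 4 ≈ 37
client writes per second per OSD, whereas with the metadata and PG log
I/O moved to NVMe the HDD only has to absorb the data write itself. In
practice caching, batching and deferred writes soften this, which is why
~2x on mixed workloads is a more realistic expectation than the
theoretical 4x.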
There are also metadata IOPS involved in other operations like RocksDB
compaction, object/snapshot deletions, and scrubbing, all of which benefit
from moving to a fast-IOPS device. I have seen cases where
scrubbing alone can load the HDDs.
Typically you would always put both the WAL and DB on the external device,
not just the WAL. Using just the WAL improves write latency but not IOPS;
that can pay off if your load is bursty with small queue depths, i.e. a
small number of client write operations compared to the total number of
OSDs. In the vast majority of cases, though, this does not apply, and
practically/economically it is a no-brainer to use both WAL+DB.
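As a concrete illustration (device, VG and LV names here are made up;
adjust to your hardware), with ceph-volume you only need to point
--block.db at the fast device. Since RocksDB keeps its WAL inside the DB,
the WAL automatically ends up on the NVMe as well; a separate --block.wal
is only worth specifying if you have yet another, even faster device.

    # data on the HDD, DB (and therefore the WAL) on an NVMe LV
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db ceph-db-nvme0/db-sdb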
For the NVMe:HDD ratio, yes, you can go for 1:10, or, if you have spare
slots, 1:5 with smaller-capacity/cheaper NVMes; the latter also reduces
the impact of an NVMe failure, since fewer OSDs go down with it.
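On the "partitions or LVM" part of the question: LVM is what ceph-volume
uses internally anyway, and carving the NVMe into one logical volume per
OSD keeps things flexible when a single HDD is replaced. A rough sketch
for one NVMe in front of five HDDs (all device names and sizes are
hypothetical):

    # carve the NVMe into one DB LV per HDD OSD
    pvcreate /dev/nvme0n1
    vgcreate ceph-db-nvme0 /dev/nvme0n1
    for disk in sdb sdc sdd sde sdf; do
        lvcreate -L 290G -n db-$disk ceph-db-nvme0
    done
    # ...then create each OSD as shown above, e.g.
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db-nvme0/db-sdb

    # or let ceph-volume lay everything out in one go:
    ceph-volume lvm batch --bluestore /dev/sd{b,c,d,e,f} --db-devices /dev/nvme0n1

If you deploy with cephadm, an OSD service spec that selects rotational
data_devices and non-rotational db_devices achieves the same result
without manual LVM work.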
/Maged