Re: Advice on SSD choices for WAL/DB?

Hi Andrei,

The answers to your questions depend on the Ceph version you're using, and on the major use case: RBD, RGW or CephFS?

Burkhard's comments are perfectly valid for Ceph before Octopus - DB volume sizes should be selected from the sequence 3-6 GB, 30-60 GB, 300+ GB.

Intermediate values (e.g. 100 GB) would result in wasted space - BlueFS/RocksDB wouldn't use it.

Since Octopus the situation has improved a bit (see https://github.com/ceph/ceph/pull/29687): one can now make BlueFS use additional space for higher DB levels, which makes DB space assignment less restrictive.

From your warnings I presume that your OSDs use up to L4 in RocksDB, and the spilled-over values are most probably pretty close to the amount of data at L4.

So in case of [planned] Octopus I'd suggest reserving around 64 GB per OSD for the WAL and DB levels L1-L3, plus at least another 40 GB for L4. More is better if you plan to put additional data on the cluster and can afford such drives.

Deferred DB volume space extension is also available these days, so you can grow the DB space gradually by adding more drives and/or extending the LVM volume. Hence IMO the primary concern is having disk slots available for new DB devices, so that more DB space can be added if needed.
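
To illustrate where those size tiers come from, here is a rough sketch of the level arithmetic (my approximation, assuming the usual BlueStore RocksDB defaults of max_bytes_for_level_base = 256 MB and max_bytes_for_level_multiplier = 10 - your actual settings may differ):

# Approximate cumulative RocksDB level sizes plus ~2 GB for the WAL.
# Check your own OSD settings before relying on these numbers.

BASE_MB = 256        # target size of L1
MULTIPLIER = 10      # each level is 10x the previous one
WAL_GB = 2           # typical WAL allowance

def db_size_needed_gb(levels):
    """DB volume size (GB) needed to hold L1..L<levels> plus the WAL."""
    total_mb = sum(BASE_MB * MULTIPLIER ** (lvl - 1) for lvl in range(1, levels + 1))
    return total_mb / 1024.0 + WAL_GB

for levels in (1, 2, 3, 4):
    print(f"L1-L{levels}: ~{db_size_needed_gb(levels):.0f} GB")

# Prints roughly 2, 5, 30 and 280 GB - which is why only the "3-6 GB",
# "30-60 GB" and "300+ GB" points are useful before Octopus.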


Thanks,

Igor


On 7/1/2020 6:05 PM, Andrei Mikhailovsky wrote:
Thanks for the information, Burkhard.

My current setup shows a bunch of these warnings (24 OSDs with spillover, out of the 36 which have WAL/DB on the SSD):

      osd.36 spilled over 1.9 GiB metadata from 'db' device (7.2 GiB used of 30 GiB) to slow device
      osd.37 spilled over 13 GiB metadata from 'db' device (4.2 GiB used of 30 GiB) to slow device
      osd.44 spilled over 26 GiB metadata from 'db' device (13 GiB used of 30 GiB) to slow device
      osd.45 spilled over 33 GiB metadata from 'db' device (10 GiB used of 30 GiB) to slow device
      osd.46 spilled over 37 GiB metadata from 'db' device (8.8 GiB used of 30 GiB) to slow device


From the above, for example, osd.36 is a 3TB disk and osd.45 is a 10TB disk.

I was hoping to address those spillovers with the upgrade too, if it means increasing the SSD space. Currently we've got a WAL of 1 GB and a DB of 30 GB. Am I right in understanding that in the case of osd.46 the DB size should be at least 67 GB to stop the spillover (30 + 37)?
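
For reference, here's my rough tally of the figures above (just reading "used" plus "spilled over" from the warnings, nothing exact):

# Per-OSD metadata footprint = used on the db device + spilled over to the slow device
spillover = {          # osd id: (used_on_db_gib, spilled_gib)
    36: (7.2, 1.9),
    37: (4.2, 13),
    44: (13, 26),
    45: (10, 33),
    46: (8.8, 37),
}

for osd, (used, spilled) in spillover.items():
    print(f"osd.{osd}: ~{used + spilled:.0f} GiB of metadata vs the 30 GiB db volume")

# osd.46, for example, comes out at ~46 GiB in total.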


Cheers

Andrei

----- Original Message -----
From: "Burkhard Linke" <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxx>
Sent: Wednesday, 1 July, 2020 13:09:34
Subject:  Re: Advice on SSD choices for WAL/DB?
Hi,

On 7/1/20 1:57 PM, Andrei Mikhailovsky wrote:
Hello,

We are planning to perform a small upgrade to our cluster and slowly start
adding 12TB SATA HDD drives. We need to accommodate the additional SSD WAL/DB
requirements as well. Currently we are considering the following:

HDD Drives - Seagate EXOS 12TB
SSD Drives for WAL/DB - Intel D3 S4510 960GB or Intel D3 S4610 960GB

Our cluster isn't hosting any IO-intensive DBs or IO-hungry VMs such as
Exchange, MSSQL, etc.

From the documentation that I've read, the recommended size for the DB is between 1%
and 4% of the size of the OSD. Would a 2% figure be sufficient (so around a
240 GB DB for each 12 TB OSD)?

The documentation is wrong. RocksDB uses different levels to store data,
and needs to store each level either completely on the DB partition or on
the data partition. There have been a number of mail threads about the
correct sizing.


In your case the best size would be 30 GB for the DB part plus the WAL size
(usually 2 GB). For compaction and other actions the ideal DB size needs
to be doubled, so you end up with 62 GB per OSD. Larger DB partitions are
a waste of capacity, unless they can hold the next level (300 GB per OSD).
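
If you want to see how much DB space each OSD actually uses right now, something like this quick sketch (untested; run on the OSD host with access to the admin sockets, and note that the BlueFS counter names may differ between releases) pulls the numbers out of "perf dump":

#!/usr/bin/env python3
# Print BlueFS db/slow usage for the given OSD ids, e.g.: ./bluefs_usage.py 36 37 44
import json
import subprocess
import sys

GiB = 1024 ** 3

def bluefs_stats(osd_id):
    # 'ceph daemon osd.N perf dump' talks to the local admin socket
    out = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    return json.loads(out)["bluefs"]

for osd_id in sys.argv[1:]:
    b = bluefs_stats(osd_id)
    print(f"osd.{osd_id}: db {b['db_used_bytes'] / GiB:.1f} of "
          f"{b['db_total_bytes'] / GiB:.0f} GiB used, "
          f"slow {b['slow_used_bytes'] / GiB:.1f} GiB used")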


If you have spare capacity on the SSD (>100 GB) you can either leave it
untouched or create a small SSD-based OSD for small pools that require
lower latency, e.g. a small extra-fast pool for RBD or the RGW
configuration pools.

Also, from your experience, which is the better model for the SSD DB/WAL? Would
the Intel S4510 be sufficient for our purposes, or would the S4610 be a much
better choice? Are there any other cost-effective options to consider
instead of the above models?
The SSD model should support fast sync writes, similar to the known
requirements for filestore journal SSDs. If your selected model is a
good fit according to the test methods, then it is probably also a good
choice for bluestore DBs.


Since not all data is written to the bluestore DB (there is no full data journal,
in contrast to filestore), the amount of data written to the SSD is
probably lower, and the DWPD requirements might be lower too. To be on the safe
side, use the better model (higher DWPD / "write intensive") if possible.

Regards,

Burkhard
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



