Let me join in: I have 11 bluefs spillover warnings in my cluster. Where do these settings come from?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

On 2021. Sep 21., at 3:19, mhnx <morphinwithyou@xxxxxxxxx> wrote:

Hello everyone! I want to understand the concept and tune my RocksDB options on Nautilus 14.2.16.

    osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used of 50 GiB) to slow device
    osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used of 50 GiB) to slow device

The problem is that I have the spillover warnings like the rest of the community. I tuned the RocksDB options with the settings below, but the problem still exists and I wonder if I did anything wrong. I still have the spillovers, and sometimes the index SSDs go down due to compaction problems and cannot be started until I do an offline compaction.

Let me tell you about my hardware first. Every server in my system has:

HDD  - 19 x TOSHIBA MG08SCA16TEY 16.0TB for the EC pool
SSD  -  3 x SAMSUNG MZILS960HEHP/007 GXL0 960GB
NVME -  2 x PM1725b 1.6TB

I'm using RAID 1 NVMe for the BlueStore DB. I don't have a separate WAL. 19 * 50GB = 950GB total usage on the NVMe. (I was thinking of using the rest, but I regret it now.)

So! Finally, let's check my RocksDB options:

[osd]
bluefs_buffered_io = true
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=10

"ceph osd df tree" to see SSD and HDD usage, OMAP and META:

ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-28       280.04810        - 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB 111 TiB 60.40 1.00   -        host MHNX1
178 hdd    14.60149  1.00000  15 TiB 8.6 TiB 8.5 TiB  44 KiB 126 GiB 6.0 TiB 59.21 0.98 174     up     osd.178
179 ssd     0.87329  1.00000 894 GiB 415 GiB  89 GiB 321 GiB 5.4 GiB 479 GiB 46.46 0.77 104     up     osd.179

I know the NVMe size is not suitable for 16TB HDDs. I should have more, but the expense is cutting us to pieces. Because of that, I think I'll see the spillovers no matter what I do. But maybe I can make it better with your help!

My questions are:
1- What is the meaning of (33 GiB used of 50 GiB)?
2- Why is it not 50GiB / 50GiB?
3- Do I have 17GiB of unused space on the DB partition?
4- Is there anything wrong with my RocksDB options?
5- How can I be sure and find good RocksDB options for Ceph?
6- How can I measure the change and test it?
7- Do I need different RocksDB options for the HDDs and the SSDs?
8- If I stop using NVMe RAID 1 to double the available size and resize the DBs to 160GiB, is it worth the risk of an NVMe failure? I would lose 10 HDDs at the same time, but I have 10 nodes, so that's only 5% of the EC data. I use m=8 k=2.

(A rough sketch of the level-size arithmetic behind questions 1-3 is appended below.)

P.S: There are so many people asking and searching around this. I hope it will work this time.
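
A minimal sketch of the level-fit model that is usually given as the answer to questions 1-3, using only the numbers quoted in the mail (50 GiB DB volume, max_bytes_for_level_base=536870912, max_bytes_for_level_multiplier=10). This is a rule of thumb, not exact BlueFS accounting: the WAL, L0 files and temporary space used during compaction also land on the DB device, which is why the observed 24-33 GiB of usage does not match these round numbers.

    # Rule-of-thumb sketch: which RocksDB levels can live entirely on the
    # DB partition with the options above? A level that cannot fit in the
    # remaining fast space tends to end up on the slow device (spillover),
    # even though the DB partition still shows free space.
    GIB = 1024 ** 3

    level_base = 536870912     # max_bytes_for_level_base = 512 MiB (target size of L1)
    multiplier = 10            # max_bytes_for_level_multiplier
    db_partition = 50 * GIB    # 50 GiB DB volume per OSD, as in the mail

    size = level_base
    cumulative = 0
    for level in range(1, 5):
        cumulative += size
        verdict = "fits" if cumulative <= db_partition else "does not fit"
        print(f"L{level}: target {size / GIB:7.2f} GiB, "
              f"L1..L{level} total {cumulative / GIB:7.2f} GiB -> {verdict} in 50 GiB")
        size *= multiplier

With these options L1 (0.5 GiB) and L2 (5 GiB) fit, but L3 (50 GiB) no longer fits alongside them, and the bulk of the metadata lives in the highest level, so it spills to the slow device while the warning still reports well under 50 GiB used. This is also roughly where the often-quoted "about 3 / 30 / 300 GB" DB sizing guideline comes from with the default 256 MiB level base; with a 512 MiB base the corresponding sweet spots roughly double.
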
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx