Thanks for the explanation. So the first thing I did wrong was not adding up
the levels to reach the total space. I didn't know that, and I had set:
max_bytes_for_level_base=536870912 and max_bytes_for_level_multiplier=10
(536870912 bytes * 10 * 10 ~ 50 GB).

I have free space on the NVMe's, so I think I can resize the partitions:
1- Set the OSD down
2- Migrate the partition to the next blocks to be able to resize it
3- Resize the DB partition to 60 GiB * 19 HDDs = 1140 GiB in total
4- Set the OSD up

The other option is:
1- Remove one NVMe from the RAID 1
2- Migrate half of the partitions onto the now-empty NVMe
3- Resize the partitions
4- Resize the remaining partitions, or re-create the NVMe layout to get rid
of the degraded NVMe pool

It's a lot of hard work, and what you said - "you need to re-create OSDs for
the new RocksDB options" - killed my dreams. Are you sure about this? Why
does an OSD restart have no effect on the RocksDB options? Do I really need
to re-create all 190 HDD OSDs? Just wow. It will take decades to finish.

On Tue, 21 Sep 2021 at 10:15, Christian Wuerdig
<christian.wuerdig@xxxxxxxxx> wrote:

> It's been discussed a few times on the list, but RocksDB levels essentially
> grow by a factor of 10 (max_bytes_for_level_multiplier) by default, and the
> next level (10x the previous one) has to fit on your drive to avoid
> spill-over.
> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
> since 50GB is less than ~284GB (the sum of all levels) you get spill-over
> going from L2 to L3. See also
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
> Interestingly, your level base seems to be 512MB instead of the default
> 256MB - did you change that? In your case the sequence I would have
> expected is 0.5 -> 5 -> 50, and you should already have seen spill-over
> past the 5GB level (you only have 50GB partitions but would need at least
> 55.5GB). Not sure what's up with that. I think you need to re-create OSDs
> after changing these RocksDB params.
>
> Overall, since Pacific this no longer holds entirely true, because RocksDB
> sharding was added (
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
> - it was broken in 16.2.4 but looks like it's fixed in 16.2.6.
>
> 1. Upgrade to Pacific
> 2. Get rid of the NVME raid
> 3. Make 160GB DB partitions
> 4. Activate RocksDB sharding
> 5. Don't worry about RocksDB params
>
> If you don't feel like upgrading to Pacific any time soon but want to make
> more efficient use of the NVME and don't mind going out on a limb, I'd
> still do 2+3, plus study
> https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
> carefully and make adjustments based on that.
> With 160GB partitions, a multiplier of 7 might work well with a base size
> of 350MB: 0.35 -> 2.45 -> 17.15 -> 120.05 (total 140GB out of 160GB).
>
> You could also try switching to a 9x multiplier and re-create one of the
> OSDs to see how it pans out prior to dissolving the raid1 setup (given
> your settings that should result in 0.5 -> 4.5 -> 40.5 GB usage).
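
By the way, to sanity-check that level math for myself, here is a rough
Python sketch of the arithmetic described above. It is only a
back-of-the-envelope version and an assumption on my side: it treats
max_bytes_for_level_base and max_bytes_for_level_multiplier as the only
inputs and ignores L0, the WAL and the extra room compaction needs, so the
numbers are approximate.

def rocksdb_levels(base_gb, multiplier, db_partition_gb, levels):
    """Print each level's size and the running total, and flag where it stops fitting."""
    total = 0.0
    for n in range(1, levels + 1):
        size = base_gb * multiplier ** (n - 1)
        total += size
        fits = "fits" if total <= db_partition_gb else "spills to slow device"
        print("level %d: %8.2f GB  cumulative: %8.2f GB  -> %s" % (n, size, total, fits))

# My current settings: 512 MB base, x10 multiplier, 50 GB DB partition.
# Cumulative sizes come out as 0.5, 5.5 and 55.5 GB, so the third level no longer fits.
rocksdb_levels(0.5, 10, 50, levels=3)

# The suggested alternative: 350 MB base, x7 multiplier, 160 GB partition.
# Cumulative sizes come out as 0.35, 2.8, 19.95 and 140 GB, so four levels fit.
rocksdb_levels(0.35, 7, 160, levels=4)

The 9x idea checks out the same way: rocksdb_levels(0.5, 9, 50, levels=3)
gives 0.5 + 4.5 + 40.5 = 45.5 GB in total, matching the numbers above.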
> On Tue, 21 Sept 2021 at 13:19, mhnx <morphinwithyou@xxxxxxxxx> wrote:
>
>> Hello everyone!
>> I want to understand the concept and tune my RocksDB options on Nautilus
>> 14.2.16.
>>
>>     osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used
>>     of 50 GiB) to slow device
>>     osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used
>>     of 50 GiB) to slow device
>>
>> The problem is, I have the spill-over warnings like the rest of the
>> community. I tuned the RocksDB options with the settings below, but the
>> problem still exists and I wonder if I did anything wrong. I still have
>> the spill-overs, and sometimes the index SSDs go down due to compaction
>> problems and cannot be started until I do an offline compaction.
>>
>> Let me tell you about my hardware first. Every server in my system has:
>> HDD - 19 x TOSHIBA MG08SCA16TEY 16.0TB for the EC pool
>> SSD - 3 x SAMSUNG MZILS960HEHP/007 GXL0 960GB
>> NVME - 2 x PM1725b 1.6TB
>>
>> I'm using RAID 1 NVMe for the BlueStore DB. I don't have a WAL.
>> 19 * 50GB = 950GB total usage on the NVMe. (I was thinking of using the
>> rest, but I regret it now.)
>>
>> So! Finally, let's check my RocksDB options:
>>
>> [osd]
>> bluefs_buffered_io = true
>> bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=10
>>
>> "ceph osd df tree" to see SSD and HDD usage, OMAP and META:
>>
>>  ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> -28        280.04810        - 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB 111 TiB 60.40 1.00   -        host MHNX1
>> 178  hdd    14.60149  1.00000  15 TiB 8.6 TiB 8.5 TiB  44 KiB 126 GiB 6.0 TiB 59.21 0.98 174     up     osd.178
>> 179  ssd     0.87329  1.00000 894 GiB 415 GiB  89 GiB 321 GiB 5.4 GiB 479 GiB 46.46 0.77 104     up     osd.179
>>
>> I know the size of the NVMe is not suitable for 16TB HDDs. I should have
>> more, but the expense is cutting us to pieces. Because of that, I think
>> I'll see the spill-overs no matter what I do. But maybe I can make it
>> better with your help!
>>
>> My questions are:
>> 1- What is the meaning of (33 GiB used of 50 GiB)?
>> 2- Why is it not 50GiB / 50GiB?
>> 3- Do I have 17GiB of unused area on the DB partition?
>> 4- Is there anything wrong with my RocksDB options?
>> 5- How can I be sure and find good RocksDB options for Ceph?
>> 6- How can I measure the change and test it?
>> 7- Do I need different RocksDB options for HDDs and SSDs?
>> 8- If I stop using the NVMe RAID 1 to double the size and resize the DBs
>> to 160GiB, is it worth the risk of an NVMe going faulty? I would lose
>> 10 HDDs at the same time, but I have 10 nodes and that's only 5% of the
>> EC data. I use m=8 k=2.
>>
>> P.S.: There are so many people asking and searching around this topic.
>> I hope it will work out this time.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
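
A short, hedged follow-up to questions 1-3 above: the "X GiB used of Y GiB"
figures in the spill-over warning come from the BlueFS perf counters, so
they can be read per OSD roughly as sketched below. The counter names
(db_total_bytes, db_used_bytes, slow_used_bytes) are an assumption based on
the Nautilus BlueFS counters; check them against
"ceph daemon osd.<id> perf schema" on the actual build.

import json
import subprocess

def bluefs_usage(osd_id):
    """Read the BlueFS space counters for one OSD via its admin socket (run on the OSD host)."""
    raw = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump", "bluefs"])
    bluefs = json.loads(raw)["bluefs"]
    gib = 1024 ** 3
    # Assumed counter names - verify with "ceph daemon osd.<id> perf schema".
    print("osd.%d: db %.1f GiB used of %.1f GiB, %.1f GiB spilled to slow device" % (
        osd_id,
        bluefs["db_used_bytes"] / gib,
        bluefs["db_total_bytes"] / gib,
        bluefs["slow_used_bytes"] / gib))

bluefs_usage(178)  # e.g. the OSD from the warning above

If those counters behave as assumed, db_used_bytes against db_total_bytes is
where the "33 GiB used of 50 GiB" part of the warning comes from, and
slow_used_bytes is the metadata that spilled onto the HDD.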