Thanks for the explanation. So the first thing I did wrong was not adding up
the levels to reach the total space. I didn't know that, and I had set:
max_bytes_for_level_base=536870912 and max_bytes_for_level_multiplier=10
(536870912 bytes * 10 * 10 ~ 50 GB).

I have free space on the NVMe's, so I think I can resize the partitions:
1- Set the OSD down
2- Migrate the partition to the next blocks to be able to resize it
3- Resize the DB partition to 60 GiB * 19 HDDs = 1140 GiB in total
4- Set the OSD up

The other option is:
1- Remove one NVMe from the RAID 1
2- Migrate half of the partitions onto the now-empty NVMe
3- Resize the partitions
4- Resize the remaining partitions, or re-create the NVMe layout to get rid
of the degraded NVMe pool

It's a lot of hard work, and what you said - "you need to re-create OSDs for
the new RocksDB options" - killed my dreams. Are you sure about this? Why
does an OSD restart have no effect on the RocksDB options? Do I really need
to re-create all 190 HDD OSDs? Just wow. It will take decades to finish.

On Tue, 21 Sep 2021 at 10:15, Christian Wuerdig
<christian.wuerdig@xxxxxxxxx> wrote:

> It's been discussed a few times on the list, but RocksDB levels essentially
> grow by a factor of 10 (max_bytes_for_level_multiplier) by default, and the
> next level (10x the previous one) has to fit on your drive to avoid
> spill-over.
> So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB, and
> since 50GB is less than ~284GB (the sum of all levels) you get spill-over
> going from L2 to L3. See also
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing
>
> Interestingly, your level base seems to be 512MB instead of the default
> 256MB - did you change that? In your case the sequence I would have
> expected is 0.5 -> 5 -> 50, and you should already have seen spill-over
> past the 5GB level (you only have 50GB partitions but would need at least
> 55.5GB). Not sure what's up with that. I think you need to re-create OSDs
> after changing these RocksDB params.
>
> Overall, since Pacific this no longer holds entirely true, because RocksDB
> sharding was added (
> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#bluestore-rocksdb-sharding)
> - it was broken in 16.2.4 but looks like it's fixed in 16.2.6.
>
> 1. Upgrade to Pacific
> 2. Get rid of the NVME raid
> 3. Make 160GB DB partitions
> 4. Activate RocksDB sharding
> 5. Don't worry about RocksDB params
>
> If you don't feel like upgrading to Pacific any time soon but want to make
> more efficient use of the NVME and don't mind going out on a limb, I'd
> still do 2+3, plus study
> https://github-wiki-see.page/m/facebook/rocksdb/wiki/Leveled-Compaction
> carefully and make adjustments based on that.
> With 160GB partitions, a multiplier of 7 might work well with a base size
> of 350MB: 0.35 -> 2.45 -> 17.15 -> 120.05 (total 140GB out of 160GB).
>
> You could also try switching to a 9x multiplier and re-create one of the
> OSDs to see how it pans out prior to dissolving the raid1 setup (given
> your settings that should result in 0.5 -> 4.5 -> 40.5 GB usage).
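
By the way, to sanity-check that level math for myself, here is a rough
Python sketch of the arithmetic described above. It is only a
back-of-the-envelope version and an assumption on my side: it treats
max_bytes_for_level_base and max_bytes_for_level_multiplier as the only
inputs and ignores L0, the WAL and the extra room compaction needs, so the
numbers are approximate.

def rocksdb_levels(base_gb, multiplier, db_partition_gb, levels):
    """Print each level's size and the running total, and flag where it stops fitting."""
    total = 0.0
    for n in range(1, levels + 1):
        size = base_gb * multiplier ** (n - 1)
        total += size
        fits = "fits" if total <= db_partition_gb else "spills to slow device"
        print("level %d: %8.2f GB  cumulative: %8.2f GB  -> %s" % (n, size, total, fits))

# My current settings: 512 MB base, x10 multiplier, 50 GB DB partition.
# Cumulative sizes come out as 0.5, 5.5 and 55.5 GB, so the third level no longer fits.
rocksdb_levels(0.5, 10, 50, levels=3)

# The suggested alternative: 350 MB base, x7 multiplier, 160 GB partition.
# Cumulative sizes come out as 0.35, 2.8, 19.95 and 140 GB, so four levels fit.
rocksdb_levels(0.35, 7, 160, levels=4)

The 9x idea checks out the same way: rocksdb_levels(0.5, 9, 50, levels=3)
gives 0.5 + 4.5 + 40.5 = 45.5 GB in total, matching the numbers above.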
> On Tue, 21 Sept 2021 at 13:19, mhnx <morphinwithyou@xxxxxxxxx> wrote:
>
>> Hello everyone!
>> I want to understand the concept and tune my RocksDB options on Nautilus
>> 14.2.16.
>>
>>     osd.178 spilled over 102 GiB metadata from 'db' device (24 GiB used
>>     of 50 GiB) to slow device
>>     osd.180 spilled over 91 GiB metadata from 'db' device (33 GiB used
>>     of 50 GiB) to slow device
>>
>> The problem is, I have the spill-over warnings like the rest of the
>> community. I tuned the RocksDB options with the settings below, but the
>> problem still exists and I wonder if I did anything wrong. I still have
>> the spill-overs, and sometimes the index SSDs go down due to compaction
>> problems and cannot be started until I do an offline compaction.
>>
>> Let me tell you about my hardware first. Every server in my system has:
>> HDD - 19 x TOSHIBA MG08SCA16TEY 16.0TB for the EC pool
>> SSD - 3 x SAMSUNG MZILS960HEHP/007 GXL0 960GB
>> NVME - 2 x PM1725b 1.6TB
>>
>> I'm using RAID 1 NVMe for the BlueStore DB. I don't have a WAL.
>> 19 * 50GB = 950GB total usage on the NVMe. (I was thinking of using the
>> rest, but I regret it now.)
>>
>> So! Finally, let's check my RocksDB options:
>>
>> [osd]
>> bluefs_buffered_io = true
>> bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,flusher_threads=8,compaction_readahead_size=2MB,compaction_threads=16,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=10
>>
>> "ceph osd df tree" to see SSD and HDD usage, OMAP and META:
>>
>>  ID  CLASS WEIGHT    REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> -28        280.04810        - 280 TiB 169 TiB 166 TiB 688 GiB 2.4 TiB 111 TiB 60.40 1.00   -        host MHNX1
>> 178  hdd    14.60149  1.00000  15 TiB 8.6 TiB 8.5 TiB  44 KiB 126 GiB 6.0 TiB 59.21 0.98 174     up     osd.178
>> 179  ssd     0.87329  1.00000 894 GiB 415 GiB  89 GiB 321 GiB 5.4 GiB 479 GiB 46.46 0.77 104     up     osd.179
>>
>> I know the size of the NVMe is not suitable for 16TB HDDs. I should have
>> more, but the expense is cutting us to pieces. Because of that, I think
>> I'll see the spill-overs no matter what I do. But maybe I can make it
>> better with your help!
>>
>> My questions are:
>> 1- What is the meaning of (33 GiB used of 50 GiB)?
>> 2- Why is it not 50GiB / 50GiB?
>> 3- Do I have 17GiB of unused area on the DB partition?
>> 4- Is there anything wrong with my RocksDB options?
>> 5- How can I be sure and find good RocksDB options for Ceph?
>> 6- How can I measure the change and test it?
>> 7- Do I need different RocksDB options for HDDs and SSDs?
>> 8- If I stop using the NVMe RAID 1 to double the size and resize the DBs
>> to 160GiB, is it worth the risk of an NVMe going faulty? I would lose
>> 10 HDDs at the same time, but I have 10 nodes and that's only 5% of the
>> EC data. I use m=8 k=2.
>>
>> P.S.: There are so many people asking and searching around this topic.
>> I hope it will work out this time.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
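
A short, hedged follow-up to questions 1-3 above: the "X GiB used of Y GiB"
figures in the spill-over warning come from the BlueFS perf counters, so
they can be read per OSD roughly as sketched below. The counter names
(db_total_bytes, db_used_bytes, slow_used_bytes) are an assumption based on
the Nautilus BlueFS counters; check them against
"ceph daemon osd.<id> perf schema" on the actual build.

import json
import subprocess

def bluefs_usage(osd_id):
    """Read the BlueFS space counters for one OSD via its admin socket (run on the OSD host)."""
    raw = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump", "bluefs"])
    bluefs = json.loads(raw)["bluefs"]
    gib = 1024 ** 3
    # Assumed counter names - verify with "ceph daemon osd.<id> perf schema".
    print("osd.%d: db %.1f GiB used of %.1f GiB, %.1f GiB spilled to slow device" % (
        osd_id,
        bluefs["db_used_bytes"] / gib,
        bluefs["db_total_bytes"] / gib,
        bluefs["slow_used_bytes"] / gib))

bluefs_usage(178)  # e.g. the OSD from the warning above

If those counters behave as assumed, db_used_bytes against db_total_bytes is
where the "33 GiB used of 50 GiB" part of the warning comes from, and
slow_used_bytes is the metadata that spilled onto the HDD.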