Re: Best settings bluestore_rocksdb_options for my workload

On 12/2/21 7:14 AM, Szabo, Istvan (Agoda) wrote:

Hi Mark,

Thank you for the quick answer.

I'd say the average object size in the cluster is around 50 KB (1.51 billion objects stored on 71 TB).
It is important to mention that I have 3 giant buckets (all properly pre-sharded, because the cluster is a 4-cluster active-active Octopus 15.2.14 solution):
- the 1st has 1.2 billion objects (average: 40 KB)
- the 2nd has 120 million objects (average: 31 KB)
- the 3rd has 201 million objects (average: 44 KB)

Also worth mentioning: when I increased the pg_num on the data pool from 32 to 128 (as the autoscaler suggested), the situation got a lot worse.


When you increase the PG count, data is going to move around for a while as the cluster rebalances.  That will definitely impact performance until it finishes.



We used a separate WAL/DB device (1x NVMe per 3 SSD OSDs), but we've migrated away from it because many of the OSDs had spillover and the NVMe was maxed out 24/7 during the week, which generated iowait on all nodes.

In this picture, the 15.3 TB drives are the ones currently in the cluster: https://i.ibb.co/gdB0kHg/ssd.png
This NVMe was in front of them earlier: https://i.ibb.co/hYZdWWF/nvme.png


FWIW, the larger drives appear to be based on the Samsung PM1643a.  It's a read-oriented drive, but afaik not a terrible one.  For writes, Samsung claims (assuming SAS 12Gb/s):

~2 GB/s sequential write

~70K random write IOPS

DWPD: 1


And someone tested it with JJ's tool here with reasonably decent results:

https://github.com/TheJJ/ceph-diskbench/blob/master/README.md


As far as I can tell that NVMe drive is a Kioxia (formerly Toshiba) CM6, which is also heavily read-oriented.  I know very little about these drives, but the specifications at 1.92 TB are low for writes as far as NVMe goes:

https://business.kioxia.com/content/dam/kioxia/shared/business/ssd/doc/dSSD-CD6-R-product-brief.pdf

~1.15 GB/s sequential write

~30K random write IOPS

DWPD: 1


Assuming that's the NVMe drive you have, it's not surprising that co-locating the DB/WAL on the Samsung drives did better than putting 3 DB/WALs on the NVMe drive.



I'm not really familiar with these LSM things; is it available in Octopus?
To be honest, I'd be interested in anything that can help.


LSM stands for Log-Structured Merge tree.  It's how RocksDB stores data internally.



At the moment I'm running a test with 4 OSDs per SSD drive in another test cluster with a giant bucket. I'm curious whether this could help; if not, I'll try splitting the big buckets into smaller ones. But if that solves the issue, the developers will need to completely rewrite their code around the structure of the object store, so I need to prove it on my side first before announcing it to everybody.

I've grabbed compaction statistics from some of the problematic OSDs:

Osd.17:
Compaction Statistics   /var/log/ceph/ceph-osd.17.log
Total OSD Log Duration (seconds)        61055.803
Number of Compaction Events     271
Avg Compaction Time (seconds)   3.9432931107011044
Total Compaction Time (seconds) 1068.6324329999993
Avg Output Size: (MB)   645.1796198700627
Total Output Size: (MB) 174843.676984787
Total Input Records     1266053332
Total Output Records    1214405618
Avg Output Throughput (MB/s)    161.33097305927973
Avg Input Records/second        1358274.7230483424
Avg Output Records/second       1125564.7031218666
Avg Output/Input Ratio  0.938602964154396

Osd.24:
Compaction Statistics   /var/log/ceph/ceph-osd.24.log
Total OSD Log Duration (seconds)        59157.211
Number of Compaction Events     43
Avg Compaction Time (seconds)   2.2380949767441862
Total Compaction Time (seconds) 96.23808400000001
Avg Output Size: (MB)   294.34481847008993
Total Output Size: (MB) 12656.827194213867
Total Input Records     68968598
Total Output Records    64650867
Avg Output Throughput (MB/s)    128.63039035909833
Avg Input Records/second        858216.8880948307
Avg Output Records/second       771972.9627006812
Avg Output/Input Ratio  0.8799823448366146


Neither of those looks super awful, to be honest, unless you have a really bad outlier or something.  On osd.24, for instance, the overall log spanned a 16-hour period and you only spent 96 seconds in compaction, with the average compaction taking 2.2 seconds.  osd.17 is working a little harder, but it's not awful yet (check and see how L0 compaction looks).  Are you sure the cause of the laggy ops is compaction and not data movement from changing the PG count?
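To put those numbers in perspective, here is a quick back-of-the-envelope sketch (plain Python, values copied from the stats above) of the fraction of each log window spent compacting:

```python
# Fraction of the logged window each OSD spent in compaction,
# using the numbers quoted in the stats above.
stats = {
    "osd.17": {"log_duration_s": 61055.803, "compaction_s": 1068.632},
    "osd.24": {"log_duration_s": 59157.211, "compaction_s": 96.238},
}

for osd, s in stats.items():
    pct = 100.0 * s["compaction_s"] / s["log_duration_s"]
    print(f"{osd}: {pct:.2f}% of the window spent compacting")
```

That works out to roughly 1.75% for osd.17 and 0.16% for osd.24, which matches the read above that neither looks terrible on its own; the question is whether the individual compaction events coincide with the laggy ops.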


Mark



Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: Thursday, December 2, 2021 7:33 PM
To: ceph-users@xxxxxxx
Subject:  Re: Best settings bluestore_rocksdb_options for my workload


Hi Istvan,


Is that 1-1.2 billion 40KB rgw objects?  If you are running EC 4+2 on a
42 OSD cluster with that many objects (and a heavily write oriented workload), that could be hitting rocksdb pretty hard.  FWIW, you might want to look at the compaction stats provided in the OSD log.  You can also run this tool to summarize compaction events:

https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
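As a rough back-of-the-envelope check of why that object count stresses rocksdb (my arithmetic, using the EC 4+2 and 42-OSD figures above — each EC shard is a separate bluestore object with its own onode and metadata keys in rocksdb):

```python
objects = 1.2e9     # ~1.2 billion rgw objects (figure from this thread)
ec_shards = 4 + 2   # EC 4+2: every object is stored as 6 shards
osds = 42           # 7 nodes x 6 OSDs

shards_per_osd = objects * ec_shards / osds
print(f"~{shards_per_osd / 1e6:.0f}M onodes per OSD")
```

That's on the order of 171 million onodes per OSD before the bucket index omap is even counted, so every OSD's LSM tree is carrying a lot of data.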


The more data you accumulate in rocksdb, the higher write-amplification you'll tend to see on compaction.  While you are in compaction, some operations may block, which is why you are seeing all of the laggy/blocked/slow warnings crop up.  Writes can proceed to the WAL, but as the buffers fill up, rocksdb may start throttling or even stalling writes to let WAL->L0 flushes catch up.  Since you are also using the drives for block and journal, they are likely getting hit pretty hard if you have a significant write workload: a mix of big reads/writes/deletes for compaction, plus a steady stream of small O_DSYNC writes to the WAL, plus whatever IO is going to the objects themselves.  What brand/model drive are you using?

Fiddling with those rocksdb settings may or may not help, but I suspect it will make things worse unless you are very careful.  We default to big buffers (256MB) because it lowers write-amplification in L0 (which can unnecessarily trigger compaction in deeper levels as well).
The trade-off is higher CPU usage to keep more entries in sorted order and longer flushes to L0 (because we are flushing more data at once).
The benefit, though, is much better rejection of transient data making it into L0.  That means lower (sometimes significantly lower) aggregate time spent in compaction and less wear on the disk.  The tuning that you see used in various places originated from folks who were benchmarking with very fast storage devices.  I don't remember exactly how they landed on those specific settings, but the bulk of the benefit they are seeing is that smaller buffers reduce the CPU load on the kv_sync_thread at the expense of higher write load on the disks.  If you have disks that can handle that, you can gain write performance, as CPU usage in the kv sync thread is often (not always) the bottleneck for write workloads.  IE if you have the DB/WAL on Optane or some other very fast, high-endurance storage technology, you may want smaller write buffers; for slower SAS SSDs, probably not.

You could also attempt to increase the number of buffers from 4 to, say, 8 (allowing twice the amount of data to accumulate in the WAL), but the benefit there is pretty circumstantial.  If your WAL flushes are slow due to L0 compaction it will give you more runway.  IE if your workload has writes come in bursts with idle periods in between, it could help you avoid throttling.  If your workload is more continuous in nature, it won't do much other than use more disk space and memory for memtables/WAL.  It doesn't do anything, afaik, to reduce the time actually spent in compaction.
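To make that buffer arithmetic concrete (plain Python, using the default option values quoted elsewhere in this thread):

```python
# WAL/memtable headroom implied by bluestore_rocksdb_options:
# write_buffer_size x max_write_buffer_number bounds how much
# unflushed data rocksdb will hold before throttling/stalling.
write_buffer_size = 268435456        # 256 MiB, the Ceph default
max_write_buffer_number = 4          # the Ceph default

default_wal = write_buffer_size * max_write_buffer_number
doubled_wal = write_buffer_size * 8  # raising the buffer count to 8

print(f"default WAL headroom: {default_wal / 2**30:.0f} GiB")
print(f"with 8 buffers:       {doubled_wal / 2**30:.0f} GiB")
```

So the stock settings allow roughly 1 GiB of unflushed data per OSD, and doubling the buffer count doubles that runway (and the memory/disk footprint) without changing compaction time at all.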

One thing that could dramatically help in this case is column family sharding.  Basically data is split over multiple shallower LSM trees.
We've seen evidence that this can lead to much lower write-amplification and shorter compaction times when you have a lot of data.  Sadly it requires rebuilding the LSM hierarchy.  I suspect on your setup it could be quite slow to migrate and it's a pretty new feature.


Hope this helps,

Mark



On 12/2/21 5:37 AM, Szabo, Istvan (Agoda) wrote:
Hi,

I'm trying to understand these settings more deeply based on this article: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction but I'm still not sure what would be the best option for my workload; maybe someone is familiar with this or has a similar cluster to mine.

Issue:

    *   During compaction I have slow ops, blocked IO, and laggy PGs.
    *   At the moment the OSD logs already show 5 levels.
    *   I'm using the default settings, which are: "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
    *   I have 1-1.2 billion 40 KB objects in my cluster.
    *   Data is on host-based EC 4:2 in a 7-node cluster; each node has 6x 15.3 TB SAS SSD OSDs (no NVMe for RocksDB).

Multiple configurations can be found on the internet, but this is the most commonly tuned one:

    *   compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
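One way to see exactly what that tuning changes relative to the defaults quoted above is to diff the two option strings (a small sketch; both strings are copied verbatim from this thread):

```python
def parse_opts(s):
    # bluestore_rocksdb_options is a comma-separated key=value string
    return dict(kv.split("=", 1) for kv in s.split(","))

default = parse_opts(
    "compression=kNoCompression,max_write_buffer_number=4,"
    "min_write_buffer_number_to_merge=1,recycle_log_file_num=4,"
    "write_buffer_size=268435456,writable_file_max_buffer_size=0,"
    "compaction_readahead_size=2097152,max_background_compactions=2"
)
tuned = parse_opts(
    "compression=kNoCompression,max_write_buffer_number=32,"
    "min_write_buffer_number_to_merge=2,recycle_log_file_num=32,"
    "compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,"
    "target_file_size_base=67108864,max_background_compactions=31,"
    "level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,"
    "level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,"
    "compaction_threads=32,max_bytes_for_level_multiplier=8,"
    "flusher_threads=8,compaction_readahead_size=2MB"
)

# Print every option that differs between the two strings.
for key in sorted(default.keys() | tuned.keys()):
    if default.get(key) != tuned.get(key):
        print(f"{key}: {default.get(key, '(unset)')} -> {tuned.get(key, '(unset)')}")
```

The headline difference is write_buffer_size dropping from 256 MiB to 64 MiB with 8x as many buffers — exactly the smaller-buffers-for-fast-devices trade-off Mark describes in his reply.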

So my questions:

    *   How should I tune these settings to speed things up for my workload?
    *   Also, is an OSD restart enough for these settings to be applied, or do I need to recreate the OSDs?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



