Hi Istvan,
Is that 1-1.2 billion 40KB rgw objects? If you are running EC 4+2 on a
42 OSD cluster with that many objects (and a heavily write oriented
workload), that could be hitting rocksdb pretty hard. FWIW, you might
want to look at the compaction stats provided in the OSD log. You can
also run this tool to summarize compaction events:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
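If I remember right, you just point it at an OSD log; the exact invocation below is an assumption on my part, so check the script itself:

  # summarize compaction events from one OSD's log (log path is just an example)
  python3 ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log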
The more data you accumulate in rocksdb, the higher write-amplification
you'll tend to see on compaction. While you are in compaction, some
operations may block, which is why you are seeing all of the
laggy/blocked/slow warnings crop up. Writes can proceed to the WAL, but
as the buffers fill up, rocksdb may start throttling or even stalling
writes to let WAL->L0 flushes catch up. Since you are also using the
same drives for block and journal, they are likely getting hit pretty
hard if you have a significant write workload: a mix of big
reads/writes/deletes for compaction, a steady stream of small
O_DSYNC writes to the WAL, and whatever IO is going to the objects
themselves. What brand/model drive are you using?
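If you want to see how hard those drives are being hit while a
compaction is running, watching per-device utilization and latency is
usually enough to tell (plain iostat, nothing Ceph-specific):

  # watch extended per-device stats at 1s intervals during a compaction window
  iostat -x 1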
Fiddling with those rocksdb settings may or may not help, but I suspect
it will make things worse unless you are very careful. We default to
having big buffers (256MB) because it lowers write-amplification in L0
(which can otherwise unnecessarily trigger compaction in deeper levels as well).
The trade-off is higher CPU usage to keep more entries in sorted order
and longer flushes to L0 (because we are flushing more data at once).
The benefit though is much better rejection of transient data making it
into L0. That means lower (sometimes significantly lower) aggregate
time spent in compaction and less wear on the disk. The tuning you see
used in various places originated from folks who were benchmarking with
very fast storage devices. I don't remember exactly how they landed on
those specific settings, but the bulk of the
benefit they are seeing is that smaller buffers reduce the CPU load on
the kv_sync_thread at the expense of higher write load on the disks. If
you have disks that can handle that, you can gain write performance as
CPU usage in the kv sync thread is often (not always) the bottleneck for
write workloads. That is, if you have the DB/WAL on Optane or some other very
fast and high endurance storage technology, you may want smaller write
buffers. For slower SAS SSDs, probably not. You could also attempt to
increase the number of buffers from 4 to say 8 (allowing twice the
amount of data to accumulate in the WAL), but the benefit there is
pretty circumstantial. If your WAL flushes are slow due to L0
compaction, it will give you more runway; that is, if your writes come
in bursts with idle periods in between, it could help you avoid
throttling. If your workload is more continuous in nature, it won't do
much other than use more disk space and memory for memtables/WAL. It
doesn't do anything afaik to help reduce the time actually spent in
compaction.
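If you do want to try the 4 -> 8 buffer change, a minimal sketch would
be to take your current option string and only bump
max_write_buffer_number, then restart the OSDs (as far as I know a
restart is enough to apply bluestore_rocksdb_options; no need to
recreate the OSDs). Test on one OSD first:

  # your current string with only max_write_buffer_number changed from 4 to 8
  ceph config set osd bluestore_rocksdb_options "compression=kNoCompression,max_write_buffer_number=8,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
  # restart the OSDs for the new options to take effect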
One thing that could dramatically help in this case is column family
sharding. Basically, data is split across multiple shallower LSM trees.
We've seen evidence that this can lead to much lower write-amplification
and shorter compaction times when you have a lot of data. Sadly, it
requires rebuilding the LSM hierarchy; I suspect the migration could be
quite slow on your setup, and it's a pretty new feature.
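For reference, the resharding is done offline with ceph-bluestore-tool
against a stopped OSD; roughly something like the sketch below, if I
remember the syntax right (take the actual sharding spec from the docs
for your release, the value below is just a placeholder):

  # with the OSD stopped:
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 show-sharding
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --sharding "<spec from your release's docs>" reshard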
Hope this helps,
Mark
On 12/2/21 5:37 AM, Szabo, Istvan (Agoda) wrote:
Hi,
I'm trying to understand these settings more and more deeply based on this article: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction but I'm still not sure what would be the best option for my workload; maybe someone is familiar with this or has a similar cluster to mine.
Issue:
* During compaction I have slow ops, blocked IO, and laggy PGs.
* At the moment the OSDs already have 5 levels, according to the OSD logs.
* I'm using the basic settings, which are: "compression=kNoCompression, max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
* I have 1-1.2 billion 40KB objects in my cluster
* Data is on host-based EC 4:2 in a 7-node cluster; each node has 6x 15.3TB SAS SSD OSDs (no NVMe for RocksDB)
Multiple configurations can be found on the internet, but the most commonly tuned is:
* compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
So my question:
* How should I tune these settings to speed things up for my workload?
* Also, is an OSD restart enough for these settings to be applied, or do I need to recreate the OSDs?
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx