Re: Best settings bluestore_rocksdb_options for my workload

Hi Istvan,


Is that 1-1.2 billion 40KB rgw objects?  If you are running EC 4+2 on a 42 OSD cluster with that many objects (and a heavily write oriented workload), that could be hitting rocksdb pretty hard.  FWIW, you might want to look at the compaction stats provided in the OSD log.  You can also run this tool to summarize compaction events:

https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
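For example, a rough sketch of how you might look at this on a live node (the log path and invocation are assumptions; adjust for your deployment and check the script itself for exact usage):

```shell
# Rough sketch, assuming default log locations -- adjust paths for your cluster.
# RocksDB periodically dumps per-level "Compaction Stats" tables into the OSD log:
grep -n "Compaction Stats" /var/log/ceph/ceph-osd.0.log | head

# Summarize individual compaction events with the cbt parser:
python3 ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log
```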


The more data you accumulate in rocksdb, the higher write-amplification you'll tend to see on compaction.  While compaction is running, some operations may block, which is why you are seeing all of the laggy/blocked/slow warnings crop up.  Writes can proceed to the WAL, but as the buffers fill up, rocksdb may start throttling or even stalling writes to let WAL->L0 flushes catch up.  Since you are also using the same drives for block and WAL/DB, they are likely getting hit pretty hard if you have a significant write workload: big reads/writes/deletes for compaction, a steady stream of small O_DSYNC writes to the WAL, and whatever IO is going to the objects themselves.  What brand/model drive are you using?

Fiddling with those rocksdb settings may or may not help, but I suspect it will make things worse unless you are very careful.  We default to having big buffers (256MB) because it lowers write-amplification in L0 (which can unnecessarily trigger compaction in deeper levels as well).  The trade-off is higher CPU usage to keep more entries in sorted order and longer flushes to L0 (because we are flushing more data at once).  The benefit, though, is much better rejection of transient data making it into L0.  That means lower (sometimes significantly lower) aggregate time spent in compaction and less wear on the disk.

The tuning that you see used in various places originated from folks that were benchmarking with very fast storage devices.  I don't remember exactly how they landed at those specific settings, but the bulk of the benefit they are seeing is that smaller buffers reduce the CPU load on the kv_sync_thread at the expense of higher write load on the disks.  If you have disks that can handle that, you can gain write performance, as CPU usage in the kv sync thread is often (not always) the bottleneck for write workloads.  I.e. if you have the DB/WAL on optane or some other very fast and high endurance storage technology, you may want smaller write buffers.  For slower SAS SSDs, probably not.

You could also attempt to increase the number of buffers from 4 to say 8 (allowing twice the amount of data to accumulate in the WAL), but the benefit there is pretty circumstantial.  If your WAL flushes are slow due to L0 compaction, it will give you more runway.  I.e. if your workload has writes come in bursts with idle periods in between, it could help you avoid throttling.  If your workload is more continuous in nature, it won't do much other than use more disk space and memory for memtables/WAL.  It doesn't do anything afaik to help reduce the time actually spent in compaction.
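If you want to try that experiment, a minimal sketch, assuming a recent Ceph with the centralized config database (the option string below simply copies your current defaults and bumps max_write_buffer_number from 4 to 8):

```shell
# Sketch only: same option string as the current defaults, with
# max_write_buffer_number bumped from 4 to 8; everything else unchanged.
ceph config set osd bluestore_rocksdb_options \
  "compression=kNoCompression,max_write_buffer_number=8,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"

# The string is only read when rocksdb is opened, so restart the OSDs
# (no need to recreate them) for it to take effect:
systemctl restart ceph-osd@0    # per OSD, one at a time
```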

One thing that could dramatically help in this case is column family sharding.  Basically, the data is split over multiple shallower LSM trees.  We've seen evidence that this can lead to much lower write-amplification and shorter compaction times when you have a lot of data.  Sadly, it requires rebuilding the LSM hierarchy.  I suspect that on your setup the migration could be quite slow, and it's a pretty new feature.
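If you decide to try it, a rough sketch of the offline procedure (Pacific or later; the OSD id and path are placeholders, and the sharding spec below is the upstream default from bluestore_rocksdb_cfs -- treat all of this as an example, not a recommendation):

```shell
# Rough sketch; ceph-bluestore-tool's reshard is an offline operation,
# so stop the OSD first and do one OSD at a time.
systemctl stop ceph-osd@0

# Rewrite the OSD's rocksdb data into multiple column families.
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 \
    --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
    reshard

systemctl start ceph-osd@0
```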


Hope this helps,

Mark



On 12/2/21 5:37 AM, Szabo, Istvan (Agoda) wrote:
Hi,

Trying to understand these settings more deeply based on this article: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction but I'm still not sure what would be the best option for my workload; maybe someone is familiar with this or has a similar cluster to mine.

Issue:

   *   During compaction I have slow ops, blocked IO, laggy PGs.
   *   The OSD logs already show 5 levels at the moment.
   *   I'm using the basic settings, which are: "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
   *   I have 1-1.2 billion 40KB objects in my cluster
   *   Data is on host-based EC 4:2 in a 7-node cluster; each node has 6x 15.3TB SAS SSD OSDs (no NVMe for rocksdb)

Multiple configurations can be found on the internet, but the most commonly tuned is:

   *   compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB

So my question:

   *   How should I tune these settings to speed things up for my workload?
   *   Also, is an OSD restart enough for these settings to be applied, or do I need to recreate the OSDs?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





