Hi Istvan,
Is that 1-1.2 billion 40KB rgw objects? If you are running EC 4+2 on a
42 OSD cluster with that many objects (and a heavily write oriented
workload), that could be hitting rocksdb pretty hard. FWIW, you might
want to look at the compaction stats provided in the OSD log. You can
also run this tool to summarize compaction events:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
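If I remember right, you just point it at an OSD log; the exact invocation below is an assumption on my part, so check the script itself:

  # summarize compaction events from one OSD's log (log path is just an example)
  python3 ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.0.log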
The more data you accumulate in rocksdb, the higher write-amplification
you'll tend to see on compaction. While you are in compaction, some
operations may block, which is why you are seeing all of the
laggy/blocked/slow warnings crop up. Writes can proceed to the WAL, but
as the buffers fill up, rocksdb may start throttling or even stalling
writes to let WAL->L0 flushes catch up. Since you are also using the
same drives for block and journal, they are likely getting hit pretty
hard if you have a significant write workload: a mix of big
reads/writes/deletes for compaction, a steady stream of small
O_DSYNC writes to the WAL, and whatever IO is going to the objects
themselves. What brand/model drive are you using?
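If you want to see how hard those drives are being hit while a
compaction is running, watching per-device utilization and latency is
usually enough to tell (plain iostat, nothing Ceph-specific):

  # watch extended per-device stats at 1s intervals during a compaction window
  iostat -x 1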
Fiddling with those rocksdb settings may or may not help, but I suspect
it will make things worse unless you are very careful. We default to
having big buffers (256MB) because it lowers write-amplification in L0
(which can otherwise unnecessarily trigger compaction in deeper levels as well).
The trade-off is higher CPU usage to keep more entries in sorted order
and longer flushes to L0 (because we are flushing more data at once).
The benefit though is much better rejection of transient data making it
into L0. That means lower (sometimes significantly lower) aggregate
time spent in compaction and less wear on the disk. The tuning you see
used in various places originated from folks who were benchmarking with
very fast storage devices. I don't remember exactly how they landed on
those specific settings, but the bulk of the
benefit they are seeing is that smaller buffers reduce the CPU load on
the kv_sync_thread at the expense of higher write load on the disks. If
you have disks that can handle that, you can gain write performance as
CPU usage in the kv sync thread is often (not always) the bottleneck for
write workloads. That is, if you have the DB/WAL on Optane or some other very
fast and high endurance storage technology, you may want smaller write
buffers. For slower SAS SSDs, probably not. You could also attempt to
increase the number of buffers from 4 to say 8 (allowing twice the
amount of data to accumulate in the WAL), but the benefit there is
pretty circumstantial. If your WAL flushes are slow due to L0
compaction, it will give you more runway; that is, if your writes come
in bursts with idle periods in between, it could help you avoid
throttling. If your workload is more continuous in nature, it won't do
much other than use more disk space and memory for memtables/WAL. It
doesn't do anything afaik to help reduce the time actually spent in
compaction.
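If you do want to try the 4 -> 8 buffer change, a minimal sketch would
be to take your current option string and only bump
max_write_buffer_number, then restart the OSDs (as far as I know a
restart is enough to apply bluestore_rocksdb_options; no need to
recreate the OSDs). Test on one OSD first:

  # your current string with only max_write_buffer_number changed from 4 to 8
  ceph config set osd bluestore_rocksdb_options "compression=kNoCompression,max_write_buffer_number=8,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
  # restart the OSDs for the new options to take effect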
One thing that could dramatically help in this case is column family
sharding. Basically, data is split across multiple shallower LSM trees.
We've seen evidence that this can lead to much lower write-amplification
and shorter compaction times when you have a lot of data. Sadly, it
requires rebuilding the LSM hierarchy; I suspect the migration could be
quite slow on your setup, and it's a pretty new feature.
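For reference, the resharding is done offline with ceph-bluestore-tool
against a stopped OSD; roughly something like the sketch below, if I
remember the syntax right (take the actual sharding spec from the docs
for your release, the value below is just a placeholder):

  # with the OSD stopped:
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 show-sharding
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 --sharding "<spec from your release's docs>" reshard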
Hope this helps,
Mark
On 12/2/21 5:37 AM, Szabo, Istvan (Agoda) wrote:
Hi,
I'm trying to understand these settings more and more deeply based on this article: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction but I'm still not sure what would be the best option for my workload; maybe someone is familiar with this or has a similar cluster to mine.
Issue:
* During compaction I have slow ops, blocked IO, and laggy PGs.
* At the moment the OSDs already have 5 levels, according to the OSD logs.
* I'm using the basic settings, which are: "compression=kNoCompression, max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
* I have 1-1.2 billion 40KB objects in my cluster
* Data is on host-based EC 4:2 in a 7-node cluster; each node has 6x 15.3TB SAS SSD OSDs (no NVMe for RocksDB)
Multiple configurations can be found on the internet, but the most commonly tuned is:
* compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
So my question:
* How should I tune these settings to speed things up for my workload?
* Also, is an OSD restart enough for these settings to be applied, or do I need to recreate the OSDs?
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
---------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx