Re: Best settings bluestore_rocksdb_options for my workload

Hmm, it's weird: the outage was around 8:20-8:25am, and these are the log lines from the affected OSD around that time, but they don't really match :/
Worth mentioning: if I do an offline compaction, it takes about 1-2 hours to finish.

2021-12-03T07:27:21.574+0700 7fb91febd700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1638491241575027, "job": 5491, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum",
"files_L0": [2362677, 2362675, 2362673], "files_L1": [2362635, 2362636, 2362638, 2362639], "score": 1.02734, "input_data_size": 494562209}
2021-12-03T09:25:16.233+0700 7fb9206be700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1638498316234171, "job": 5504, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum",
"files_L0": [2362729, 2362727, 2362725], "files_L1": [2362678, 2362679, 2362682, 2362683], "score": 1.02415, "input_data_size": 511073420}

So if we go in the direction you mentioned, that it can be due to PG movements, is there anything I can do about that?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Mark Nelson <mnelson@xxxxxxxxxx> 
Sent: Friday, December 3, 2021 3:03 AM
To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Cc: ceph-users@xxxxxxx
Subject: Re:  Re: Best settings bluestore_rocksdb_options for my workload


Regarding the drive: maybe.  On one hand, yes, it probably has lower write endurance than a write-oriented drive (Optane, for instance, has something like 30-60 DWPD).  On the other hand, there's been some disagreement on this list regarding how much write endurance actually matters if your workload is mostly read oriented.  Maybe more relevant for your current problem is how good the write (especially sequential O_DSYNC write) performance is.


Regarding compaction: if you want to really confirm, you could try seeing if the performance drops correlate with the timing of any of the compaction events in your OSD logs.  The event lines you are looking for look like this:

2021-11-04T22:30:36.951+0000 7f14fc51e700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1636065036953377, "job": 12, "event":
"compaction_started", "compaction_reason": "LevelL0FilesNum",
"files_L0": [67, 55, 50, 36], "score": 1, "input_data_size": 53445909}


Find the timestamp for one of your big drops, then do a regex search across all of your OSD logs based on that timestamp (maybe plus or minus a minute or two) and the presence of "compaction_started" (or something more specific if needed), and see if anything lines up well.  If it does, you could further try to determine whether it's always the same OSDs or totally random.
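
If it helps, here is a rough Python sketch of that search (only a sketch: the drop time, window size, and log path are placeholders to adjust; it keys off the timestamp prefix and the "compaction_started" string shown above):

import glob
import re
from datetime import datetime, timedelta

DROP = datetime.fromisoformat("2021-12-03T08:20:00")  # placeholder: time of one throughput drop
WINDOW = timedelta(minutes=5)
# OSD log lines start with e.g. "2021-12-03T07:27:21.574+0700"; seconds precision is enough here
stamp_re = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})')

for path in sorted(glob.glob('/var/log/ceph/ceph-osd.*.log')):
    for line in open(path, errors='replace'):
        if 'compaction_started' not in line:
            continue
        m = stamp_re.match(line)
        if m and abs(datetime.fromisoformat(m.group(1)) - DROP) <= WINDOW:
            print(path, line.rstrip())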


Mark


On 12/2/21 9:25 AM, Szabo, Istvan (Agoda) wrote:

> Well, to be honest you are most probably right, but I don't know how to verify that it is caused by PG movements.
> However, it has been more than a month now and the cluster is totally
> unstable; have a look at the drops in throughput. These are always
> outages from the users' point of view: https://i.ibb.co/rcvRpDG/drop.png
>
> Correct me if I'm wrong, but "read intensive" mostly refers to the endurance of the drive, which means this drive might wear out faster than mixed-use or write-intensive disks because of its flash memory.
>
> Thank you for the examples, I'll have a read.
>
> Is there anything to set or tune on the OSDs so that these PG movements don't have this effect?
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
> -----Original Message-----
> From: Mark Nelson <mnelson@xxxxxxxxxx>
> Sent: Thursday, December 2, 2021 9:57 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Re: Best settings bluestore_rocksdb_options 
> for my workload
>
>
> On 12/2/21 7:14 AM, Szabo, Istvan (Agoda) wrote:
>
>> Hi Mark,
>>
>> Thank you the quick answer.
>>
>> I'd say the average object size in the cluster is around 50 KB (1.51 billion objects stored on 71 TB).
>> It is important to mention that I have 3 giant buckets (all properly pre-sharded, because the cluster is part of a 4-cluster active-active Octopus 15.2.14 setup):
>> - 1st has 1.2 billion objects (average: 40 KB)
>> - 2nd has 120 million objects (average: 31 KB)
>> - 3rd has 201 million objects (average: 44 KB)
>>
>> Also worth mentioning: when I increased pg_num on the data pool from 32 to 128 (as the autoscaler suggested), the situation got a lot worse.
>
> When you increase the pg count data is going to move around for a while as the cluster rebalances.  That definitely is going to impact performance until it finishes.
>
>
>> We used a separate WAL/DB device (1x NVMe per 3 SSD OSDs) but we've migrated away from it because many of them had spillover and the NVMe was maxed out 24/7, which generated iowait on all nodes.
>>
>> In this picture the 15.3TB drives are the ones currently in the cluster:
>> https://i.ibb.co/gdB0kHg/ssd.png This NVMe was in front of them
>> earlier: https://i.ibb.co/hYZdWWF/nvme.png
>
> FWIW, the larger drive appears to be based on the Samsung PM1643a.  It's a read-oriented drive, but afaik not a terrible one.  For writes, Samsung claims (assuming SAS12):
>
> ~2GB/s sequential write
>
> ~70K random write
>
> DWPD: 1
>
>
> And someone tested it with JJ's tool here with reasonably decent results:
>
> https://github.com/TheJJ/ceph-diskbench/blob/master/README.md
>
>
> As far as I can tell that NVMe drive is a Kioxia (formerly Toshiba) CM6 that is also heavily read oriented.  I know very little about these drives, but the specifications at 1.92TB are low for writes as far as NVMe goes:
>
> https://business.kioxia.com/content/dam/kioxia/shared/business/ssd/doc/dSSD-CD6-R-product-brief.pdf
>
> ~1.15GB/s sequential write
>
> ~30K random write
>
> DWPD: 1
>
>
> Assuming that's the NVMe drive you have, it's not surprising that co-locating the DB/WAL on the Samsung drives did better than putting 3 DB/WALs on the NVMe drive.
>
>
>> I'm not really familiar with this LSM thing; is it available in Octopus?
>> To be honest I'd be interested in anything that can help.
>
> LSM stands for Log Structured Merge Tree.  It's how rocksdb stores data internally.
>
>
>> At the moment I'm running a test with 4 OSDs per SSD drive in another test cluster with a giant bucket. I'm curious whether this could help or not; if not, I'll try splitting the big buckets into smaller ones. But if this solves the issue, it will require the developers to completely rewrite their code around the structure of the object store. I need to prove it on my side first before announcing it to everybody.
>>
>> I've grabbed compaction statistics from some of the problematic OSDs:
>>
>> Osd.17:
>> Compaction Statistics   /var/log/ceph/ceph-osd.17.log
>> Total OSD Log Duration (seconds)        61055.803
>> Number of Compaction Events     271
>> Avg Compaction Time (seconds)   3.9432931107011044
>> Total Compaction Time (seconds) 1068.6324329999993
>> Avg Output Size: (MB)   645.1796198700627
>> Total Output Size: (MB) 174843.676984787
>> Total Input Records     1266053332
>> Total Output Records    1214405618
>> Avg Output Throughput (MB/s)    161.33097305927973
>> Avg Input Records/second        1358274.7230483424
>> Avg Output Records/second       1125564.7031218666
>> Avg Output/Input Ratio  0.938602964154396
>>
>> Osd.24:
>> Compaction Statistics   /var/log/ceph/ceph-osd.24.log
>> Total OSD Log Duration (seconds)        59157.211
>> Number of Compaction Events     43
>> Avg Compaction Time (seconds)   2.2380949767441862
>> Total Compaction Time (seconds) 96.23808400000001
>> Avg Output Size: (MB)   294.34481847008993
>> Total Output Size: (MB) 12656.827194213867
>> Total Input Records     68968598
>> Total Output Records    64650867
>> Avg Output Throughput (MB/s)    128.63039035909833
>> Avg Input Records/second        858216.8880948307
>> Avg Output Records/second       771972.9627006812
>> Avg Output/Input Ratio  0.8799823448366146
>
> Neither of those looks super awful to be honest unless you have a really bad outlier or something.  On osd.24 for instance, the overall log spanned a 16-hour period and you only spent 96 seconds in compaction, with the average compaction taking 2.2 seconds.  osd.17 is working a little harder, but it's not awful yet (check and see how L0 compaction looks).  Are you sure the cause of the laggy ops is compaction and not data movement from changing the PG count?
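>
> For the L0 question specifically, here is a rough sketch (only a sketch: it keys off the compaction_started lines and their compaction_reason / input_data_size fields) that breaks the events in one OSD log down per reason:
>
> import json
> import re
> import sys
> from collections import defaultdict
>
> event_re = re.compile(r'EVENT_LOG_v1 (\{.*\})')
> count = defaultdict(int)
> input_bytes = defaultdict(int)
>
> # usage: python3 compaction_reasons.py /var/log/ceph/ceph-osd.17.log
> for line in open(sys.argv[1], errors='replace'):
>     if 'compaction_started' not in line:
>         continue
>     m = event_re.search(line)
>     if not m:
>         continue
>     try:
>         ev = json.loads(m.group(1))
>     except ValueError:
>         continue  # wrapped or truncated line, skip it
>     reason = ev.get('compaction_reason', 'unknown')
>     count[reason] += 1
>     input_bytes[reason] += ev.get('input_data_size', 0)
>
> for reason, n in count.items():
>     print(f"{reason}: {n} compactions, {input_bytes[reason] / 2**20:.0f} MB input")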
>
>
> Mark
>
>
>> Istvan Szabo
>> Senior Infrastructure Engineer
>> ---------------------------------------------------
>> Agoda Services Co., Ltd.
>> e: istvan.szabo@xxxxxxxxx
>> ---------------------------------------------------
>>
>> -----Original Message-----
>> From: Mark Nelson <mnelson@xxxxxxxxxx>
>> Sent: Thursday, December 2, 2021 7:33 PM
>> To: ceph-users@xxxxxxx
>> Subject:  Re: Best settings bluestore_rocksdb_options for 
>> my workload
>>
>>
>> Hi Istvan,
>>
>>
>> Is that 1-1.2 billion 40KB rgw objects?  If you are running EC 4+2 on 
>> a
>> 42 OSD cluster with that many objects (and a heavily write oriented workload), that could be hitting rocksdb pretty hard.  FWIW, you might want to look at the compaction stats provided in the OSD log.  You can also run this tool to summarize compaction events:
>>
>> https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
>>
>>
>> The more data you accumulate in rocksdb, the higher write-amplification you'll tend to see on compaction.  While you are in compaction, some operations may block which is why you are seeing all of the laggy/blocked/slow warnings crop up.  Writes can proceed to the WAL, but as the buffers fill up, rocksdb may start throttling or even stalling writes to let WAL->L0 flushes catch up.  Since you are also using the drives for block and journal, they are likely getting hit pretty hard if you have a significant write workload with a mix of big reads/writes/deletes for compaction with a steady stream of small O_DSYNC writes to the WAL and whatever IO is going to the objects themselves.  What brand/model drive are you using?
>>
>> Fiddiling with those rocksdb settings may or may not help, but I suspect it will make things worse unless you are very careful.  We default to having big buffers (256MB) because it lowers write-amplification in L0 (that can unnecessarily trigger compaction in deeper layers as well).
>> The trade-off is higher CPU usage to keep more entries in sorted order and longer flushes to L0 (because we are flushing more data at once).
>> The benefit though is much better rejection of transient data making it into L0.  That means lower (sometimes significantly lower) aggregate time spent in compaction and less wear on the disk.  That tuning that you see used in various places originated from folks that were benchmarking with very fast storage devices.  I don't remember how exactly they landed at those specific settings, but the bulk of the benefit they are seeing is that smaller buffers reduce the CPU load on the kv_sync_thread at the expense of higher write load on the disks.  If you have disks that can handle that, you can gain write performance as CPU usage in the kv sync thread is often (not always) the bottleneck for write workloads.  IE if you have the DB/WAL on optane or some other very fast and high endurance storage technology, you may want smaller write buffers. For slower SAS SSDs, probably not.  You could also attempt to increase the number of buffers from 4 to say 8 (allowing twice the amount of data to accumulate in the WAL), but the benefit there is pretty circumstantial.  If your WAL flushes are slow due to L0 compaction it will give you more runway.  IE if your workload has writes come in bursts with idle periods in-between it could help you avoid throttling.  If your workload is more continuous in nature it won't do
>> much other than use more disk space and memory for memtables/WAL.   It
>> doesn't do anything afaik to help reduce the time actually spent in compaction.
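>>
>> To put rough numbers on those two knobs, here is a small sketch (just an illustration; the option string is the default one quoted in this thread) that parses a bluestore_rocksdb_options string and prints the worst-case memtable footprint, i.e. write_buffer_size times max_write_buffer_number:
>>
>> def parse_opts(s):
>>     # "k1=v1,k2=v2,..." -> dict of strings
>>     return dict(kv.split('=', 1) for kv in s.split(','))
>>
>> # the default options string quoted earlier in this thread
>> default_opts = ("compression=kNoCompression,max_write_buffer_number=4,"
>>                 "min_write_buffer_number_to_merge=1,recycle_log_file_num=4,"
>>                 "write_buffer_size=268435456,writable_file_max_buffer_size=0,"
>>                 "compaction_readahead_size=2097152,max_background_compactions=2")
>>
>> opts = parse_opts(default_opts)
>> buf = int(opts['write_buffer_size'])
>> num = int(opts['max_write_buffer_number'])
>> print(f"{num} buffers x {buf // 2**20} MB = up to {num * buf // 2**20} MB of memtables per OSD")
>> # the tuned string from the original post (32 x 64 MB buffers) parses the same way
>> # and would allow up to 2048 MB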
>>
>> One thing that could dramatically help in this case is column family sharding.  Basically data is split over multiple shallower LSM trees.
>> We've seen evidence that this can lead to much lower write-amplification and shorter compaction times when you have a lot of data.  Sadly it requires rebuilding the LSM hierarchy.  I suspect on your setup it could be quite slow to migrate and it's a pretty new feature.
>>
>>
>> Hope this helps,
>>
>> Mark
>>
>>
>>
>> On 12/2/21 5:37 AM, Szabo, Istvan (Agoda) wrote:
>>> Hi,
>>>
>>> I'm trying to understand these settings more deeply based on this article: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction but I'm still not sure what would be the best option for my workload; maybe someone here is familiar with this or has a similar cluster to mine.
>>>
>>> Issue:
>>>
>>>      *   During compaction I have slow ops, blocked IO, and laggy PGs.
>>>      *   At the moment the OSDs already show 5 levels in the OSD logs.
>>>      *   I'm using the basic settings which is: "compression=kNoCompression,    max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2"
>>>      *   I have 1-1.2 billion 40 KB objects in my cluster
>>>      *   Data is on host-based EC 4:2 in a 7-node cluster; each node has 6x 15.3TB SAS SSD OSDs (no NVMe for RocksDB)
>>>
>>> Multiple configurations can be found on the internet, but this is the most commonly tuned one:
>>>
>>>      *   compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
>>>
>>> So my question:
>>>
>>>      *   How should I tune these settings to speed things up for my workload?
>>>      *   Also, is an OSD restart enough for these settings to be applied, or do I need to recreate the OSDs?
>>>
>>> Istvan Szabo
>>> Senior Infrastructure Engineer
>>> ---------------------------------------------------
>>> Agoda Services Co., Ltd.
>>> e: istvan.szabo@xxxxxxxxx<mailto:istvan.szabo@xxxxxxxxx>
>>> ---------------------------------------------------
>>>
>>>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



