Re: ceph osd commit latency increase over time, until restart

Alexandre DERUMIER <aderumier@xxxxxxxxx> · Fri, 15 Feb 2019 14:54:46 +0100 (CET)

>>Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>OSDs as well. Over time their latency increased until we started to 
>>notice I/O-wait inside VMs. 

I'm also notice it in the vms. BTW, what it your nvme disk size ?

>>A restart fixed it. We also increased memory target from 4G to 6G on 
>>these OSDs as the memory would allow it. 

I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme.  
(my last test was 8gb with 1osd of 6TB, but that didn't help)

----- Mail original -----
De: "Wido den Hollander" <wido@xxxxxxxx>
À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Igor Fedotov" <ifedotov@xxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Envoyé: Vendredi 15 Février 2019 14:50:34
Objet: Re:  ceph osd commit latency increase over time, until restart

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
> Thanks Igor. 
> 
> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is different. 
> 
> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't see this latency problem. 
> 
> 

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
OSDs as well. Over time their latency increased until we started to 
notice I/O-wait inside VMs. 

A restart fixed it. We also increased memory target from 4G to 6G on 
these OSDs as the memory would allow it. 

But we noticed this on two different 12.2.10/11 clusters. 

A restart made the latency drop. Not only the numbers, but the 
real-world latency as experienced by a VM as well. 

Wido 

> 
> 
> 
> 
> 
> ----- Mail original ----- 
> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
> Envoyé: Vendredi 15 Février 2019 13:47:57 
> Objet: Re:  ceph osd commit latency increase over time, until restart 
> 
> Hi Alexander, 
> 
> I've read through your reports, nothing obvious so far. 
> 
> I can only see several times average latency increase for OSD write ops 
> (in seconds) 
> 0.002040060 (first hour) vs. 
> 
> 0.002483516 (last 24 hours) vs. 
> 0.008382087 (last hour) 
> 
> subop_w_latency: 
> 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. 
> 0.003073475 (last hour) 
> 
> and OSD read ops, osd_r_latency: 
> 
> 0.000408595 (first hour) 
> 0.000709031 (24 hours) 
> 0.004979540 (last hour) 
> 
> What's interesting is that such latency differences aren't observed at 
> neither BlueStore level (any _lat params under "bluestore" section) nor 
> rocksdb one. 
> 
> Which probably means that the issue is rather somewhere above BlueStore. 
> 
> Suggest to proceed with perf dumps collection to see if the picture 
> stays the same. 
> 
> W.r.t. memory usage you observed I see nothing suspicious so far - No 
> decrease in RSS report is a known artifact that seems to be safe. 
> 
> Thanks, 
> Igor 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>> 
>> Thanks again for helping ! 
>> 
>> 
>> 
>> I have upgrade to last mimic this weekend, and with new autotune memory, 
>> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
>> 
>> 
>> I have done a lot of perf dump and mempool dump and ps of process to 
> see rss memory at different hours, 
>> here the reports for osd.0: 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/ 
>> 
>> 
>> osd has been started the 12-02-2019 at 08:00 
>> 
>> first report after 1h running 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>> 
>> 
>> 
>> report after 24 before counter resets 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>> 
>> report 1h after counter reset 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>> 
>> 
>> 
>> 
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
> around 12-02-2019 at 14:00 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> Then after that, slowly decreasing. 
>> 
>> 
>> Another strange thing, 
>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>> Then is decreasing over time (around 3,7G this morning), but RSS is 
> still at 8G 
>> 
>> 
>> I'm graphing mempools counters too since yesterday, so I'll able to 
> track them over time. 
>> 
>> ----- Mail original ----- 
>> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx> 
>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" 
> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>> Envoyé: Lundi 11 Février 2019 12:03:17 
>> Objet: Re:  ceph osd commit latency increase over time, 
> until restart 
>> 
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> another mempool dump after 1h run. (latency ok) 
>>> 
>>> Biggest difference: 
>>> 
>>> before restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (other caches seem to be quite low too, like bluestore_cache_other 
> take all the memory) 
>>> 
>>> 
>>> After restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> 
>> This is fine as cache is warming after restart and some rebalancing 
>> between data and metadata might occur. 
>> 
>> What relates to allocator and most probably to fragmentation growth is : 
>> 
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>> 
>> which had been higher before the reset (if I got these dumps' order 
>> properly) 
>> 
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>> 
>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>> latency increase... 
>> 
>> Do you have perf counters dump after the restart? 
>> 
>> Could you collect some more dumps - for both mempool and perf counters? 
>> 
>> So ideally I'd like to have: 
>> 
>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>> 
>> 2) mempool/perf counters dumps in 24+ hours after restart 
>> 
>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>> restart) and dump mempool/perf counters again. 
>> 
>> So we'll be able to learn both allocator mem usage growth and operation 
>> latency distribution for the following periods: 
>> 
>> a) 1st hour after restart 
>> 
>> b) 25th hour. 
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>> 
>>> full mempool dump after restart 
>>> ------------------------------- 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 22225, 
>>> "bytes": 14935200 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 11, 
>>> "bytes": 8184 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 5047, 
>>> "bytes": 22673736 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 91, 
>>> "bytes": 1662976 
>>> }, 
>>> "bluefs": { 
>>> "items": 1907, 
>>> "bytes": 95600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 19664, 
>>> "bytes": 25486050 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 46189, 
>>> "bytes": 2956096 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 17, 
>>> "bytes": 214366 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 889673, 
>>> "bytes": 367160400 
>>> }, 
>>> "osdmap": { 
>>> "items": 3803, 
>>> "bytes": 224552 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 178515204, 
>>> "bytes": 2160630547 
>>> } 
>>> } 
>>> } 
>>> 
>>> ----- Mail original ----- 
>>> De: "aderumier" <aderumier@xxxxxxxxx> 
>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark 
> Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, 
> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" 
> <ceph-devel@xxxxxxxxxxxxxxx> 
>>> Envoyé: Vendredi 8 Février 2019 16:14:54 
>>> Objet: Re:  ceph osd commit latency increase over time, 
> until restart 
>>> 
>>> I'm just seeing 
>>> 
>>> StupidAllocator::_aligned_len 
>>> and 
>>> 
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned 
> long, unsigned long, std::less<unsigned long>, mempoo 
>>> 
>>> on 1 osd, both 10%. 
>>> 
>>> here the dump_mempools 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 105637, 
>>> "bytes": 70988064 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 12, 
>>> "bytes": 8928 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 406, 
>>> "bytes": 4792868 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 66, 
>>> "bytes": 1085440 
>>> }, 
>>> "bluefs": { 
>>> "items": 1882, 
>>> "bytes": 93600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 138986, 
>>> "bytes": 24983701 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 544, 
>>> "bytes": 34816 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 36, 
>>> "bytes": 179308 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 952564, 
>>> "bytes": 372459684 
>>> }, 
>>> "osdmap": { 
>>> "items": 3639, 
>>> "bytes": 224664 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 260109445, 
>>> "bytes": 2228370845 
>>> } 
>>> } 
>>> } 
>>> 
>>> 
>>> and the perf dump 
>>> 
>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>> { 
>>> "AsyncMessenger::Worker-0": { 
>>> "msgr_recv_messages": 22948570, 
>>> "msgr_send_messages": 22561570, 
>>> "msgr_recv_bytes": 333085080271, 
>>> "msgr_send_bytes": 261798871204, 
>>> "msgr_created_connections": 6152, 
>>> "msgr_active_connections": 2701, 
>>> "msgr_running_total_time": 1055.197867330, 
>>> "msgr_running_send_time": 352.764480121, 
>>> "msgr_running_recv_time": 499.206831955, 
>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>> }, 
>>> "AsyncMessenger::Worker-1": { 
>>> "msgr_recv_messages": 18801593, 
>>> "msgr_send_messages": 18430264, 
>>> "msgr_recv_bytes": 306871760934, 
>>> "msgr_send_bytes": 192789048666, 
>>> "msgr_created_connections": 5773, 
>>> "msgr_active_connections": 2721, 
>>> "msgr_running_total_time": 816.821076305, 
>>> "msgr_running_send_time": 261.353228926, 
>>> "msgr_running_recv_time": 394.035587911, 
>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>> }, 
>>> "AsyncMessenger::Worker-2": { 
>>> "msgr_recv_messages": 18463400, 
>>> "msgr_send_messages": 18105856, 
>>> "msgr_recv_bytes": 187425453590, 
>>> "msgr_send_bytes": 220735102555, 
>>> "msgr_created_connections": 5897, 
>>> "msgr_active_connections": 2605, 
>>> "msgr_running_total_time": 807.186854324, 
>>> "msgr_running_send_time": 296.834435839, 
>>> "msgr_running_recv_time": 351.364389691, 
>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>> }, 
>>> "bluefs": { 
>>> "gift_bytes": 0, 
>>> "reclaim_bytes": 0, 
>>> "db_total_bytes": 256050724864, 
>>> "db_used_bytes": 12413042688, 
>>> "wal_total_bytes": 0, 
>>> "wal_used_bytes": 0, 
>>> "slow_total_bytes": 0, 
>>> "slow_used_bytes": 0, 
>>> "num_files": 209, 
>>> "log_bytes": 10383360, 
>>> "log_compactions": 14, 
>>> "logged_bytes": 336498688, 
>>> "files_written_wal": 2, 
>>> "files_written_sst": 4499, 
>>> "bytes_written_wal": 417989099783, 
>>> "bytes_written_sst": 213188750209 
>>> }, 
>>> "bluestore": { 
>>> "kv_flush_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 26.734038497, 
>>> "avgtime": 0.000001013 
>>> }, 
>>> "kv_commit_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3397.491150603, 
>>> "avgtime": 0.000128829 
>>> }, 
>>> "kv_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3424.225189100, 
>>> "avgtime": 0.000129843 
>>> }, 
>>> "state_prepare_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3689.542105337, 
>>> "avgtime": 0.000121028 
>>> }, 
>>> "state_aio_wait_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 509.864546111, 
>>> "avgtime": 0.000016725 
>>> }, 
>>> "state_io_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 24.534052953, 
>>> "avgtime": 0.000000804 
>>> }, 
>>> "state_kv_queued_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3488.338424238, 
>>> "avgtime": 0.000114428 
>>> }, 
>>> "state_kv_commiting_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 5660.437003432, 
>>> "avgtime": 0.000185679 
>>> }, 
>>> "state_kv_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 7.763511500, 
>>> "avgtime": 0.000000254 
>>> }, 
>>> "state_deferred_queued_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 666071.296856696, 
>>> "avgtime": 0.025281557 
>>> }, 
>>> "state_deferred_aio_wait_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 1755.660547071, 
>>> "avgtime": 0.000066638 
>>> }, 
>>> "state_deferred_cleanup_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 185465.151653703, 
>>> "avgtime": 0.007039558 
>>> }, 
>>> "state_finishing_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 3.046847481, 
>>> "avgtime": 0.000000099 
>>> }, 
>>> "state_done_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 13193.362685280, 
>>> "avgtime": 0.000432783 
>>> }, 
>>> "throttle_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 14.634269979, 
>>> "avgtime": 0.000000480 
>>> }, 
>>> "submit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3873.883076148, 
>>> "avgtime": 0.000127075 
>>> }, 
>>> "commit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 13376.492317331, 
>>> "avgtime": 0.000438790 
>>> }, 
>>> "read_lat": { 
>>> "avgcount": 5873923, 
>>> "sum": 1817.167582057, 
>>> "avgtime": 0.000309361 
>>> }, 
>>> "read_onode_meta_lat": { 
>>> "avgcount": 19608201, 
>>> "sum": 146.770464482, 
>>> "avgtime": 0.000007485 
>>> }, 
>>> "read_wait_aio_lat": { 
>>> "avgcount": 13734278, 
>>> "sum": 2532.578077242, 
>>> "avgtime": 0.000184398 
>>> }, 
>>> "compress_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "decompress_lat": { 
>>> "avgcount": 1346945, 
>>> "sum": 26.227575896, 
>>> "avgtime": 0.000019471 
>>> }, 
>>> "csum_lat": { 
>>> "avgcount": 28020392, 
>>> "sum": 149.587819041, 
>>> "avgtime": 0.000005338 
>>> }, 
>>> "compress_success_count": 0, 
>>> "compress_rejected_count": 0, 
>>> "write_pad_bytes": 352923605, 
>>> "deferred_write_ops": 24373340, 
>>> "deferred_write_bytes": 216791842816, 
>>> "write_penalty_read_ops": 8062366, 
>>> "bluestore_allocated": 3765566013440, 
>>> "bluestore_stored": 4186255221852, 
>>> "bluestore_compressed": 39981379040, 
>>> "bluestore_compressed_allocated": 73748348928, 
>>> "bluestore_compressed_original": 165041381376, 
>>> "bluestore_onodes": 104232, 
>>> "bluestore_onode_hits": 71206874, 
>>> "bluestore_onode_misses": 1217914, 
>>> "bluestore_onode_shard_hits": 260183292, 
>>> "bluestore_onode_shard_misses": 22851573, 
>>> "bluestore_extents": 3394513, 
>>> "bluestore_blobs": 2773587, 
>>> "bluestore_buffers": 0, 
>>> "bluestore_buffer_bytes": 0, 
>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>> "bluestore_write_big": 5648815, 
>>> "bluestore_write_big_bytes": 552502214656, 
>>> "bluestore_write_big_blobs": 12440992, 
>>> "bluestore_write_small": 35883770, 
>>> "bluestore_write_small_bytes": 223436965719, 
>>> "bluestore_write_small_unused": 408125, 
>>> "bluestore_write_small_deferred": 34961455, 
>>> "bluestore_write_small_pre_read": 34961455, 
>>> "bluestore_write_small_new": 514190, 
>>> "bluestore_txc": 30484924, 
>>> "bluestore_onode_reshard": 5144189, 
>>> "bluestore_blob_split": 60104, 
>>> "bluestore_extent_compress": 53347252, 
>>> "bluestore_gc_merged": 21142528, 
>>> "bluestore_read_eio": 0, 
>>> "bluestore_fragmentation_micros": 67 
>>> }, 
>>> "finisher-defered_finisher": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "finisher-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 26625163, 
>>> "sum": 1057.506990951, 
>>> "avgtime": 0.000039718 
>>> } 
>>> }, 
>>> "finisher-objecter-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "objecter": { 
>>> "op_active": 0, 
>>> "op_laggy": 0, 
>>> "op_send": 0, 
>>> "op_send_bytes": 0, 
>>> "op_resend": 0, 
>>> "op_reply": 0, 
>>> "op": 0, 
>>> "op_r": 0, 
>>> "op_w": 0, 
>>> "op_rmw": 0, 
>>> "op_pg": 0, 
>>> "osdop_stat": 0, 
>>> "osdop_create": 0, 
>>> "osdop_read": 0, 
>>> "osdop_write": 0, 
>>> "osdop_writefull": 0, 
>>> "osdop_writesame": 0, 
>>> "osdop_append": 0, 
>>> "osdop_zero": 0, 
>>> "osdop_truncate": 0, 
>>> "osdop_delete": 0, 
>>> "osdop_mapext": 0, 
>>> "osdop_sparse_read": 0, 
>>> "osdop_clonerange": 0, 
>>> "osdop_getxattr": 0, 
>>> "osdop_setxattr": 0, 
>>> "osdop_cmpxattr": 0, 
>>> "osdop_rmxattr": 0, 
>>> "osdop_resetxattrs": 0, 
>>> "osdop_tmap_up": 0, 
>>> "osdop_tmap_put": 0, 
>>> "osdop_tmap_get": 0, 
>>> "osdop_call": 0, 
>>> "osdop_watch": 0, 
>>> "osdop_notify": 0, 
>>> "osdop_src_cmpxattr": 0, 
>>> "osdop_pgls": 0, 
>>> "osdop_pgls_filter": 0, 
>>> "osdop_other": 0, 
>>> "linger_active": 0, 
>>> "linger_send": 0, 
>>> "linger_resend": 0, 
>>> "linger_ping": 0, 
>>> "poolop_active": 0, 
>>> "poolop_send": 0, 
>>> "poolop_resend": 0, 
>>> "poolstat_active": 0, 
>>> "poolstat_send": 0, 
>>> "poolstat_resend": 0, 
>>> "statfs_active": 0, 
>>> "statfs_send": 0, 
>>> "statfs_resend": 0, 
>>> "command_active": 0, 
>>> "command_send": 0, 
>>> "command_resend": 0, 
>>> "map_epoch": 105913, 
>>> "map_full": 0, 
>>> "map_inc": 828, 
>>> "osd_sessions": 0, 
>>> "osd_session_open": 0, 
>>> "osd_session_close": 0, 
>>> "osd_laggy": 0, 
>>> "omap_wr": 0, 
>>> "omap_rd": 0, 
>>> "omap_del": 0 
>>> }, 
>>> "osd": { 
>>> "op_wip": 0, 
>>> "op": 16758102, 
>>> "op_in_bytes": 238398820586, 
>>> "op_out_bytes": 165484999463, 
>>> "op_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 38242.481640842, 
>>> "avgtime": 0.002282029 
>>> }, 
>>> "op_process_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 28644.906310687, 
>>> "avgtime": 0.001709316 
>>> }, 
>>> "op_prepare_latency": { 
>>> "avgcount": 16761367, 
>>> "sum": 3489.856599934, 
>>> "avgtime": 0.000208208 
>>> }, 
>>> "op_r": 6188565, 
>>> "op_r_out_bytes": 165484999463, 
>>> "op_r_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 4507.365756792, 
>>> "avgtime": 0.000728337 
>>> }, 
>>> "op_r_process_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 942.363063429, 
>>> "avgtime": 0.000152274 
>>> }, 
>>> "op_r_prepare_latency": { 
>>> "avgcount": 6188644, 
>>> "sum": 982.866710389, 
>>> "avgtime": 0.000158817 
>>> }, 
>>> "op_w": 10546037, 
>>> "op_w_in_bytes": 238334329494, 
>>> "op_w_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 33160.719998316, 
>>> "avgtime": 0.003144377 
>>> }, 
>>> "op_w_process_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 27668.702029030, 
>>> "avgtime": 0.002623611 
>>> }, 
>>> "op_w_prepare_latency": { 
>>> "avgcount": 10548652, 
>>> "sum": 2499.688609173, 
>>> "avgtime": 0.000236967 
>>> }, 
>>> "op_rw": 23500, 
>>> "op_rw_in_bytes": 64491092, 
>>> "op_rw_out_bytes": 0, 
>>> "op_rw_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 574.395885734, 
>>> "avgtime": 0.024442378 
>>> }, 
>>> "op_rw_process_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 33.841218228, 
>>> "avgtime": 0.001440051 
>>> }, 
>>> "op_rw_prepare_latency": { 
>>> "avgcount": 24071, 
>>> "sum": 7.301280372, 
>>> "avgtime": 0.000303322 
>>> }, 
>>> "op_before_queue_op_lat": { 
>>> "avgcount": 57892986, 
>>> "sum": 1502.117718889, 
>>> "avgtime": 0.000025946 
>>> }, 
>>> "op_before_dequeue_op_lat": { 
>>> "avgcount": 58091683, 
>>> "sum": 45194.453254037, 
>>> "avgtime": 0.000777984 
>>> }, 
>>> "subop": 19784758, 
>>> "subop_in_bytes": 547174969754, 
>>> "subop_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_w": 19784758, 
>>> "subop_w_in_bytes": 547174969754, 
>>> "subop_w_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_pull": 0, 
>>> "subop_pull_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "subop_push": 0, 
>>> "subop_push_in_bytes": 0, 
>>> "subop_push_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "pull": 0, 
>>> "push": 2003, 
>>> "push_out_bytes": 5560009728, 
>>> "recovery_ops": 1940, 
>>> "loadavg": 118, 
>>> "buffer_bytes": 0, 
>>> "history_alloc_Mbytes": 0, 
>>> "history_alloc_num": 0, 
>>> "cached_crc": 0, 
>>> "cached_crc_adjusted": 0, 
>>> "missed_crc": 0, 
>>> "numpg": 243, 
>>> "numpg_primary": 82, 
>>> "numpg_replica": 161, 
>>> "numpg_stray": 0, 
>>> "numpg_removing": 0, 
>>> "heartbeat_to_peers": 10, 
>>> "map_messages": 7013, 
>>> "map_message_epochs": 7143, 
>>> "map_message_epoch_dups": 6315, 
>>> "messages_delayed_for_map": 0, 
>>> "osd_map_cache_hit": 203309, 
>>> "osd_map_cache_miss": 33, 
>>> "osd_map_cache_miss_low": 0, 
>>> "osd_map_cache_miss_low_avg": { 
>>> "avgcount": 0, 
>>> "sum": 0 
>>> }, 
>>> "osd_map_bl_cache_hit": 47012, 
>>> "osd_map_bl_cache_miss": 1681, 
>>> "stat_bytes": 6401248198656, 
>>> "stat_bytes_used": 3777979072512, 
>>> "stat_bytes_avail": 2623269126144, 
>>> "copyfrom": 0, 
>>> "tier_promote": 0, 
>>> "tier_flush": 0, 
>>> "tier_flush_fail": 0, 
>>> "tier_try_flush": 0, 
>>> "tier_try_flush_fail": 0, 
>>> "tier_evict": 0, 
>>> "tier_whiteout": 1631, 
>>> "tier_dirty": 22360, 
>>> "tier_clean": 0, 
>>> "tier_delay": 0, 
>>> "tier_proxy_read": 0, 
>>> "tier_proxy_write": 0, 
>>> "agent_wake": 0, 
>>> "agent_skip": 0, 
>>> "agent_flush": 0, 
>>> "agent_evict": 0, 
>>> "object_ctx_cache_hit": 16311156, 
>>> "object_ctx_cache_total": 17426393, 
>>> "op_cache_hit": 0, 
>>> "osd_tier_flush_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_promote_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_r_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_pg_info": 30483113, 
>>> "osd_pg_fastinfo": 29619885, 
>>> "osd_pg_biginfo": 81703 
>>> }, 
>>> "recoverystate_perf": { 
>>> "initial_latency": { 
>>> "avgcount": 243, 
>>> "sum": 6.869296500, 
>>> "avgtime": 0.028268709 
>>> }, 
>>> "started_latency": { 
>>> "avgcount": 1125, 
>>> "sum": 13551384.917335850, 
>>> "avgtime": 12045.675482076 
>>> }, 
>>> "reset_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 1101.727799040, 
>>> "avgtime": 0.805356578 
>>> }, 
>>> "start_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 0.002014799, 
>>> "avgtime": 0.000001472 
>>> }, 
>>> "primary_latency": { 
>>> "avgcount": 507, 
>>> "sum": 4575560.638823428, 
>>> "avgtime": 9024.774435549 
>>> }, 
>>> "peering_latency": { 
>>> "avgcount": 550, 
>>> "sum": 499.372283616, 
>>> "avgtime": 0.907949606 
>>> }, 
>>> "backfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitremotebackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitlocalbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "notbackfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "repnotrecovering_latency": { 
>>> "avgcount": 1009, 
>>> "sum": 8975301.082274411, 
>>> "avgtime": 8895.243887288 
>>> }, 
>>> "repwaitrecoveryreserved_latency": { 
>>> "avgcount": 420, 
>>> "sum": 99.846056520, 
>>> "avgtime": 0.237728706 
>>> }, 
>>> "repwaitbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "reprecovering_latency": { 
>>> "avgcount": 420, 
>>> "sum": 241.682764382, 
>>> "avgtime": 0.575435153 
>>> }, 
>>> "activating_latency": { 
>>> "avgcount": 507, 
>>> "sum": 16.893347339, 
>>> "avgtime": 0.033320211 
>>> }, 
>>> "waitlocalrecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 672.335512769, 
>>> "avgtime": 3.378570415 
>>> }, 
>>> "waitremoterecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 213.536439363, 
>>> "avgtime": 1.073047433 
>>> }, 
>>> "recovering_latency": { 
>>> "avgcount": 199, 
>>> "sum": 79.007696479, 
>>> "avgtime": 0.397023600 
>>> }, 
>>> "recovered_latency": { 
>>> "avgcount": 507, 
>>> "sum": 14.000732748, 
>>> "avgtime": 0.027614857 
>>> }, 
>>> "clean_latency": { 
>>> "avgcount": 395, 
>>> "sum": 4574325.900371083, 
>>> "avgtime": 11580.571899673 
>>> }, 
>>> "active_latency": { 
>>> "avgcount": 425, 
>>> "sum": 4575107.630123680, 
>>> "avgtime": 10764.959129702 
>>> }, 
>>> "replicaactive_latency": { 
>>> "avgcount": 589, 
>>> "sum": 8975184.499049954, 
>>> "avgtime": 15238.004242869 
>>> }, 
>>> "stray_latency": { 
>>> "avgcount": 818, 
>>> "sum": 800.729455666, 
>>> "avgtime": 0.978886865 
>>> }, 
>>> "getinfo_latency": { 
>>> "avgcount": 550, 
>>> "sum": 15.085667048, 
>>> "avgtime": 0.027428485 
>>> }, 
>>> "getlog_latency": { 
>>> "avgcount": 546, 
>>> "sum": 3.482175693, 
>>> "avgtime": 0.006377611 
>>> }, 
>>> "waitactingchange_latency": { 
>>> "avgcount": 39, 
>>> "sum": 35.444551284, 
>>> "avgtime": 0.908834648 
>>> }, 
>>> "incomplete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "down_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "getmissing_latency": { 
>>> "avgcount": 507, 
>>> "sum": 6.702129624, 
>>> "avgtime": 0.013219190 
>>> }, 
>>> "waitupthru_latency": { 
>>> "avgcount": 507, 
>>> "sum": 474.098261727, 
>>> "avgtime": 0.935105052 
>>> }, 
>>> "notrecovering_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "rocksdb": { 
>>> "get": 28320977, 
>>> "submit_transaction": 30484924, 
>>> "submit_transaction_sync": 26371957, 
>>> "get_latency": { 
>>> "avgcount": 28320977, 
>>> "sum": 325.900908733, 
>>> "avgtime": 0.000011507 
>>> }, 
>>> "submit_latency": { 
>>> "avgcount": 30484924, 
>>> "sum": 1835.888692371, 
>>> "avgtime": 0.000060222 
>>> }, 
>>> "submit_sync_latency": { 
>>> "avgcount": 26371957, 
>>> "sum": 1431.555230628, 
>>> "avgtime": 0.000054283 
>>> }, 
>>> "compact": 0, 
>>> "compact_range": 0, 
>>> "compact_queue_merge": 0, 
>>> "compact_queue_len": 0, 
>>> "rocksdb_write_wal_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_memtable_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_delay_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_pre_and_post_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> } 
>>> } 
>>> 
>>> ----- Mail original ----- 
>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>> À: "aderumier" <aderumier@xxxxxxxxx> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark 
> Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, 
> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" 
> <ceph-devel@xxxxxxxxxxxxxxx> 
>>> Envoyé: Mardi 5 Février 2019 18:56:51 
>>> Objet: Re:  ceph osd commit latency increase over time, 
> until restart 
>>> 
>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>> ok, this is the same 
>>>> 
>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>> "How fragmented bluestore free space is (free extents / max 
> possible number of free extents) * 1000"); 
>>>> 
>>>> 
>>>> Here a graph on last month, with bluestore_fragmentation_micros and 
> latency, 
>>>> 
>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>> hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
>>> it? The same for other OSDs? 
>>> 
>>> This proves some issue with the allocator - generally fragmentation 
>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>> aren't properly merged in run-time. 
>>> 
>>> On the other side I'm not completely sure that latency degradation is 
>>> caused by that - fragmentation growth is relatively small - I don't see 
>>> how this might impact performance that high. 
>>> 
>>> Wondering if you have OSD mempool monitoring (dump_mempools command 
>>> output on admin socket) reports? Do you have any historic data? 
>>> 
>>> If not may I have current output and say a couple more samples with 
>>> 8-12 hours interval? 
>>> 
>>> 
>>> Wrt to backporting bitmap allocator to mimic - we haven't had such 
> plans 
>>> before that but I'll discuss this at BlueStore meeting shortly. 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>>> ----- Mail original ----- 
>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> 
>>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark 
> Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, 
> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" 
> <ceph-devel@xxxxxxxxxxxxxxx> 
>>>> Envoyé: Lundi 4 Février 2019 16:04:38 
>>>> Objet: Re:  ceph osd commit latency increase over time, 
> until restart 
>>>> 
>>>> Thanks Igor, 
>>>> 
>>>>>> Could you please collect BlueStore performance counters right 
> after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> I'm already monitoring with 
>>>> "ceph daemon osd.x perf dump ", (I have 2months history will all 
> counters) 
>>>> 
>>>> but I don't see l_bluestore_fragmentation counter. 
>>>> 
>>>> (but I have bluestore_fragmentation_micros) 
>>>> 
>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator's 
> paramter to 
>>>>>> make sure it's degraded and learn more details. 
>>>> Sorry, It's a critical production cluster, I can't test on it :( 
>>>> But I have a test cluster, maybe I can try to put some load on it, 
> and try to reproduce. 
>>>> 
>>>> 
>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from 
> Nautilus 
>>>>>> and try the difference... 
>>>> Any plan to backport it to mimic ? (But I can wait for Nautilus) 
>>>> perf results of new bitmap allocator seem very promising from what 
> I've seen in PR. 
>>>> 
>>>> 
>>>> 
>>>> ----- Mail original ----- 
>>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Stefan Priebe, 
> Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx> 
>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" 
> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>> Envoyé: Lundi 4 Février 2019 15:51:30 
>>>> Objet: Re:  ceph osd commit latency increase over time, 
> until restart 
>>>> 
>>>> Hi Alexandre, 
>>>> 
>>>> looks like a bug in StupidAllocator. 
>>>> 
>>>> Could you please collect BlueStore performance counters right after 
> OSD 
>>>> startup and once you get high latency. 
>>>> 
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> 
>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>> patch to track latency and some other internal allocator's paramter to 
>>>> make sure it's degraded and learn more details. 
>>>> 
>>>> 
>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>> and try the difference... 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>> Hi again, 
>>>>> 
>>>>> I speak too fast, the problem has occured again, so it's not 
> tcmalloc cache size related. 
>>>>> 
>>>>> 
>>>>> I have notice something using a simple "perf top", 
>>>>> 
>>>>> each time I have this problem (I have seen exactly 4 times the 
> same behaviour), 
>>>>> 
>>>>> when latency is bad, perf top give me : 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> 
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned 
> long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
> long const, unsigned long> >, 256> >, std::pair<unsigned long const, 
> unsigned long>&, std::pair<unsigned long 
>>>>> const, unsigned long>*>::increment_slow() 
>>>>> 
>>>>> (around 10-20% time for both) 
>>>>> 
>>>>> 
>>>>> when latency is good, I don't see them at all. 
>>>>> 
>>>>> 
>>>>> I have used the Mark wallclock profiler, here the results: 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>> 
>>>>> 
>>>>> here an extract of the thread with btree::btree_iterator && 
> StupidAllocator::_aligned_len 
>>>>> 
>>>>> 
>>>>> + 100.00% clone 
>>>>> + 100.00% start_thread 
>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*) 
>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, 
> ThreadPool::TPHandle&) 
>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, 
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>> | + 70.00% 
> PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, 
> ThreadPool::TPHandle&) 
>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% 
> ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% 
> ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 67.00% non-virtual thunk to 
> PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, 
> boost::intrusive_ptr<OpRequest>) 
>>>>> | | | + 67.00% 
> BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, 
> std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, 
> boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>> | | | + 66.00% 
> BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*) 
>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
> boost::intrusive_ptr<BlueStore::Collection>&, 
> boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
> boost::intrusive_ptr<BlueStore::Collection>&, 
> boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 65.00% 
> BlueStore::_do_alloc_write(BlueStore::TransContext*, 
> boost::intrusive_ptr<BlueStore::Collection>, 
> boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, 
> unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, 
> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, 
> unsigned long, long, unsigned long*, unsigned int*) 
>>>>> | | | | | | + 34.00% 
> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned 
> long, unsigned long, std::less<unsigned long>, 
> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
> long const, unsigned long> >, 256> >, std::pair<unsigned long const, 
> unsigned long>&, std::pair<unsigned long const, unsigned 
> long>*>::increment_slow() 
>>>>> | | | | | | + 26.00% 
> StupidAllocator::_aligned_len(interval_set<unsigned long, 
> btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, 
> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned 
> long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Mail original ----- 
>>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> 
>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" 
> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>> Envoyé: Lundi 4 Février 2019 09:38:11 
>>>>> Objet: Re:  ceph osd commit latency increase over 
> time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> some news: 
>>>>> 
>>>>> I have tried with different transparent hugepage values (madvise, 
> never) : no change 
>>>>> 
>>>>> I have tried to increase bluestore_cache_size_ssd to 8G: no change 
>>>>> 
>>>>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 
> 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait 
> some more days to be sure) 
>>>>> 
>>>>> 
>>>>> Note that this behaviour seem to happen really faster (< 2 days) 
> on my big nvme drives (6TB), 
>>>>> my others clusters user 1,6TB ssd. 
>>>>> 
>>>>> Currently I'm using only 1 osd by nvme (I don't have more than 
> 5000iops by osd), but I'll try this week with 2osd by nvme, to see if 
> it's helping. 
>>>>> 
>>>>> 
>>>>> BTW, does somebody have already tested ceph without tcmalloc, with 
> glibc >= 2.26 (which have also thread cache) ? 
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>>> ----- Mail original ----- 
>>>>> De: "aderumier" <aderumier@xxxxxxxxx> 
>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" 
> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>> Envoyé: Mercredi 30 Janvier 2019 19:58:15 
>>>>> Objet: Re:  ceph osd commit latency increase over 
> time, until restart 
>>>>> 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not 
> op_r_process_latency? 
>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot 
> of graphs). 
>>>>> 
>>>>> I just don't see latency difference on reads. (or they are very 
> very small vs the write latency increase) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Mail original ----- 
>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>> À: "aderumier" <aderumier@xxxxxxxxx> 
>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" 
> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>> Envoyé: Mercredi 30 Janvier 2019 19:50:20 
>>>>> Objet: Re:  ceph osd commit latency increase over 
> time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
>>>>>> Hi Stefan, 
>>>>>> 
>>>>>>>> currently i'm in the process of switching back from jemalloc to 
> tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my 
> change. 
>>>>>> Well,I'm really not sure that it's a tcmalloc bug. 
>>>>>> maybe bluestore related (don't have filestore anymore to compare) 
>>>>>> I need to compare with bigger latencies 
>>>>>> 
>>>>>> here an example, when all osd at 20-50ms before restart, then 
> after restart (at 21:15), 1ms 
>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>> 
>>>>>> I observe the latency in my guest vm too, on disks iowait. 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>> 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. 
> Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> here my influxdb queries: 
>>>>>> 
>>>>>> It take op_latency.sum/op_latency.avgcount on last second. 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 
> 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter 
> GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM 
> "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ 
> /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
> fill(previous) 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) 
> FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" 
> =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
> fill(previous) 
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>> 
>>>>> Also why do you monitor op_w_process_latency? but not 
> op_r_process_latency? 
>>>>> 
>>>>> greets, 
>>>>> Stefan 
>>>>> 
>>>>>> ----- Mail original ----- 
>>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>>> À: "aderumier" <aderumier@xxxxxxxxx>, "Sage Weil" 
> <sage@xxxxxxxxxxxx> 
>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" 
> <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>>> Envoyé: Mercredi 30 Janvier 2019 08:45:33 
>>>>>> Objet: Re:  ceph osd commit latency increase over 
> time, until restart 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> here some new results, 
>>>>>>> different osd/ different cluster 
>>>>>>> 
>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>> 
>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, 
> but maybe I'm wrong. 
>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>> currently i'm in the process of switching back from jemalloc to 
> tcmalloc 
>>>>>> like suggested. This report makes me a little nervous about my 
> change. 
>>>>>> 
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> 
>>>>>> I would like to check if i see the same behaviour. 
>>>>>> 
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> ----- Mail original ----- 
>>>>>>> De: "Sage Weil" <sage@xxxxxxxxxxxx> 
>>>>>>> À: "aderumier" <aderumier@xxxxxxxxx> 
>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" 
> <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>>>> Envoyé: Vendredi 25 Janvier 2019 10:49:02 
>>>>>>> Objet: Re: ceph osd commit latency increase over time, until 
> restart 
>>>>>>> 
>>>>>>> Can you capture a perf top or perf record to see where teh CPU 
> time is 
>>>>>>> going on one of the OSDs wth a high latency? 
>>>>>>> 
>>>>>>> Thanks! 
>>>>>>> sage 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>> 
>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or 
> nvme drivers, 
>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + 
> snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>>> 
>>>>>>>> When the osd are refreshly started, the commit latency is 
> between 0,5-1ms. 
>>>>>>>> 
>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by 
> day), until reaching crazy 
>>>>>>>> values like 20-200ms. 
>>>>>>>> 
>>>>>>>> Some example graphs: 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>> 
>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>> 
>>>>>>>> The latency of physical disks is ok. (Clusters are far to be 
> full loaded) 
>>>>>>>> 
>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>>>>>> 
>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a 
> bluestore memory bug ? 
>>>>>>>> 
>>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards, 
>>>>>>>> 
>>>>>>>> Alexandre 
>>>>>>>> 
>>>>>>>> 
>>>>>>> _______________________________________________ 
>>>>>>> ceph-users mailing list 
>>>>>>> ceph-users@xxxxxxxxxxxxxx 
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>> 
>>> 
>> 
>> 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>> 
>> Thanks again for helping ! 
>> 
>> 
>> 
>> I have upgrade to last mimic this weekend, and with new autotune memory, 
>> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
>> 
>> 
>> I have done a lot of perf dump and mempool dump and ps of process to see rss memory at different hours, 
>> here the reports for osd.0: 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/ 
>> 
>> 
>> osd has been started the 12-02-2019 at 08:00 
>> 
>> first report after 1h running 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>> 
>> 
>> 
>> report after 24 before counter resets 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>> 
>> report 1h after counter reset 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>> 
>> 
>> 
>> 
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> Then after that, slowly decreasing. 
>> 
>> 
>> Another strange thing, 
>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt 
>> Then is decreasing over time (around 3,7G this morning), but RSS is still at 8G 
>> 
>> 
>> I'm graphing mempools counters too since yesterday, so I'll able to track them over time. 
>> 
>> ----- Mail original ----- 
>> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx> 
>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>> Envoyé: Lundi 11 Février 2019 12:03:17 
>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>> 
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> another mempool dump after 1h run. (latency ok) 
>>> 
>>> Biggest difference: 
>>> 
>>> before restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (other caches seem to be quite low too, like bluestore_cache_other take all the memory) 
>>> 
>>> 
>>> After restart 
>>> ------------- 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> 
>> This is fine as cache is warming after restart and some rebalancing 
>> between data and metadata might occur. 
>> 
>> What relates to allocator and most probably to fragmentation growth is : 
>> 
>> "bluestore_alloc": { 
>> "items": 165053952, 
>> "bytes": 165053952 
>> }, 
>> 
>> which had been higher before the reset (if I got these dumps' order 
>> properly) 
>> 
>> "bluestore_alloc": { 
>> "items": 210243456, 
>> "bytes": 210243456 
>> }, 
>> 
>> But as I mentioned - I'm not 100% sure this might cause such a huge 
>> latency increase... 
>> 
>> Do you have perf counters dump after the restart? 
>> 
>> Could you collect some more dumps - for both mempool and perf counters? 
>> 
>> So ideally I'd like to have: 
>> 
>> 1) mempool/perf counters dumps after the restart (1hour is OK) 
>> 
>> 2) mempool/perf counters dumps in 24+ hours after restart 
>> 
>> 3) reset perf counters after 2), wait for 1 hour (and without OSD 
>> restart) and dump mempool/perf counters again. 
>> 
>> So we'll be able to learn both allocator mem usage growth and operation 
>> latency distribution for the following periods: 
>> 
>> a) 1st hour after restart 
>> 
>> b) 25th hour. 
>> 
>> 
>> Thanks, 
>> 
>> Igor 
>> 
>> 
>>> full mempool dump after restart 
>>> ------------------------------- 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 165053952, 
>>> "bytes": 165053952 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 22225, 
>>> "bytes": 14935200 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 11, 
>>> "bytes": 8184 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 5047, 
>>> "bytes": 22673736 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 91, 
>>> "bytes": 1662976 
>>> }, 
>>> "bluefs": { 
>>> "items": 1907, 
>>> "bytes": 95600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 19664, 
>>> "bytes": 25486050 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 46189, 
>>> "bytes": 2956096 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 17, 
>>> "bytes": 214366 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 889673, 
>>> "bytes": 367160400 
>>> }, 
>>> "osdmap": { 
>>> "items": 3803, 
>>> "bytes": 224552 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 178515204, 
>>> "bytes": 2160630547 
>>> } 
>>> } 
>>> } 
>>> 
>>> ----- Mail original ----- 
>>> De: "aderumier" <aderumier@xxxxxxxxx> 
>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>> Envoyé: Vendredi 8 Février 2019 16:14:54 
>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>> 
>>> I'm just seeing 
>>> 
>>> StupidAllocator::_aligned_len 
>>> and 
>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>> 
>>> on 1 osd, both 10%. 
>>> 
>>> here the dump_mempools 
>>> 
>>> { 
>>> "mempool": { 
>>> "by_pool": { 
>>> "bloom_filter": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_alloc": { 
>>> "items": 210243456, 
>>> "bytes": 210243456 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> "bluestore_cache_onode": { 
>>> "items": 105637, 
>>> "bytes": 70988064 
>>> }, 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_fsck": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "bluestore_txc": { 
>>> "items": 12, 
>>> "bytes": 8928 
>>> }, 
>>> "bluestore_writing_deferred": { 
>>> "items": 406, 
>>> "bytes": 4792868 
>>> }, 
>>> "bluestore_writing": { 
>>> "items": 66, 
>>> "bytes": 1085440 
>>> }, 
>>> "bluefs": { 
>>> "items": 1882, 
>>> "bytes": 93600 
>>> }, 
>>> "buffer_anon": { 
>>> "items": 138986, 
>>> "bytes": 24983701 
>>> }, 
>>> "buffer_meta": { 
>>> "items": 544, 
>>> "bytes": 34816 
>>> }, 
>>> "osd": { 
>>> "items": 243, 
>>> "bytes": 3089016 
>>> }, 
>>> "osd_mapbl": { 
>>> "items": 36, 
>>> "bytes": 179308 
>>> }, 
>>> "osd_pglog": { 
>>> "items": 952564, 
>>> "bytes": 372459684 
>>> }, 
>>> "osdmap": { 
>>> "items": 3639, 
>>> "bytes": 224664 
>>> }, 
>>> "osdmap_mapping": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "pgmap": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "mds_co": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_1": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> }, 
>>> "unittest_2": { 
>>> "items": 0, 
>>> "bytes": 0 
>>> } 
>>> }, 
>>> "total": { 
>>> "items": 260109445, 
>>> "bytes": 2228370845 
>>> } 
>>> } 
>>> } 
>>> 
>>> 
>>> and the perf dump 
>>> 
>>> root@ceph5-2:~# ceph daemon osd.4 perf dump 
>>> { 
>>> "AsyncMessenger::Worker-0": { 
>>> "msgr_recv_messages": 22948570, 
>>> "msgr_send_messages": 22561570, 
>>> "msgr_recv_bytes": 333085080271, 
>>> "msgr_send_bytes": 261798871204, 
>>> "msgr_created_connections": 6152, 
>>> "msgr_active_connections": 2701, 
>>> "msgr_running_total_time": 1055.197867330, 
>>> "msgr_running_send_time": 352.764480121, 
>>> "msgr_running_recv_time": 499.206831955, 
>>> "msgr_running_fast_dispatch_time": 130.982201607 
>>> }, 
>>> "AsyncMessenger::Worker-1": { 
>>> "msgr_recv_messages": 18801593, 
>>> "msgr_send_messages": 18430264, 
>>> "msgr_recv_bytes": 306871760934, 
>>> "msgr_send_bytes": 192789048666, 
>>> "msgr_created_connections": 5773, 
>>> "msgr_active_connections": 2721, 
>>> "msgr_running_total_time": 816.821076305, 
>>> "msgr_running_send_time": 261.353228926, 
>>> "msgr_running_recv_time": 394.035587911, 
>>> "msgr_running_fast_dispatch_time": 104.012155720 
>>> }, 
>>> "AsyncMessenger::Worker-2": { 
>>> "msgr_recv_messages": 18463400, 
>>> "msgr_send_messages": 18105856, 
>>> "msgr_recv_bytes": 187425453590, 
>>> "msgr_send_bytes": 220735102555, 
>>> "msgr_created_connections": 5897, 
>>> "msgr_active_connections": 2605, 
>>> "msgr_running_total_time": 807.186854324, 
>>> "msgr_running_send_time": 296.834435839, 
>>> "msgr_running_recv_time": 351.364389691, 
>>> "msgr_running_fast_dispatch_time": 101.215776792 
>>> }, 
>>> "bluefs": { 
>>> "gift_bytes": 0, 
>>> "reclaim_bytes": 0, 
>>> "db_total_bytes": 256050724864, 
>>> "db_used_bytes": 12413042688, 
>>> "wal_total_bytes": 0, 
>>> "wal_used_bytes": 0, 
>>> "slow_total_bytes": 0, 
>>> "slow_used_bytes": 0, 
>>> "num_files": 209, 
>>> "log_bytes": 10383360, 
>>> "log_compactions": 14, 
>>> "logged_bytes": 336498688, 
>>> "files_written_wal": 2, 
>>> "files_written_sst": 4499, 
>>> "bytes_written_wal": 417989099783, 
>>> "bytes_written_sst": 213188750209 
>>> }, 
>>> "bluestore": { 
>>> "kv_flush_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 26.734038497, 
>>> "avgtime": 0.000001013 
>>> }, 
>>> "kv_commit_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3397.491150603, 
>>> "avgtime": 0.000128829 
>>> }, 
>>> "kv_lat": { 
>>> "avgcount": 26371957, 
>>> "sum": 3424.225189100, 
>>> "avgtime": 0.000129843 
>>> }, 
>>> "state_prepare_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3689.542105337, 
>>> "avgtime": 0.000121028 
>>> }, 
>>> "state_aio_wait_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 509.864546111, 
>>> "avgtime": 0.000016725 
>>> }, 
>>> "state_io_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 24.534052953, 
>>> "avgtime": 0.000000804 
>>> }, 
>>> "state_kv_queued_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3488.338424238, 
>>> "avgtime": 0.000114428 
>>> }, 
>>> "state_kv_commiting_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 5660.437003432, 
>>> "avgtime": 0.000185679 
>>> }, 
>>> "state_kv_done_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 7.763511500, 
>>> "avgtime": 0.000000254 
>>> }, 
>>> "state_deferred_queued_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 666071.296856696, 
>>> "avgtime": 0.025281557 
>>> }, 
>>> "state_deferred_aio_wait_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 1755.660547071, 
>>> "avgtime": 0.000066638 
>>> }, 
>>> "state_deferred_cleanup_lat": { 
>>> "avgcount": 26346134, 
>>> "sum": 185465.151653703, 
>>> "avgtime": 0.007039558 
>>> }, 
>>> "state_finishing_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 3.046847481, 
>>> "avgtime": 0.000000099 
>>> }, 
>>> "state_done_lat": { 
>>> "avgcount": 30484920, 
>>> "sum": 13193.362685280, 
>>> "avgtime": 0.000432783 
>>> }, 
>>> "throttle_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 14.634269979, 
>>> "avgtime": 0.000000480 
>>> }, 
>>> "submit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 3873.883076148, 
>>> "avgtime": 0.000127075 
>>> }, 
>>> "commit_lat": { 
>>> "avgcount": 30484924, 
>>> "sum": 13376.492317331, 
>>> "avgtime": 0.000438790 
>>> }, 
>>> "read_lat": { 
>>> "avgcount": 5873923, 
>>> "sum": 1817.167582057, 
>>> "avgtime": 0.000309361 
>>> }, 
>>> "read_onode_meta_lat": { 
>>> "avgcount": 19608201, 
>>> "sum": 146.770464482, 
>>> "avgtime": 0.000007485 
>>> }, 
>>> "read_wait_aio_lat": { 
>>> "avgcount": 13734278, 
>>> "sum": 2532.578077242, 
>>> "avgtime": 0.000184398 
>>> }, 
>>> "compress_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "decompress_lat": { 
>>> "avgcount": 1346945, 
>>> "sum": 26.227575896, 
>>> "avgtime": 0.000019471 
>>> }, 
>>> "csum_lat": { 
>>> "avgcount": 28020392, 
>>> "sum": 149.587819041, 
>>> "avgtime": 0.000005338 
>>> }, 
>>> "compress_success_count": 0, 
>>> "compress_rejected_count": 0, 
>>> "write_pad_bytes": 352923605, 
>>> "deferred_write_ops": 24373340, 
>>> "deferred_write_bytes": 216791842816, 
>>> "write_penalty_read_ops": 8062366, 
>>> "bluestore_allocated": 3765566013440, 
>>> "bluestore_stored": 4186255221852, 
>>> "bluestore_compressed": 39981379040, 
>>> "bluestore_compressed_allocated": 73748348928, 
>>> "bluestore_compressed_original": 165041381376, 
>>> "bluestore_onodes": 104232, 
>>> "bluestore_onode_hits": 71206874, 
>>> "bluestore_onode_misses": 1217914, 
>>> "bluestore_onode_shard_hits": 260183292, 
>>> "bluestore_onode_shard_misses": 22851573, 
>>> "bluestore_extents": 3394513, 
>>> "bluestore_blobs": 2773587, 
>>> "bluestore_buffers": 0, 
>>> "bluestore_buffer_bytes": 0, 
>>> "bluestore_buffer_hit_bytes": 62026011221, 
>>> "bluestore_buffer_miss_bytes": 995233669922, 
>>> "bluestore_write_big": 5648815, 
>>> "bluestore_write_big_bytes": 552502214656, 
>>> "bluestore_write_big_blobs": 12440992, 
>>> "bluestore_write_small": 35883770, 
>>> "bluestore_write_small_bytes": 223436965719, 
>>> "bluestore_write_small_unused": 408125, 
>>> "bluestore_write_small_deferred": 34961455, 
>>> "bluestore_write_small_pre_read": 34961455, 
>>> "bluestore_write_small_new": 514190, 
>>> "bluestore_txc": 30484924, 
>>> "bluestore_onode_reshard": 5144189, 
>>> "bluestore_blob_split": 60104, 
>>> "bluestore_extent_compress": 53347252, 
>>> "bluestore_gc_merged": 21142528, 
>>> "bluestore_read_eio": 0, 
>>> "bluestore_fragmentation_micros": 67 
>>> }, 
>>> "finisher-defered_finisher": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "finisher-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 26625163, 
>>> "sum": 1057.506990951, 
>>> "avgtime": 0.000039718 
>>> } 
>>> }, 
>>> "finisher-objecter-finisher-0": { 
>>> "queue_len": 0, 
>>> "complete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.0::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.1::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.2::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.3::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.4::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.5::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.6::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::sdata_wait_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "mutex-OSDShard.7::shard_lock": { 
>>> "wait": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "objecter": { 
>>> "op_active": 0, 
>>> "op_laggy": 0, 
>>> "op_send": 0, 
>>> "op_send_bytes": 0, 
>>> "op_resend": 0, 
>>> "op_reply": 0, 
>>> "op": 0, 
>>> "op_r": 0, 
>>> "op_w": 0, 
>>> "op_rmw": 0, 
>>> "op_pg": 0, 
>>> "osdop_stat": 0, 
>>> "osdop_create": 0, 
>>> "osdop_read": 0, 
>>> "osdop_write": 0, 
>>> "osdop_writefull": 0, 
>>> "osdop_writesame": 0, 
>>> "osdop_append": 0, 
>>> "osdop_zero": 0, 
>>> "osdop_truncate": 0, 
>>> "osdop_delete": 0, 
>>> "osdop_mapext": 0, 
>>> "osdop_sparse_read": 0, 
>>> "osdop_clonerange": 0, 
>>> "osdop_getxattr": 0, 
>>> "osdop_setxattr": 0, 
>>> "osdop_cmpxattr": 0, 
>>> "osdop_rmxattr": 0, 
>>> "osdop_resetxattrs": 0, 
>>> "osdop_tmap_up": 0, 
>>> "osdop_tmap_put": 0, 
>>> "osdop_tmap_get": 0, 
>>> "osdop_call": 0, 
>>> "osdop_watch": 0, 
>>> "osdop_notify": 0, 
>>> "osdop_src_cmpxattr": 0, 
>>> "osdop_pgls": 0, 
>>> "osdop_pgls_filter": 0, 
>>> "osdop_other": 0, 
>>> "linger_active": 0, 
>>> "linger_send": 0, 
>>> "linger_resend": 0, 
>>> "linger_ping": 0, 
>>> "poolop_active": 0, 
>>> "poolop_send": 0, 
>>> "poolop_resend": 0, 
>>> "poolstat_active": 0, 
>>> "poolstat_send": 0, 
>>> "poolstat_resend": 0, 
>>> "statfs_active": 0, 
>>> "statfs_send": 0, 
>>> "statfs_resend": 0, 
>>> "command_active": 0, 
>>> "command_send": 0, 
>>> "command_resend": 0, 
>>> "map_epoch": 105913, 
>>> "map_full": 0, 
>>> "map_inc": 828, 
>>> "osd_sessions": 0, 
>>> "osd_session_open": 0, 
>>> "osd_session_close": 0, 
>>> "osd_laggy": 0, 
>>> "omap_wr": 0, 
>>> "omap_rd": 0, 
>>> "omap_del": 0 
>>> }, 
>>> "osd": { 
>>> "op_wip": 0, 
>>> "op": 16758102, 
>>> "op_in_bytes": 238398820586, 
>>> "op_out_bytes": 165484999463, 
>>> "op_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 38242.481640842, 
>>> "avgtime": 0.002282029 
>>> }, 
>>> "op_process_latency": { 
>>> "avgcount": 16758102, 
>>> "sum": 28644.906310687, 
>>> "avgtime": 0.001709316 
>>> }, 
>>> "op_prepare_latency": { 
>>> "avgcount": 16761367, 
>>> "sum": 3489.856599934, 
>>> "avgtime": 0.000208208 
>>> }, 
>>> "op_r": 6188565, 
>>> "op_r_out_bytes": 165484999463, 
>>> "op_r_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 4507.365756792, 
>>> "avgtime": 0.000728337 
>>> }, 
>>> "op_r_process_latency": { 
>>> "avgcount": 6188565, 
>>> "sum": 942.363063429, 
>>> "avgtime": 0.000152274 
>>> }, 
>>> "op_r_prepare_latency": { 
>>> "avgcount": 6188644, 
>>> "sum": 982.866710389, 
>>> "avgtime": 0.000158817 
>>> }, 
>>> "op_w": 10546037, 
>>> "op_w_in_bytes": 238334329494, 
>>> "op_w_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 33160.719998316, 
>>> "avgtime": 0.003144377 
>>> }, 
>>> "op_w_process_latency": { 
>>> "avgcount": 10546037, 
>>> "sum": 27668.702029030, 
>>> "avgtime": 0.002623611 
>>> }, 
>>> "op_w_prepare_latency": { 
>>> "avgcount": 10548652, 
>>> "sum": 2499.688609173, 
>>> "avgtime": 0.000236967 
>>> }, 
>>> "op_rw": 23500, 
>>> "op_rw_in_bytes": 64491092, 
>>> "op_rw_out_bytes": 0, 
>>> "op_rw_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 574.395885734, 
>>> "avgtime": 0.024442378 
>>> }, 
>>> "op_rw_process_latency": { 
>>> "avgcount": 23500, 
>>> "sum": 33.841218228, 
>>> "avgtime": 0.001440051 
>>> }, 
>>> "op_rw_prepare_latency": { 
>>> "avgcount": 24071, 
>>> "sum": 7.301280372, 
>>> "avgtime": 0.000303322 
>>> }, 
>>> "op_before_queue_op_lat": { 
>>> "avgcount": 57892986, 
>>> "sum": 1502.117718889, 
>>> "avgtime": 0.000025946 
>>> }, 
>>> "op_before_dequeue_op_lat": { 
>>> "avgcount": 58091683, 
>>> "sum": 45194.453254037, 
>>> "avgtime": 0.000777984 
>>> }, 
>>> "subop": 19784758, 
>>> "subop_in_bytes": 547174969754, 
>>> "subop_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_w": 19784758, 
>>> "subop_w_in_bytes": 547174969754, 
>>> "subop_w_latency": { 
>>> "avgcount": 19784758, 
>>> "sum": 13019.714424060, 
>>> "avgtime": 0.000658067 
>>> }, 
>>> "subop_pull": 0, 
>>> "subop_pull_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "subop_push": 0, 
>>> "subop_push_in_bytes": 0, 
>>> "subop_push_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "pull": 0, 
>>> "push": 2003, 
>>> "push_out_bytes": 5560009728, 
>>> "recovery_ops": 1940, 
>>> "loadavg": 118, 
>>> "buffer_bytes": 0, 
>>> "history_alloc_Mbytes": 0, 
>>> "history_alloc_num": 0, 
>>> "cached_crc": 0, 
>>> "cached_crc_adjusted": 0, 
>>> "missed_crc": 0, 
>>> "numpg": 243, 
>>> "numpg_primary": 82, 
>>> "numpg_replica": 161, 
>>> "numpg_stray": 0, 
>>> "numpg_removing": 0, 
>>> "heartbeat_to_peers": 10, 
>>> "map_messages": 7013, 
>>> "map_message_epochs": 7143, 
>>> "map_message_epoch_dups": 6315, 
>>> "messages_delayed_for_map": 0, 
>>> "osd_map_cache_hit": 203309, 
>>> "osd_map_cache_miss": 33, 
>>> "osd_map_cache_miss_low": 0, 
>>> "osd_map_cache_miss_low_avg": { 
>>> "avgcount": 0, 
>>> "sum": 0 
>>> }, 
>>> "osd_map_bl_cache_hit": 47012, 
>>> "osd_map_bl_cache_miss": 1681, 
>>> "stat_bytes": 6401248198656, 
>>> "stat_bytes_used": 3777979072512, 
>>> "stat_bytes_avail": 2623269126144, 
>>> "copyfrom": 0, 
>>> "tier_promote": 0, 
>>> "tier_flush": 0, 
>>> "tier_flush_fail": 0, 
>>> "tier_try_flush": 0, 
>>> "tier_try_flush_fail": 0, 
>>> "tier_evict": 0, 
>>> "tier_whiteout": 1631, 
>>> "tier_dirty": 22360, 
>>> "tier_clean": 0, 
>>> "tier_delay": 0, 
>>> "tier_proxy_read": 0, 
>>> "tier_proxy_write": 0, 
>>> "agent_wake": 0, 
>>> "agent_skip": 0, 
>>> "agent_flush": 0, 
>>> "agent_evict": 0, 
>>> "object_ctx_cache_hit": 16311156, 
>>> "object_ctx_cache_total": 17426393, 
>>> "op_cache_hit": 0, 
>>> "osd_tier_flush_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_promote_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_tier_r_lat": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "osd_pg_info": 30483113, 
>>> "osd_pg_fastinfo": 29619885, 
>>> "osd_pg_biginfo": 81703 
>>> }, 
>>> "recoverystate_perf": { 
>>> "initial_latency": { 
>>> "avgcount": 243, 
>>> "sum": 6.869296500, 
>>> "avgtime": 0.028268709 
>>> }, 
>>> "started_latency": { 
>>> "avgcount": 1125, 
>>> "sum": 13551384.917335850, 
>>> "avgtime": 12045.675482076 
>>> }, 
>>> "reset_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 1101.727799040, 
>>> "avgtime": 0.805356578 
>>> }, 
>>> "start_latency": { 
>>> "avgcount": 1368, 
>>> "sum": 0.002014799, 
>>> "avgtime": 0.000001472 
>>> }, 
>>> "primary_latency": { 
>>> "avgcount": 507, 
>>> "sum": 4575560.638823428, 
>>> "avgtime": 9024.774435549 
>>> }, 
>>> "peering_latency": { 
>>> "avgcount": 550, 
>>> "sum": 499.372283616, 
>>> "avgtime": 0.907949606 
>>> }, 
>>> "backfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitremotebackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "waitlocalbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "notbackfilling_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "repnotrecovering_latency": { 
>>> "avgcount": 1009, 
>>> "sum": 8975301.082274411, 
>>> "avgtime": 8895.243887288 
>>> }, 
>>> "repwaitrecoveryreserved_latency": { 
>>> "avgcount": 420, 
>>> "sum": 99.846056520, 
>>> "avgtime": 0.237728706 
>>> }, 
>>> "repwaitbackfillreserved_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "reprecovering_latency": { 
>>> "avgcount": 420, 
>>> "sum": 241.682764382, 
>>> "avgtime": 0.575435153 
>>> }, 
>>> "activating_latency": { 
>>> "avgcount": 507, 
>>> "sum": 16.893347339, 
>>> "avgtime": 0.033320211 
>>> }, 
>>> "waitlocalrecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 672.335512769, 
>>> "avgtime": 3.378570415 
>>> }, 
>>> "waitremoterecoveryreserved_latency": { 
>>> "avgcount": 199, 
>>> "sum": 213.536439363, 
>>> "avgtime": 1.073047433 
>>> }, 
>>> "recovering_latency": { 
>>> "avgcount": 199, 
>>> "sum": 79.007696479, 
>>> "avgtime": 0.397023600 
>>> }, 
>>> "recovered_latency": { 
>>> "avgcount": 507, 
>>> "sum": 14.000732748, 
>>> "avgtime": 0.027614857 
>>> }, 
>>> "clean_latency": { 
>>> "avgcount": 395, 
>>> "sum": 4574325.900371083, 
>>> "avgtime": 11580.571899673 
>>> }, 
>>> "active_latency": { 
>>> "avgcount": 425, 
>>> "sum": 4575107.630123680, 
>>> "avgtime": 10764.959129702 
>>> }, 
>>> "replicaactive_latency": { 
>>> "avgcount": 589, 
>>> "sum": 8975184.499049954, 
>>> "avgtime": 15238.004242869 
>>> }, 
>>> "stray_latency": { 
>>> "avgcount": 818, 
>>> "sum": 800.729455666, 
>>> "avgtime": 0.978886865 
>>> }, 
>>> "getinfo_latency": { 
>>> "avgcount": 550, 
>>> "sum": 15.085667048, 
>>> "avgtime": 0.027428485 
>>> }, 
>>> "getlog_latency": { 
>>> "avgcount": 546, 
>>> "sum": 3.482175693, 
>>> "avgtime": 0.006377611 
>>> }, 
>>> "waitactingchange_latency": { 
>>> "avgcount": 39, 
>>> "sum": 35.444551284, 
>>> "avgtime": 0.908834648 
>>> }, 
>>> "incomplete_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "down_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "getmissing_latency": { 
>>> "avgcount": 507, 
>>> "sum": 6.702129624, 
>>> "avgtime": 0.013219190 
>>> }, 
>>> "waitupthru_latency": { 
>>> "avgcount": 507, 
>>> "sum": 474.098261727, 
>>> "avgtime": 0.935105052 
>>> }, 
>>> "notrecovering_latency": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> }, 
>>> "rocksdb": { 
>>> "get": 28320977, 
>>> "submit_transaction": 30484924, 
>>> "submit_transaction_sync": 26371957, 
>>> "get_latency": { 
>>> "avgcount": 28320977, 
>>> "sum": 325.900908733, 
>>> "avgtime": 0.000011507 
>>> }, 
>>> "submit_latency": { 
>>> "avgcount": 30484924, 
>>> "sum": 1835.888692371, 
>>> "avgtime": 0.000060222 
>>> }, 
>>> "submit_sync_latency": { 
>>> "avgcount": 26371957, 
>>> "sum": 1431.555230628, 
>>> "avgtime": 0.000054283 
>>> }, 
>>> "compact": 0, 
>>> "compact_range": 0, 
>>> "compact_queue_merge": 0, 
>>> "compact_queue_len": 0, 
>>> "rocksdb_write_wal_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_memtable_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_delay_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> }, 
>>> "rocksdb_write_pre_and_post_time": { 
>>> "avgcount": 0, 
>>> "sum": 0.000000000, 
>>> "avgtime": 0.000000000 
>>> } 
>>> } 
>>> } 
>>> 
>>> ----- Mail original ----- 
>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>> À: "aderumier" <aderumier@xxxxxxxxx> 
>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>> Envoyé: Mardi 5 Février 2019 18:56:51 
>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>> 
>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>>>>> but I don't see l_bluestore_fragmentation counter. 
>>>>>> (but I have bluestore_fragmentation_micros) 
>>>> ok, this is the same 
>>>> 
>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); 
>>>> 
>>>> 
>>>> Here a graph on last month, with bluestore_fragmentation_micros and latency, 
>>>> 
>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 
>>> hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
>>> it? The same for other OSDs? 
>>> 
>>> This proves some issue with the allocator - generally fragmentation 
>>> might grow but it shouldn't reset on restart. Looks like some intervals 
>>> aren't properly merged in run-time. 
>>> 
>>> On the other side I'm not completely sure that latency degradation is 
>>> caused by that - fragmentation growth is relatively small - I don't see 
>>> how this might impact performance that high. 
>>> 
>>> Wondering if you have OSD mempool monitoring (dump_mempools command 
>>> output on admin socket) reports? Do you have any historic data? 
>>> 
>>> If not may I have current output and say a couple more samples with 
>>> 8-12 hours interval? 
>>> 
>>> 
>>> Wrt to backporting bitmap allocator to mimic - we haven't had such plans 
>>> before that but I'll discuss this at BlueStore meeting shortly. 
>>> 
>>> 
>>> Thanks, 
>>> 
>>> Igor 
>>> 
>>>> ----- Mail original ----- 
>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> 
>>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>> Envoyé: Lundi 4 Février 2019 16:04:38 
>>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Thanks Igor, 
>>>> 
>>>>>> Could you please collect BlueStore performance counters right after OSD 
>>>>>> startup and once you get high latency. 
>>>>>> 
>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> I'm already monitoring with 
>>>> "ceph daemon osd.x perf dump ", (I have 2months history will all counters) 
>>>> 
>>>> but I don't see l_bluestore_fragmentation counter. 
>>>> 
>>>> (but I have bluestore_fragmentation_micros) 
>>>> 
>>>> 
>>>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>>>> patch to track latency and some other internal allocator's paramter to 
>>>>>> make sure it's degraded and learn more details. 
>>>> Sorry, It's a critical production cluster, I can't test on it :( 
>>>> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce. 
>>>> 
>>>> 
>>>> 
>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>>>> and try the difference... 
>>>> Any plan to backport it to mimic ? (But I can wait for Nautilus) 
>>>> perf results of new bitmap allocator seem very promising from what I've seen in PR. 
>>>> 
>>>> 
>>>> 
>>>> ----- Mail original ----- 
>>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> 
>>>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx> 
>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>> Envoyé: Lundi 4 Février 2019 15:51:30 
>>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>>> 
>>>> Hi Alexandre, 
>>>> 
>>>> looks like a bug in StupidAllocator. 
>>>> 
>>>> Could you please collect BlueStore performance counters right after OSD 
>>>> startup and once you get high latency. 
>>>> 
>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
>>>> 
>>>> Also if you're able to rebuild the code I can probably make a simple 
>>>> patch to track latency and some other internal allocator's paramter to 
>>>> make sure it's degraded and learn more details. 
>>>> 
>>>> 
>>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>>> and try the difference... 
>>>> 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Igor 
>>>> 
>>>> 
>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
>>>>> Hi again, 
>>>>> 
>>>>> I speak too fast, the problem has occured again, so it's not tcmalloc cache size related. 
>>>>> 
>>>>> 
>>>>> I have notice something using a simple "perf top", 
>>>>> 
>>>>> each time I have this problem (I have seen exactly 4 times the same behaviour), 
>>>>> 
>>>>> when latency is bad, perf top give me : 
>>>>> 
>>>>> StupidAllocator::_aligned_len 
>>>>> and 
>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo 
>>>>> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long 
>>>>> const, unsigned long>*>::increment_slow() 
>>>>> 
>>>>> (around 10-20% time for both) 
>>>>> 
>>>>> 
>>>>> when latency is good, I don't see them at all. 
>>>>> 
>>>>> 
>>>>> I have used the Mark wallclock profiler, here the results: 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
>>>>> 
>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
>>>>> 
>>>>> 
>>>>> here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len 
>>>>> 
>>>>> 
>>>>> + 100.00% clone 
>>>>> + 100.00% start_thread 
>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) 
>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) 
>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) 
>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) 
>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) 
>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) 
>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) 
>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) 
>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) 
>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) 
>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) 
>>>>> | | | | | | + 34.00% btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() 
>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Mail original ----- 
>>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> 
>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>> Envoyé: Lundi 4 Février 2019 09:38:11 
>>>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> some news: 
>>>>> 
>>>>> I have tried with different transparent hugepage values (madvise, never) : no change 
>>>>> 
>>>>> I have tried to increase bluestore_cache_size_ssd to 8G: no change 
>>>>> 
>>>>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure) 
>>>>> 
>>>>> 
>>>>> Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB), 
>>>>> my others clusters user 1,6TB ssd. 
>>>>> 
>>>>> Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping. 
>>>>> 
>>>>> 
>>>>> BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ? 
>>>>> 
>>>>> 
>>>>> Regards, 
>>>>> 
>>>>> Alexandre 
>>>>> 
>>>>> 
>>>>> ----- Mail original ----- 
>>>>> De: "aderumier" <aderumier@xxxxxxxxx> 
>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>> Envoyé: Mercredi 30 Janvier 2019 19:58:15 
>>>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>>>> op_r_latency but instead op_latency? 
>>>>>>> 
>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). 
>>>>> 
>>>>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase) 
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Mail original ----- 
>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>> À: "aderumier" <aderumier@xxxxxxxxx> 
>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>> Envoyé: Mercredi 30 Janvier 2019 19:50:20 
>>>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>>>> 
>>>>> Hi, 
>>>>> 
>>>>> Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
>>>>>> Hi Stefan, 
>>>>>> 
>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>> Well,I'm really not sure that it's a tcmalloc bug. 
>>>>>> maybe bluestore related (don't have filestore anymore to compare) 
>>>>>> I need to compare with bigger latencies 
>>>>>> 
>>>>>> here an example, when all osd at 20-50ms before restart, then after restart (at 21:15), 1ms 
>>>>>> http://odisoweb1.odiso.net/latencybad.png 
>>>>>> 
>>>>>> I observe the latency in my guest vm too, on disks iowait. 
>>>>>> 
>>>>>> http://odisoweb1.odiso.net/latencybadvm.png 
>>>>>> 
>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> here my influxdb queries: 
>>>>>> 
>>>>>> It take op_latency.sum/op_latency.avgcount on last second. 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>>> 
>>>>>> 
>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
>>>>> Thanks. Is there any reason you monitor op_w_latency but not 
>>>>> op_r_latency but instead op_latency? 
>>>>> 
>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? 
>>>>> 
>>>>> greets, 
>>>>> Stefan 
>>>>> 
>>>>>> 
>>>>>> ----- Mail original ----- 
>>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> 
>>>>>> À: "aderumier" <aderumier@xxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx> 
>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>>> Envoyé: Mercredi 30 Janvier 2019 08:45:33 
>>>>>> Objet: Re:  ceph osd commit latency increase over time, until restart 
>>>>>> 
>>>>>> Hi, 
>>>>>> 
>>>>>> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
>>>>>>> Hi, 
>>>>>>> 
>>>>>>> here some new results, 
>>>>>>> different osd/ different cluster 
>>>>>>> 
>>>>>>> before osd restart latency was between 2-5ms 
>>>>>>> after osd restart is around 1-1.5ms 
>>>>>>> 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>>>>>>> 
>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>>>>>>> (I'm using tcmalloc 2.5-2.2) 
>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>> 
>>>>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>>>>> exact values out of the daemon do you use for bluestore? 
>>>>>> 
>>>>>> I would like to check if i see the same behaviour. 
>>>>>> 
>>>>>> Greets, 
>>>>>> Stefan 
>>>>>> 
>>>>>>> ----- Mail original ----- 
>>>>>>> De: "Sage Weil" <sage@xxxxxxxxxxxx> 
>>>>>>> À: "aderumier" <aderumier@xxxxxxxxx> 
>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>>>>>>> Envoyé: Vendredi 25 Janvier 2019 10:49:02 
>>>>>>> Objet: Re: ceph osd commit latency increase over time, until restart 
>>>>>>> 
>>>>>>> Can you capture a perf top or perf record to see where teh CPU time is 
>>>>>>> going on one of the OSDs wth a high latency? 
>>>>>>> 
>>>>>>> Thanks! 
>>>>>>> sage 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, 
>>>>>>>> 
>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup 
>>>>>>>> 
>>>>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
>>>>>>>> 
>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy 
>>>>>>>> values like 20-200ms. 
>>>>>>>> 
>>>>>>>> Some example graphs: 
>>>>>>>> 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>>>>>>> 
>>>>>>>> All osds have this behaviour, in all clusters. 
>>>>>>>> 
>>>>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded) 
>>>>>>>> 
>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. 
>>>>>>>> 
>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ? 
>>>>>>>> 
>>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards, 
>>>>>>>> 
>>>>>>>> Alexandre 
>>>>>>>> 
>>>>>>>> 
>>>>>>> _______________________________________________ 
>>>>>>> ceph-users mailing list 
>>>>>>> ceph-users@xxxxxxxxxxxxxx 
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>>>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> _______________________________________________ 
> ceph-users mailing list 
> ceph-users@xxxxxxxxxxxxxx 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com