>>I think op_w_process_latency includes replication times, not 100% sure though.
>>
>>So restarting other nodes might affect latencies at this specific OSD.

That seems to be the case; I have compared with subop_latency.

I have changed my graph to clearly identify the osd where the latency is high.

I have made some changes to my setup:
- 2 osds per nvme (2x 3TB per osd), with 6GB memory each (instead of 1 osd of 6TB with 12GB memory)
- disabled transparent hugepages

For the last 24h, latencies have stayed low (between 0.7-1.2ms).

I'm also seeing that the total memory used (#free) is lower than before: 48GB (8 osds x 6GB) vs 56GB before (4 osds x 12GB).

I'll send more stats tomorrow.

Alexandre

----- Mail original -----
De: "Igor Fedotov" <ifedotov@xxxxxxx>
À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Wido den Hollander" <wido@xxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Envoyé: Mardi 19 Février 2019 11:12:43
Objet: Re: ceph osd commit latency increase over time, until restart

Hi Alexander,

I think op_w_process_latency includes replication times, not 100% sure though.

So restarting other nodes might affect latencies at this specific OSD.

Thanks,
Igor

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote:
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are
>>> running with memory target on 6G right now to make sure there is no
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD
>>> so it will max out on 80GB leaving 16GB as spare.
> Thanks Wido. I'll send results Monday with my increased memory.
>
> @Igor:
>
> I have also noticed that sometimes I have bad latency on an osd on node1 (restarted 12h ago, for example)
> (op_w_process_latency).
>
> If I restart osds on other nodes (last restart some days ago, so with bigger latency), it reduces the latency on the osd of node1 too.
>
> does the "op_w_process_latency" counter include replication time ?
>
> ----- Mail original -----
> De: "Wido den Hollander" <wido@xxxxxxxx>
> À: "aderumier" <aderumier@xxxxxxxxx>
> Cc: "Igor Fedotov" <ifedotov@xxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Envoyé: Vendredi 15 Février 2019 14:59:30
> Objet: Re: ceph osd commit latency increase over time, until restart
>
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
>>>> OSDs as well. Over time their latency increased until we started to
>>>> notice I/O-wait inside VMs.
>> I also notice it in the VMs. BTW, what is your nvme disk size ?
> Samsung PM983 3.84TB SSDs in both clusters.
>
>>
>>>> A restart fixed it. We also increased memory target from 4G to 6G on
>>>> these OSDs as the memory would allow it.
>> I have set memory to 6GB this morning, with 2 osds of 3TB for the 6TB nvme.
>> (my last test was 8GB with 1 osd of 6TB, but that didn't help)
> There are 10 OSDs in these systems with 96GB of memory in total. We are
> running with memory target on 6G right now to make sure there is no
> leakage. If this runs fine for a longer period we will go to 8GB per OSD
> so it will max out on 80GB leaving 16GB as spare.
>
> As these OSDs were all restarted earlier this week I can't tell how it
> will hold up over a longer period. Monitoring (Zabbix) shows the latency
> is fine at the moment.
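[As a side note, the per-interval averages discussed in this thread (op_w_latency, op_w_process_latency, subop_w_latency) can also be derived directly from two admin-socket samples. Below is a minimal sketch, not the monitoring actually used here (which is InfluxDB/Zabbix); the osd id and the 60s interval are placeholders. It performs the same sum/avgcount delta that the non_negative_derivative queries quoted later in the thread compute per second.]

# Sketch: average OSD write latencies over an interval, from two "perf dump" samples.
# "osd.0" and the 60s sleep are placeholders; run on the host owning the OSD.
import json, subprocess, time

def perf_dump(osd="osd.0"):
    # Query the OSD admin socket via the ceph CLI and parse the JSON reply.
    return json.loads(subprocess.check_output(["ceph", "daemon", osd, "perf", "dump"]))

def avg_latency(before, after, counter):
    # Average seconds per op over the interval: delta(sum) / delta(avgcount).
    a, b = before["osd"][counter], after["osd"][counter]
    ops = b["avgcount"] - a["avgcount"]
    return (b["sum"] - a["sum"]) / ops if ops else 0.0

s1 = perf_dump()
time.sleep(60)
s2 = perf_dump()
for c in ("op_w_latency", "op_w_process_latency", "subop_w_latency"):
    print("%s: %.6f s" % (c, avg_latency(s1, s2, c)))

[Comparing op_w_process_latency on the primary against subop_w_latency on the replicas this way is essentially the comparison described above.]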
> > Wido > >> >> ----- Mail original ----- >> De: "Wido den Hollander" <wido@xxxxxxxx> >> À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Igor Fedotov" <ifedotov@xxxxxxx> >> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >> Envoyé: Vendredi 15 Février 2019 14:50:34 >> Objet: Re: ceph osd commit latency increase over time, until restart >> >> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: >>> Thanks Igor. >>> >>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is different. >>> >>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't see this latency problem. >>> >>> >> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe >> OSDs as well. Over time their latency increased until we started to >> notice I/O-wait inside VMs. >> >> A restart fixed it. We also increased memory target from 4G to 6G on >> these OSDs as the memory would allow it. >> >> But we noticed this on two different 12.2.10/11 clusters. >> >> A restart made the latency drop. Not only the numbers, but the >> real-world latency as experienced by a VM as well. >> >> Wido >> >>> >>> >>> >>> >>> ----- Mail original ----- >>> De: "Igor Fedotov" <ifedotov@xxxxxxx> >>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>> Envoyé: Vendredi 15 Février 2019 13:47:57 >>> Objet: Re: ceph osd commit latency increase over time, until restart >>> >>> Hi Alexander, >>> >>> I've read through your reports, nothing obvious so far. >>> >>> I can only see several times average latency increase for OSD write ops >>> (in seconds) >>> 0.002040060 (first hour) vs. >>> >>> 0.002483516 (last 24 hours) vs. >>> 0.008382087 (last hour) >>> >>> subop_w_latency: >>> 0.000478934 (first hour) vs. >>> 0.000537956 (last 24 hours) vs. >>> 0.003073475 (last hour) >>> >>> and OSD read ops, osd_r_latency: >>> >>> 0.000408595 (first hour) >>> 0.000709031 (24 hours) >>> 0.004979540 (last hour) >>> >>> What's interesting is that such latency differences aren't observed at >>> neither BlueStore level (any _lat params under "bluestore" section) nor >>> rocksdb one. >>> >>> Which probably means that the issue is rather somewhere above BlueStore. >>> >>> Suggest to proceed with perf dumps collection to see if the picture >>> stays the same. >>> >>> W.r.t. memory usage you observed I see nothing suspicious so far - No >>> decrease in RSS report is a known artifact that seems to be safe. >>> >>> Thanks, >>> Igor >>> >>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: >>>> Hi Igor, >>>> >>>> Thanks again for helping ! >>>> >>>> >>>> >>>> I have upgrade to last mimic this weekend, and with new autotune memory, >>>> I have setup osd_memory_target to 8G. 
(my nvme are 6TB) >>>> >>>> >>>> I have done a lot of perf dump and mempool dump and ps of process to >>> see rss memory at different hours, >>>> here the reports for osd.0: >>>> >>>> http://odisoweb1.odiso.net/perfanalysis/ >>>> >>>> >>>> osd has been started the 12-02-2019 at 08:00 >>>> >>>> first report after 1h running >>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt >>>> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt >>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt >>>> >>>> >>>> >>>> report after 24 before counter resets >>>> >>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt >>>> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt >>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt >>>> >>>> report 1h after counter reset >>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt >>>> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt >>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt >>>> >>>> >>>> >>>> >>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G >>> around 12-02-2019 at 14:00 >>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png >>>> Then after that, slowly decreasing. >>>> >>>> >>>> Another strange thing, >>>> I'm seeing total bytes at 5G at 12-02-2018.13:30 >>>> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt >>>> Then is decreasing over time (around 3,7G this morning), but RSS is >>> still at 8G >>>> >>>> I'm graphing mempools counters too since yesterday, so I'll able to >>> track them over time. >>>> ----- Mail original ----- >>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> >>>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx> >>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" >>> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>> Envoyé: Lundi 11 Février 2019 12:03:17 >>>> Objet: Re: ceph osd commit latency increase over time, >>> until restart >>>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: >>>>> another mempool dump after 1h run. (latency ok) >>>>> >>>>> Biggest difference: >>>>> >>>>> before restart >>>>> ------------- >>>>> "bluestore_cache_other": { >>>>> "items": 48661920, >>>>> "bytes": 1539544228 >>>>> }, >>>>> "bluestore_cache_data": { >>>>> "items": 54, >>>>> "bytes": 643072 >>>>> }, >>>>> (other caches seem to be quite low too, like bluestore_cache_other >>> take all the memory) >>>>> >>>>> After restart >>>>> ------------- >>>>> "bluestore_cache_other": { >>>>> "items": 12432298, >>>>> "bytes": 500834899 >>>>> }, >>>>> "bluestore_cache_data": { >>>>> "items": 40084, >>>>> "bytes": 1056235520 >>>>> }, >>>>> >>>> This is fine as cache is warming after restart and some rebalancing >>>> between data and metadata might occur. >>>> >>>> What relates to allocator and most probably to fragmentation growth is : >>>> >>>> "bluestore_alloc": { >>>> "items": 165053952, >>>> "bytes": 165053952 >>>> }, >>>> >>>> which had been higher before the reset (if I got these dumps' order >>>> properly) >>>> >>>> "bluestore_alloc": { >>>> "items": 210243456, >>>> "bytes": 210243456 >>>> }, >>>> >>>> But as I mentioned - I'm not 100% sure this might cause such a huge >>>> latency increase... >>>> >>>> Do you have perf counters dump after the restart? >>>> >>>> Could you collect some more dumps - for both mempool and perf counters? 
>>>> >>>> So ideally I'd like to have: >>>> >>>> 1) mempool/perf counters dumps after the restart (1hour is OK) >>>> >>>> 2) mempool/perf counters dumps in 24+ hours after restart >>>> >>>> 3) reset perf counters after 2), wait for 1 hour (and without OSD >>>> restart) and dump mempool/perf counters again. >>>> >>>> So we'll be able to learn both allocator mem usage growth and operation >>>> latency distribution for the following periods: >>>> >>>> a) 1st hour after restart >>>> >>>> b) 25th hour. >>>> >>>> >>>> Thanks, >>>> >>>> Igor >>>> >>>> >>>>> full mempool dump after restart >>>>> ------------------------------- >>>>> >>>>> { >>>>> "mempool": { >>>>> "by_pool": { >>>>> "bloom_filter": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "bluestore_alloc": { >>>>> "items": 165053952, >>>>> "bytes": 165053952 >>>>> }, >>>>> "bluestore_cache_data": { >>>>> "items": 40084, >>>>> "bytes": 1056235520 >>>>> }, >>>>> "bluestore_cache_onode": { >>>>> "items": 22225, >>>>> "bytes": 14935200 >>>>> }, >>>>> "bluestore_cache_other": { >>>>> "items": 12432298, >>>>> "bytes": 500834899 >>>>> }, >>>>> "bluestore_fsck": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "bluestore_txc": { >>>>> "items": 11, >>>>> "bytes": 8184 >>>>> }, >>>>> "bluestore_writing_deferred": { >>>>> "items": 5047, >>>>> "bytes": 22673736 >>>>> }, >>>>> "bluestore_writing": { >>>>> "items": 91, >>>>> "bytes": 1662976 >>>>> }, >>>>> "bluefs": { >>>>> "items": 1907, >>>>> "bytes": 95600 >>>>> }, >>>>> "buffer_anon": { >>>>> "items": 19664, >>>>> "bytes": 25486050 >>>>> }, >>>>> "buffer_meta": { >>>>> "items": 46189, >>>>> "bytes": 2956096 >>>>> }, >>>>> "osd": { >>>>> "items": 243, >>>>> "bytes": 3089016 >>>>> }, >>>>> "osd_mapbl": { >>>>> "items": 17, >>>>> "bytes": 214366 >>>>> }, >>>>> "osd_pglog": { >>>>> "items": 889673, >>>>> "bytes": 367160400 >>>>> }, >>>>> "osdmap": { >>>>> "items": 3803, >>>>> "bytes": 224552 >>>>> }, >>>>> "osdmap_mapping": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "pgmap": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "mds_co": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "unittest_1": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "unittest_2": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> } >>>>> }, >>>>> "total": { >>>>> "items": 178515204, >>>>> "bytes": 2160630547 >>>>> } >>>>> } >>>>> } >>>>> >>>>> ----- Mail original ----- >>>>> De: "aderumier" <aderumier@xxxxxxxxx> >>>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark >>> Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, >>> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" >>> <ceph-devel@xxxxxxxxxxxxxxx> >>>>> Envoyé: Vendredi 8 Février 2019 16:14:54 >>>>> Objet: Re: ceph osd commit latency increase over time, >>> until restart >>>>> I'm just seeing >>>>> >>>>> StupidAllocator::_aligned_len >>>>> and >>>>> >>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned >>> long, unsigned long, std::less<unsigned long>, mempoo >>>>> on 1 osd, both 10%. 
>>>>> >>>>> here the dump_mempools >>>>> >>>>> { >>>>> "mempool": { >>>>> "by_pool": { >>>>> "bloom_filter": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "bluestore_alloc": { >>>>> "items": 210243456, >>>>> "bytes": 210243456 >>>>> }, >>>>> "bluestore_cache_data": { >>>>> "items": 54, >>>>> "bytes": 643072 >>>>> }, >>>>> "bluestore_cache_onode": { >>>>> "items": 105637, >>>>> "bytes": 70988064 >>>>> }, >>>>> "bluestore_cache_other": { >>>>> "items": 48661920, >>>>> "bytes": 1539544228 >>>>> }, >>>>> "bluestore_fsck": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "bluestore_txc": { >>>>> "items": 12, >>>>> "bytes": 8928 >>>>> }, >>>>> "bluestore_writing_deferred": { >>>>> "items": 406, >>>>> "bytes": 4792868 >>>>> }, >>>>> "bluestore_writing": { >>>>> "items": 66, >>>>> "bytes": 1085440 >>>>> }, >>>>> "bluefs": { >>>>> "items": 1882, >>>>> "bytes": 93600 >>>>> }, >>>>> "buffer_anon": { >>>>> "items": 138986, >>>>> "bytes": 24983701 >>>>> }, >>>>> "buffer_meta": { >>>>> "items": 544, >>>>> "bytes": 34816 >>>>> }, >>>>> "osd": { >>>>> "items": 243, >>>>> "bytes": 3089016 >>>>> }, >>>>> "osd_mapbl": { >>>>> "items": 36, >>>>> "bytes": 179308 >>>>> }, >>>>> "osd_pglog": { >>>>> "items": 952564, >>>>> "bytes": 372459684 >>>>> }, >>>>> "osdmap": { >>>>> "items": 3639, >>>>> "bytes": 224664 >>>>> }, >>>>> "osdmap_mapping": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "pgmap": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "mds_co": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "unittest_1": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> }, >>>>> "unittest_2": { >>>>> "items": 0, >>>>> "bytes": 0 >>>>> } >>>>> }, >>>>> "total": { >>>>> "items": 260109445, >>>>> "bytes": 2228370845 >>>>> } >>>>> } >>>>> } >>>>> >>>>> >>>>> and the perf dump >>>>> >>>>> root@ceph5-2:~# ceph daemon osd.4 perf dump >>>>> { >>>>> "AsyncMessenger::Worker-0": { >>>>> "msgr_recv_messages": 22948570, >>>>> "msgr_send_messages": 22561570, >>>>> "msgr_recv_bytes": 333085080271, >>>>> "msgr_send_bytes": 261798871204, >>>>> "msgr_created_connections": 6152, >>>>> "msgr_active_connections": 2701, >>>>> "msgr_running_total_time": 1055.197867330, >>>>> "msgr_running_send_time": 352.764480121, >>>>> "msgr_running_recv_time": 499.206831955, >>>>> "msgr_running_fast_dispatch_time": 130.982201607 >>>>> }, >>>>> "AsyncMessenger::Worker-1": { >>>>> "msgr_recv_messages": 18801593, >>>>> "msgr_send_messages": 18430264, >>>>> "msgr_recv_bytes": 306871760934, >>>>> "msgr_send_bytes": 192789048666, >>>>> "msgr_created_connections": 5773, >>>>> "msgr_active_connections": 2721, >>>>> "msgr_running_total_time": 816.821076305, >>>>> "msgr_running_send_time": 261.353228926, >>>>> "msgr_running_recv_time": 394.035587911, >>>>> "msgr_running_fast_dispatch_time": 104.012155720 >>>>> }, >>>>> "AsyncMessenger::Worker-2": { >>>>> "msgr_recv_messages": 18463400, >>>>> "msgr_send_messages": 18105856, >>>>> "msgr_recv_bytes": 187425453590, >>>>> "msgr_send_bytes": 220735102555, >>>>> "msgr_created_connections": 5897, >>>>> "msgr_active_connections": 2605, >>>>> "msgr_running_total_time": 807.186854324, >>>>> "msgr_running_send_time": 296.834435839, >>>>> "msgr_running_recv_time": 351.364389691, >>>>> "msgr_running_fast_dispatch_time": 101.215776792 >>>>> }, >>>>> "bluefs": { >>>>> "gift_bytes": 0, >>>>> "reclaim_bytes": 0, >>>>> "db_total_bytes": 256050724864, >>>>> "db_used_bytes": 12413042688, >>>>> "wal_total_bytes": 0, >>>>> "wal_used_bytes": 0, >>>>> "slow_total_bytes": 0, >>>>> "slow_used_bytes": 0, >>>>> "num_files": 209, 
>>>>> "log_bytes": 10383360, >>>>> "log_compactions": 14, >>>>> "logged_bytes": 336498688, >>>>> "files_written_wal": 2, >>>>> "files_written_sst": 4499, >>>>> "bytes_written_wal": 417989099783, >>>>> "bytes_written_sst": 213188750209 >>>>> }, >>>>> "bluestore": { >>>>> "kv_flush_lat": { >>>>> "avgcount": 26371957, >>>>> "sum": 26.734038497, >>>>> "avgtime": 0.000001013 >>>>> }, >>>>> "kv_commit_lat": { >>>>> "avgcount": 26371957, >>>>> "sum": 3397.491150603, >>>>> "avgtime": 0.000128829 >>>>> }, >>>>> "kv_lat": { >>>>> "avgcount": 26371957, >>>>> "sum": 3424.225189100, >>>>> "avgtime": 0.000129843 >>>>> }, >>>>> "state_prepare_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 3689.542105337, >>>>> "avgtime": 0.000121028 >>>>> }, >>>>> "state_aio_wait_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 509.864546111, >>>>> "avgtime": 0.000016725 >>>>> }, >>>>> "state_io_done_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 24.534052953, >>>>> "avgtime": 0.000000804 >>>>> }, >>>>> "state_kv_queued_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 3488.338424238, >>>>> "avgtime": 0.000114428 >>>>> }, >>>>> "state_kv_commiting_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 5660.437003432, >>>>> "avgtime": 0.000185679 >>>>> }, >>>>> "state_kv_done_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 7.763511500, >>>>> "avgtime": 0.000000254 >>>>> }, >>>>> "state_deferred_queued_lat": { >>>>> "avgcount": 26346134, >>>>> "sum": 666071.296856696, >>>>> "avgtime": 0.025281557 >>>>> }, >>>>> "state_deferred_aio_wait_lat": { >>>>> "avgcount": 26346134, >>>>> "sum": 1755.660547071, >>>>> "avgtime": 0.000066638 >>>>> }, >>>>> "state_deferred_cleanup_lat": { >>>>> "avgcount": 26346134, >>>>> "sum": 185465.151653703, >>>>> "avgtime": 0.007039558 >>>>> }, >>>>> "state_finishing_lat": { >>>>> "avgcount": 30484920, >>>>> "sum": 3.046847481, >>>>> "avgtime": 0.000000099 >>>>> }, >>>>> "state_done_lat": { >>>>> "avgcount": 30484920, >>>>> "sum": 13193.362685280, >>>>> "avgtime": 0.000432783 >>>>> }, >>>>> "throttle_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 14.634269979, >>>>> "avgtime": 0.000000480 >>>>> }, >>>>> "submit_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 3873.883076148, >>>>> "avgtime": 0.000127075 >>>>> }, >>>>> "commit_lat": { >>>>> "avgcount": 30484924, >>>>> "sum": 13376.492317331, >>>>> "avgtime": 0.000438790 >>>>> }, >>>>> "read_lat": { >>>>> "avgcount": 5873923, >>>>> "sum": 1817.167582057, >>>>> "avgtime": 0.000309361 >>>>> }, >>>>> "read_onode_meta_lat": { >>>>> "avgcount": 19608201, >>>>> "sum": 146.770464482, >>>>> "avgtime": 0.000007485 >>>>> }, >>>>> "read_wait_aio_lat": { >>>>> "avgcount": 13734278, >>>>> "sum": 2532.578077242, >>>>> "avgtime": 0.000184398 >>>>> }, >>>>> "compress_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "decompress_lat": { >>>>> "avgcount": 1346945, >>>>> "sum": 26.227575896, >>>>> "avgtime": 0.000019471 >>>>> }, >>>>> "csum_lat": { >>>>> "avgcount": 28020392, >>>>> "sum": 149.587819041, >>>>> "avgtime": 0.000005338 >>>>> }, >>>>> "compress_success_count": 0, >>>>> "compress_rejected_count": 0, >>>>> "write_pad_bytes": 352923605, >>>>> "deferred_write_ops": 24373340, >>>>> "deferred_write_bytes": 216791842816, >>>>> "write_penalty_read_ops": 8062366, >>>>> "bluestore_allocated": 3765566013440, >>>>> "bluestore_stored": 4186255221852, >>>>> "bluestore_compressed": 39981379040, >>>>> "bluestore_compressed_allocated": 73748348928, >>>>> "bluestore_compressed_original": 165041381376, >>>>> "bluestore_onodes": 
104232, >>>>> "bluestore_onode_hits": 71206874, >>>>> "bluestore_onode_misses": 1217914, >>>>> "bluestore_onode_shard_hits": 260183292, >>>>> "bluestore_onode_shard_misses": 22851573, >>>>> "bluestore_extents": 3394513, >>>>> "bluestore_blobs": 2773587, >>>>> "bluestore_buffers": 0, >>>>> "bluestore_buffer_bytes": 0, >>>>> "bluestore_buffer_hit_bytes": 62026011221, >>>>> "bluestore_buffer_miss_bytes": 995233669922, >>>>> "bluestore_write_big": 5648815, >>>>> "bluestore_write_big_bytes": 552502214656, >>>>> "bluestore_write_big_blobs": 12440992, >>>>> "bluestore_write_small": 35883770, >>>>> "bluestore_write_small_bytes": 223436965719, >>>>> "bluestore_write_small_unused": 408125, >>>>> "bluestore_write_small_deferred": 34961455, >>>>> "bluestore_write_small_pre_read": 34961455, >>>>> "bluestore_write_small_new": 514190, >>>>> "bluestore_txc": 30484924, >>>>> "bluestore_onode_reshard": 5144189, >>>>> "bluestore_blob_split": 60104, >>>>> "bluestore_extent_compress": 53347252, >>>>> "bluestore_gc_merged": 21142528, >>>>> "bluestore_read_eio": 0, >>>>> "bluestore_fragmentation_micros": 67 >>>>> }, >>>>> "finisher-defered_finisher": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "finisher-finisher-0": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 26625163, >>>>> "sum": 1057.506990951, >>>>> "avgtime": 0.000039718 >>>>> } >>>>> }, >>>>> "finisher-objecter-finisher-0": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.0::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.0::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.1::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.1::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.2::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.2::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.3::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.3::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.4::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.4::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.5::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.5::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.6::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 
0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.6::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.7::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.7::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "objecter": { >>>>> "op_active": 0, >>>>> "op_laggy": 0, >>>>> "op_send": 0, >>>>> "op_send_bytes": 0, >>>>> "op_resend": 0, >>>>> "op_reply": 0, >>>>> "op": 0, >>>>> "op_r": 0, >>>>> "op_w": 0, >>>>> "op_rmw": 0, >>>>> "op_pg": 0, >>>>> "osdop_stat": 0, >>>>> "osdop_create": 0, >>>>> "osdop_read": 0, >>>>> "osdop_write": 0, >>>>> "osdop_writefull": 0, >>>>> "osdop_writesame": 0, >>>>> "osdop_append": 0, >>>>> "osdop_zero": 0, >>>>> "osdop_truncate": 0, >>>>> "osdop_delete": 0, >>>>> "osdop_mapext": 0, >>>>> "osdop_sparse_read": 0, >>>>> "osdop_clonerange": 0, >>>>> "osdop_getxattr": 0, >>>>> "osdop_setxattr": 0, >>>>> "osdop_cmpxattr": 0, >>>>> "osdop_rmxattr": 0, >>>>> "osdop_resetxattrs": 0, >>>>> "osdop_tmap_up": 0, >>>>> "osdop_tmap_put": 0, >>>>> "osdop_tmap_get": 0, >>>>> "osdop_call": 0, >>>>> "osdop_watch": 0, >>>>> "osdop_notify": 0, >>>>> "osdop_src_cmpxattr": 0, >>>>> "osdop_pgls": 0, >>>>> "osdop_pgls_filter": 0, >>>>> "osdop_other": 0, >>>>> "linger_active": 0, >>>>> "linger_send": 0, >>>>> "linger_resend": 0, >>>>> "linger_ping": 0, >>>>> "poolop_active": 0, >>>>> "poolop_send": 0, >>>>> "poolop_resend": 0, >>>>> "poolstat_active": 0, >>>>> "poolstat_send": 0, >>>>> "poolstat_resend": 0, >>>>> "statfs_active": 0, >>>>> "statfs_send": 0, >>>>> "statfs_resend": 0, >>>>> "command_active": 0, >>>>> "command_send": 0, >>>>> "command_resend": 0, >>>>> "map_epoch": 105913, >>>>> "map_full": 0, >>>>> "map_inc": 828, >>>>> "osd_sessions": 0, >>>>> "osd_session_open": 0, >>>>> "osd_session_close": 0, >>>>> "osd_laggy": 0, >>>>> "omap_wr": 0, >>>>> "omap_rd": 0, >>>>> "omap_del": 0 >>>>> }, >>>>> "osd": { >>>>> "op_wip": 0, >>>>> "op": 16758102, >>>>> "op_in_bytes": 238398820586, >>>>> "op_out_bytes": 165484999463, >>>>> "op_latency": { >>>>> "avgcount": 16758102, >>>>> "sum": 38242.481640842, >>>>> "avgtime": 0.002282029 >>>>> }, >>>>> "op_process_latency": { >>>>> "avgcount": 16758102, >>>>> "sum": 28644.906310687, >>>>> "avgtime": 0.001709316 >>>>> }, >>>>> "op_prepare_latency": { >>>>> "avgcount": 16761367, >>>>> "sum": 3489.856599934, >>>>> "avgtime": 0.000208208 >>>>> }, >>>>> "op_r": 6188565, >>>>> "op_r_out_bytes": 165484999463, >>>>> "op_r_latency": { >>>>> "avgcount": 6188565, >>>>> "sum": 4507.365756792, >>>>> "avgtime": 0.000728337 >>>>> }, >>>>> "op_r_process_latency": { >>>>> "avgcount": 6188565, >>>>> "sum": 942.363063429, >>>>> "avgtime": 0.000152274 >>>>> }, >>>>> "op_r_prepare_latency": { >>>>> "avgcount": 6188644, >>>>> "sum": 982.866710389, >>>>> "avgtime": 0.000158817 >>>>> }, >>>>> "op_w": 10546037, >>>>> "op_w_in_bytes": 238334329494, >>>>> "op_w_latency": { >>>>> "avgcount": 10546037, >>>>> "sum": 33160.719998316, >>>>> "avgtime": 0.003144377 >>>>> }, >>>>> "op_w_process_latency": { >>>>> "avgcount": 10546037, >>>>> "sum": 27668.702029030, >>>>> "avgtime": 0.002623611 >>>>> }, >>>>> "op_w_prepare_latency": { >>>>> "avgcount": 10548652, >>>>> "sum": 2499.688609173, >>>>> "avgtime": 0.000236967 >>>>> }, >>>>> "op_rw": 23500, >>>>> 
"op_rw_in_bytes": 64491092, >>>>> "op_rw_out_bytes": 0, >>>>> "op_rw_latency": { >>>>> "avgcount": 23500, >>>>> "sum": 574.395885734, >>>>> "avgtime": 0.024442378 >>>>> }, >>>>> "op_rw_process_latency": { >>>>> "avgcount": 23500, >>>>> "sum": 33.841218228, >>>>> "avgtime": 0.001440051 >>>>> }, >>>>> "op_rw_prepare_latency": { >>>>> "avgcount": 24071, >>>>> "sum": 7.301280372, >>>>> "avgtime": 0.000303322 >>>>> }, >>>>> "op_before_queue_op_lat": { >>>>> "avgcount": 57892986, >>>>> "sum": 1502.117718889, >>>>> "avgtime": 0.000025946 >>>>> }, >>>>> "op_before_dequeue_op_lat": { >>>>> "avgcount": 58091683, >>>>> "sum": 45194.453254037, >>>>> "avgtime": 0.000777984 >>>>> }, >>>>> "subop": 19784758, >>>>> "subop_in_bytes": 547174969754, >>>>> "subop_latency": { >>>>> "avgcount": 19784758, >>>>> "sum": 13019.714424060, >>>>> "avgtime": 0.000658067 >>>>> }, >>>>> "subop_w": 19784758, >>>>> "subop_w_in_bytes": 547174969754, >>>>> "subop_w_latency": { >>>>> "avgcount": 19784758, >>>>> "sum": 13019.714424060, >>>>> "avgtime": 0.000658067 >>>>> }, >>>>> "subop_pull": 0, >>>>> "subop_pull_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "subop_push": 0, >>>>> "subop_push_in_bytes": 0, >>>>> "subop_push_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "pull": 0, >>>>> "push": 2003, >>>>> "push_out_bytes": 5560009728, >>>>> "recovery_ops": 1940, >>>>> "loadavg": 118, >>>>> "buffer_bytes": 0, >>>>> "history_alloc_Mbytes": 0, >>>>> "history_alloc_num": 0, >>>>> "cached_crc": 0, >>>>> "cached_crc_adjusted": 0, >>>>> "missed_crc": 0, >>>>> "numpg": 243, >>>>> "numpg_primary": 82, >>>>> "numpg_replica": 161, >>>>> "numpg_stray": 0, >>>>> "numpg_removing": 0, >>>>> "heartbeat_to_peers": 10, >>>>> "map_messages": 7013, >>>>> "map_message_epochs": 7143, >>>>> "map_message_epoch_dups": 6315, >>>>> "messages_delayed_for_map": 0, >>>>> "osd_map_cache_hit": 203309, >>>>> "osd_map_cache_miss": 33, >>>>> "osd_map_cache_miss_low": 0, >>>>> "osd_map_cache_miss_low_avg": { >>>>> "avgcount": 0, >>>>> "sum": 0 >>>>> }, >>>>> "osd_map_bl_cache_hit": 47012, >>>>> "osd_map_bl_cache_miss": 1681, >>>>> "stat_bytes": 6401248198656, >>>>> "stat_bytes_used": 3777979072512, >>>>> "stat_bytes_avail": 2623269126144, >>>>> "copyfrom": 0, >>>>> "tier_promote": 0, >>>>> "tier_flush": 0, >>>>> "tier_flush_fail": 0, >>>>> "tier_try_flush": 0, >>>>> "tier_try_flush_fail": 0, >>>>> "tier_evict": 0, >>>>> "tier_whiteout": 1631, >>>>> "tier_dirty": 22360, >>>>> "tier_clean": 0, >>>>> "tier_delay": 0, >>>>> "tier_proxy_read": 0, >>>>> "tier_proxy_write": 0, >>>>> "agent_wake": 0, >>>>> "agent_skip": 0, >>>>> "agent_flush": 0, >>>>> "agent_evict": 0, >>>>> "object_ctx_cache_hit": 16311156, >>>>> "object_ctx_cache_total": 17426393, >>>>> "op_cache_hit": 0, >>>>> "osd_tier_flush_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "osd_tier_promote_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "osd_tier_r_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "osd_pg_info": 30483113, >>>>> "osd_pg_fastinfo": 29619885, >>>>> "osd_pg_biginfo": 81703 >>>>> }, >>>>> "recoverystate_perf": { >>>>> "initial_latency": { >>>>> "avgcount": 243, >>>>> "sum": 6.869296500, >>>>> "avgtime": 0.028268709 >>>>> }, >>>>> "started_latency": { >>>>> "avgcount": 1125, >>>>> "sum": 13551384.917335850, >>>>> "avgtime": 
12045.675482076 >>>>> }, >>>>> "reset_latency": { >>>>> "avgcount": 1368, >>>>> "sum": 1101.727799040, >>>>> "avgtime": 0.805356578 >>>>> }, >>>>> "start_latency": { >>>>> "avgcount": 1368, >>>>> "sum": 0.002014799, >>>>> "avgtime": 0.000001472 >>>>> }, >>>>> "primary_latency": { >>>>> "avgcount": 507, >>>>> "sum": 4575560.638823428, >>>>> "avgtime": 9024.774435549 >>>>> }, >>>>> "peering_latency": { >>>>> "avgcount": 550, >>>>> "sum": 499.372283616, >>>>> "avgtime": 0.907949606 >>>>> }, >>>>> "backfilling_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "waitremotebackfillreserved_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "waitlocalbackfillreserved_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "notbackfilling_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "repnotrecovering_latency": { >>>>> "avgcount": 1009, >>>>> "sum": 8975301.082274411, >>>>> "avgtime": 8895.243887288 >>>>> }, >>>>> "repwaitrecoveryreserved_latency": { >>>>> "avgcount": 420, >>>>> "sum": 99.846056520, >>>>> "avgtime": 0.237728706 >>>>> }, >>>>> "repwaitbackfillreserved_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "reprecovering_latency": { >>>>> "avgcount": 420, >>>>> "sum": 241.682764382, >>>>> "avgtime": 0.575435153 >>>>> }, >>>>> "activating_latency": { >>>>> "avgcount": 507, >>>>> "sum": 16.893347339, >>>>> "avgtime": 0.033320211 >>>>> }, >>>>> "waitlocalrecoveryreserved_latency": { >>>>> "avgcount": 199, >>>>> "sum": 672.335512769, >>>>> "avgtime": 3.378570415 >>>>> }, >>>>> "waitremoterecoveryreserved_latency": { >>>>> "avgcount": 199, >>>>> "sum": 213.536439363, >>>>> "avgtime": 1.073047433 >>>>> }, >>>>> "recovering_latency": { >>>>> "avgcount": 199, >>>>> "sum": 79.007696479, >>>>> "avgtime": 0.397023600 >>>>> }, >>>>> "recovered_latency": { >>>>> "avgcount": 507, >>>>> "sum": 14.000732748, >>>>> "avgtime": 0.027614857 >>>>> }, >>>>> "clean_latency": { >>>>> "avgcount": 395, >>>>> "sum": 4574325.900371083, >>>>> "avgtime": 11580.571899673 >>>>> }, >>>>> "active_latency": { >>>>> "avgcount": 425, >>>>> "sum": 4575107.630123680, >>>>> "avgtime": 10764.959129702 >>>>> }, >>>>> "replicaactive_latency": { >>>>> "avgcount": 589, >>>>> "sum": 8975184.499049954, >>>>> "avgtime": 15238.004242869 >>>>> }, >>>>> "stray_latency": { >>>>> "avgcount": 818, >>>>> "sum": 800.729455666, >>>>> "avgtime": 0.978886865 >>>>> }, >>>>> "getinfo_latency": { >>>>> "avgcount": 550, >>>>> "sum": 15.085667048, >>>>> "avgtime": 0.027428485 >>>>> }, >>>>> "getlog_latency": { >>>>> "avgcount": 546, >>>>> "sum": 3.482175693, >>>>> "avgtime": 0.006377611 >>>>> }, >>>>> "waitactingchange_latency": { >>>>> "avgcount": 39, >>>>> "sum": 35.444551284, >>>>> "avgtime": 0.908834648 >>>>> }, >>>>> "incomplete_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "down_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "getmissing_latency": { >>>>> "avgcount": 507, >>>>> "sum": 6.702129624, >>>>> "avgtime": 0.013219190 >>>>> }, >>>>> "waitupthru_latency": { >>>>> "avgcount": 507, >>>>> "sum": 474.098261727, >>>>> "avgtime": 0.935105052 >>>>> }, >>>>> "notrecovering_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, 
>>>>> "rocksdb": { >>>>> "get": 28320977, >>>>> "submit_transaction": 30484924, >>>>> "submit_transaction_sync": 26371957, >>>>> "get_latency": { >>>>> "avgcount": 28320977, >>>>> "sum": 325.900908733, >>>>> "avgtime": 0.000011507 >>>>> }, >>>>> "submit_latency": { >>>>> "avgcount": 30484924, >>>>> "sum": 1835.888692371, >>>>> "avgtime": 0.000060222 >>>>> }, >>>>> "submit_sync_latency": { >>>>> "avgcount": 26371957, >>>>> "sum": 1431.555230628, >>>>> "avgtime": 0.000054283 >>>>> }, >>>>> "compact": 0, >>>>> "compact_range": 0, >>>>> "compact_queue_merge": 0, >>>>> "compact_queue_len": 0, >>>>> "rocksdb_write_wal_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "rocksdb_write_memtable_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "rocksdb_write_delay_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "rocksdb_write_pre_and_post_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> } >>>>> } >>>>> >>>>> ----- Mail original ----- >>>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>> À: "aderumier" <aderumier@xxxxxxxxx> >>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark >>> Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, >>> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" >>> <ceph-devel@xxxxxxxxxxxxxxx> >>>>> Envoyé: Mardi 5 Février 2019 18:56:51 >>>>> Objet: Re: ceph osd commit latency increase over time, >>> until restart >>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: >>>>>>>> but I don't see l_bluestore_fragmentation counter. >>>>>>>> (but I have bluestore_fragmentation_micros) >>>>>> ok, this is the same >>>>>> >>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", >>>>>> "How fragmented bluestore free space is (free extents / max >>> possible number of free extents) * 1000"); >>>>>> >>>>>> Here a graph on last month, with bluestore_fragmentation_micros and >>> latency, >>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png >>>>> hmm, so fragmentation grows eventually and drops on OSD restarts, isn't >>>>> it? The same for other OSDs? >>>>> >>>>> This proves some issue with the allocator - generally fragmentation >>>>> might grow but it shouldn't reset on restart. Looks like some intervals >>>>> aren't properly merged in run-time. >>>>> >>>>> On the other side I'm not completely sure that latency degradation is >>>>> caused by that - fragmentation growth is relatively small - I don't see >>>>> how this might impact performance that high. >>>>> >>>>> Wondering if you have OSD mempool monitoring (dump_mempools command >>>>> output on admin socket) reports? Do you have any historic data? >>>>> >>>>> If not may I have current output and say a couple more samples with >>>>> 8-12 hours interval? >>>>> >>>>> >>>>> Wrt to backporting bitmap allocator to mimic - we haven't had such >>> plans >>>>> before that but I'll discuss this at BlueStore meeting shortly. 
>>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Igor >>>>> >>>>>> ----- Mail original ----- >>>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> >>>>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark >>> Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, >>> "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" >>> <ceph-devel@xxxxxxxxxxxxxxx> >>>>>> Envoyé: Lundi 4 Février 2019 16:04:38 >>>>>> Objet: Re: ceph osd commit latency increase over time, >>> until restart >>>>>> Thanks Igor, >>>>>> >>>>>>>> Could you please collect BlueStore performance counters right >>> after OSD >>>>>>>> startup and once you get high latency. >>>>>>>> >>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. >>>>>> I'm already monitoring with >>>>>> "ceph daemon osd.x perf dump ", (I have 2months history will all >>> counters) >>>>>> but I don't see l_bluestore_fragmentation counter. >>>>>> >>>>>> (but I have bluestore_fragmentation_micros) >>>>>> >>>>>> >>>>>>>> Also if you're able to rebuild the code I can probably make a simple >>>>>>>> patch to track latency and some other internal allocator's >>> paramter to >>>>>>>> make sure it's degraded and learn more details. >>>>>> Sorry, It's a critical production cluster, I can't test on it :( >>>>>> But I have a test cluster, maybe I can try to put some load on it, >>> and try to reproduce. >>>>>> >>>>>> >>>>>>>> More vigorous fix would be to backport bitmap allocator from >>> Nautilus >>>>>>>> and try the difference... >>>>>> Any plan to backport it to mimic ? (But I can wait for Nautilus) >>>>>> perf results of new bitmap allocator seem very promising from what >>> I've seen in PR. >>>>>> >>>>>> >>>>>> ----- Mail original ----- >>>>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Stefan Priebe, >>> Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx> >>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" >>> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>> Envoyé: Lundi 4 Février 2019 15:51:30 >>>>>> Objet: Re: ceph osd commit latency increase over time, >>> until restart >>>>>> Hi Alexandre, >>>>>> >>>>>> looks like a bug in StupidAllocator. >>>>>> >>>>>> Could you please collect BlueStore performance counters right after >>> OSD >>>>>> startup and once you get high latency. >>>>>> >>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. >>>>>> >>>>>> Also if you're able to rebuild the code I can probably make a simple >>>>>> patch to track latency and some other internal allocator's paramter to >>>>>> make sure it's degraded and learn more details. >>>>>> >>>>>> >>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus >>>>>> and try the difference... >>>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Igor >>>>>> >>>>>> >>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: >>>>>>> Hi again, >>>>>>> >>>>>>> I speak too fast, the problem has occured again, so it's not >>> tcmalloc cache size related. 
>>>>>>> >>>>>>> I have notice something using a simple "perf top", >>>>>>> >>>>>>> each time I have this problem (I have seen exactly 4 times the >>> same behaviour), >>>>>>> when latency is bad, perf top give me : >>>>>>> >>>>>>> StupidAllocator::_aligned_len >>>>>>> and >>>>>>> >>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned >>> long, unsigned long, std::less<unsigned long>, mempoo >>>>>>> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned >>> long const, unsigned long> >, 256> >, std::pair<unsigned long const, >>> unsigned long>&, std::pair<unsigned long >>>>>>> const, unsigned long>*>::increment_slow() >>>>>>> >>>>>>> (around 10-20% time for both) >>>>>>> >>>>>>> >>>>>>> when latency is good, I don't see them at all. >>>>>>> >>>>>>> >>>>>>> I have used the Mark wallclock profiler, here the results: >>>>>>> >>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt >>>>>>> >>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt >>>>>>> >>>>>>> >>>>>>> here an extract of the thread with btree::btree_iterator && >>> StupidAllocator::_aligned_len >>>>>>> >>>>>>> + 100.00% clone >>>>>>> + 100.00% start_thread >>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() >>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) >>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, >>> ceph::heartbeat_handle_d*) >>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, >>> ThreadPool::TPHandle&) >>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, >>> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) >>>>>>> | + 70.00% >>> PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, >>> ThreadPool::TPHandle&) >>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) >>>>>>> | | + 68.00% >>> ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) >>>>>>> | | + 68.00% >>> ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) >>>>>>> | | + 67.00% non-virtual thunk to >>> PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, >>> std::allocator<ObjectStore::Transaction> >&, >>> boost::intrusive_ptr<OpRequest>) >>>>>>> | | | + 67.00% >>> BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, >>> std::vector<ObjectStore::Transaction, >>> std::allocator<ObjectStore::Transaction> >&, >>> boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) >>>>>>> | | | + 66.00% >>> BlueStore::_txc_add_transaction(BlueStore::TransContext*, >>> ObjectStore::Transaction*) >>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, >>> boost::intrusive_ptr<BlueStore::Collection>&, >>> boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, >>> ceph::buffer::list&, unsigned int) >>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, >>> boost::intrusive_ptr<BlueStore::Collection>&, >>> boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, >>> ceph::buffer::list&, unsigned int) >>>>>>> | | | | + 65.00% >>> BlueStore::_do_alloc_write(BlueStore::TransContext*, >>> boost::intrusive_ptr<BlueStore::Collection>, >>> boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) >>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, >>> unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, >>> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) >>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, >>> unsigned long, long, unsigned long*, 
unsigned int*) >>>>>>> | | | | | | + 34.00% >>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned >>> long, unsigned long, std::less<unsigned long>, >>> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned >>> long const, unsigned long> >, 256> >, std::pair<unsigned long const, >>> unsigned long>&, std::pair<unsigned long const, unsigned >>> long>*>::increment_slow() >>>>>>> | | | | | | + 26.00% >>> StupidAllocator::_aligned_len(interval_set<unsigned long, >>> btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, >>> mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned >>> long const, unsigned long> >, 256> >::iterator, unsigned long) >>>>>>> >>>>>>> >>>>>>> ----- Mail original ----- >>>>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> >>>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" >>> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>> Envoyé: Lundi 4 Février 2019 09:38:11 >>>>>>> Objet: Re: ceph osd commit latency increase over >>> time, until restart >>>>>>> Hi, >>>>>>> >>>>>>> some news: >>>>>>> >>>>>>> I have tried with different transparent hugepage values (madvise, >>> never) : no change >>>>>>> I have tried to increase bluestore_cache_size_ssd to 8G: no change >>>>>>> >>>>>>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to >>> 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait >>> some more days to be sure) >>>>>>> >>>>>>> Note that this behaviour seem to happen really faster (< 2 days) >>> on my big nvme drives (6TB), >>>>>>> my others clusters user 1,6TB ssd. >>>>>>> >>>>>>> Currently I'm using only 1 osd by nvme (I don't have more than >>> 5000iops by osd), but I'll try this week with 2osd by nvme, to see if >>> it's helping. >>>>>>> >>>>>>> BTW, does somebody have already tested ceph without tcmalloc, with >>> glibc >= 2.26 (which have also thread cache) ? >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Alexandre >>>>>>> >>>>>>> >>>>>>> ----- Mail original ----- >>>>>>> De: "aderumier" <aderumier@xxxxxxxxx> >>>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" >>> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>> Envoyé: Mercredi 30 Janvier 2019 19:58:15 >>>>>>> Objet: Re: ceph osd commit latency increase over >>> time, until restart >>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not >>>>>>>>> op_r_latency but instead op_latency? >>>>>>>>> >>>>>>>>> Also why do you monitor op_w_process_latency? but not >>> op_r_process_latency? >>>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot >>> of graphs). >>>>>>> I just don't see latency difference on reads. 
(or they are very >>> very small vs the write latency increase) >>>>>>> >>>>>>> >>>>>>> ----- Mail original ----- >>>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>> À: "aderumier" <aderumier@xxxxxxxxx> >>>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" >>> <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>> Envoyé: Mercredi 30 Janvier 2019 19:50:20 >>>>>>> Objet: Re: ceph osd commit latency increase over >>> time, until restart >>>>>>> Hi, >>>>>>> >>>>>>> Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: >>>>>>>> Hi Stefan, >>>>>>>> >>>>>>>>>> currently i'm in the process of switching back from jemalloc to >>> tcmalloc >>>>>>>>>> like suggested. This report makes me a little nervous about my >>> change. >>>>>>>> Well,I'm really not sure that it's a tcmalloc bug. >>>>>>>> maybe bluestore related (don't have filestore anymore to compare) >>>>>>>> I need to compare with bigger latencies >>>>>>>> >>>>>>>> here an example, when all osd at 20-50ms before restart, then >>> after restart (at 21:15), 1ms >>>>>>>> http://odisoweb1.odiso.net/latencybad.png >>>>>>>> >>>>>>>> I observe the latency in my guest vm too, on disks iowait. >>>>>>>> >>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png >>>>>>>> >>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. >>> Which >>>>>>>>>> exact values out of the daemon do you use for bluestore? >>>>>>>> here my influxdb queries: >>>>>>>> >>>>>>>> It take op_latency.sum/op_latency.avgcount on last second. >>>>>>>> >>>>>>>> >>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), >>> 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" >>> WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter >>> GROUP BY time($interval), "host", "id" fill(previous) >>>>>>>> >>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), >>> 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM >>> "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ >>> /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" >>> fill(previous) >>>>>>>> >>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), >>> 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) >>> FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" >>> =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" >>> fill(previous) >>>>>>> Thanks. Is there any reason you monitor op_w_latency but not >>>>>>> op_r_latency but instead op_latency? >>>>>>> >>>>>>> Also why do you monitor op_w_process_latency? but not >>> op_r_process_latency? 
>>>>>>> greets, >>>>>>> Stefan >>>>>>> >>>>>>>> ----- Mail original ----- >>>>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>>> À: "aderumier" <aderumier@xxxxxxxxx>, "Sage Weil" >>> <sage@xxxxxxxxxxxx> >>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" >>> <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>>> Envoyé: Mercredi 30 Janvier 2019 08:45:33 >>>>>>>> Objet: Re: ceph osd commit latency increase over >>> time, until restart >>>>>>>> Hi, >>>>>>>> >>>>>>>> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> here some new results, >>>>>>>>> different osd/ different cluster >>>>>>>>> >>>>>>>>> before osd restart latency was between 2-5ms >>>>>>>>> after osd restart is around 1-1.5ms >>>>>>>>> >>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) >>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) >>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt >>>>>>>>> >>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, >>> but maybe I'm wrong. >>>>>>>>> (I'm using tcmalloc 2.5-2.2) >>>>>>>> currently i'm in the process of switching back from jemalloc to >>> tcmalloc >>>>>>>> like suggested. This report makes me a little nervous about my >>> change. >>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which >>>>>>>> exact values out of the daemon do you use for bluestore? >>>>>>>> >>>>>>>> I would like to check if i see the same behaviour. >>>>>>>> >>>>>>>> Greets, >>>>>>>> Stefan >>>>>>>> >>>>>>>>> ----- Mail original ----- >>>>>>>>> De: "Sage Weil" <sage@xxxxxxxxxxxx> >>>>>>>>> À: "aderumier" <aderumier@xxxxxxxxx> >>>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" >>> <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>>>> Envoyé: Vendredi 25 Janvier 2019 10:49:02 >>>>>>>>> Objet: Re: ceph osd commit latency increase over time, until >>> restart >>>>>>>>> Can you capture a perf top or perf record to see where teh CPU >>> time is >>>>>>>>> going on one of the OSDs wth a high latency? >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> sage >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, >>>>>>>>>> >>>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or >>> nvme drivers, >>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + >>> snapshot/rbd export-diff/snapshotdelete each day for backup >>>>>>>>>> When the osd are refreshly started, the commit latency is >>> between 0,5-1ms. >>>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by >>> day), until reaching crazy >>>>>>>>>> values like 20-200ms. >>>>>>>>>> >>>>>>>>>> Some example graphs: >>>>>>>>>> >>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png >>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png >>>>>>>>>> >>>>>>>>>> All osds have this behaviour, in all clusters. >>>>>>>>>> >>>>>>>>>> The latency of physical disks is ok. (Clusters are far to be >>> full loaded) >>>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. >>>>>>>>>> >>>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a >>> bluestore memory bug ? >>>>>>>>>> Any Hints for counters/logs to check ? 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Alexandre >>>>>>>>>> >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list >>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
"bluestore_compressed": 39981379040, >>>>> "bluestore_compressed_allocated": 73748348928, >>>>> "bluestore_compressed_original": 165041381376, >>>>> "bluestore_onodes": 104232, >>>>> "bluestore_onode_hits": 71206874, >>>>> "bluestore_onode_misses": 1217914, >>>>> "bluestore_onode_shard_hits": 260183292, >>>>> "bluestore_onode_shard_misses": 22851573, >>>>> "bluestore_extents": 3394513, >>>>> "bluestore_blobs": 2773587, >>>>> "bluestore_buffers": 0, >>>>> "bluestore_buffer_bytes": 0, >>>>> "bluestore_buffer_hit_bytes": 62026011221, >>>>> "bluestore_buffer_miss_bytes": 995233669922, >>>>> "bluestore_write_big": 5648815, >>>>> "bluestore_write_big_bytes": 552502214656, >>>>> "bluestore_write_big_blobs": 12440992, >>>>> "bluestore_write_small": 35883770, >>>>> "bluestore_write_small_bytes": 223436965719, >>>>> "bluestore_write_small_unused": 408125, >>>>> "bluestore_write_small_deferred": 34961455, >>>>> "bluestore_write_small_pre_read": 34961455, >>>>> "bluestore_write_small_new": 514190, >>>>> "bluestore_txc": 30484924, >>>>> "bluestore_onode_reshard": 5144189, >>>>> "bluestore_blob_split": 60104, >>>>> "bluestore_extent_compress": 53347252, >>>>> "bluestore_gc_merged": 21142528, >>>>> "bluestore_read_eio": 0, >>>>> "bluestore_fragmentation_micros": 67 >>>>> }, >>>>> "finisher-defered_finisher": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "finisher-finisher-0": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 26625163, >>>>> "sum": 1057.506990951, >>>>> "avgtime": 0.000039718 >>>>> } >>>>> }, >>>>> "finisher-objecter-finisher-0": { >>>>> "queue_len": 0, >>>>> "complete_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.0::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.0::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.1::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.1::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.2::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.2::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.3::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.3::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.4::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.4::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.5::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.5::shard_lock": { >>>>> "wait": { 
>>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.6::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.6::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.7::sdata_wait_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "mutex-OSDShard.7::shard_lock": { >>>>> "wait": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "objecter": { >>>>> "op_active": 0, >>>>> "op_laggy": 0, >>>>> "op_send": 0, >>>>> "op_send_bytes": 0, >>>>> "op_resend": 0, >>>>> "op_reply": 0, >>>>> "op": 0, >>>>> "op_r": 0, >>>>> "op_w": 0, >>>>> "op_rmw": 0, >>>>> "op_pg": 0, >>>>> "osdop_stat": 0, >>>>> "osdop_create": 0, >>>>> "osdop_read": 0, >>>>> "osdop_write": 0, >>>>> "osdop_writefull": 0, >>>>> "osdop_writesame": 0, >>>>> "osdop_append": 0, >>>>> "osdop_zero": 0, >>>>> "osdop_truncate": 0, >>>>> "osdop_delete": 0, >>>>> "osdop_mapext": 0, >>>>> "osdop_sparse_read": 0, >>>>> "osdop_clonerange": 0, >>>>> "osdop_getxattr": 0, >>>>> "osdop_setxattr": 0, >>>>> "osdop_cmpxattr": 0, >>>>> "osdop_rmxattr": 0, >>>>> "osdop_resetxattrs": 0, >>>>> "osdop_tmap_up": 0, >>>>> "osdop_tmap_put": 0, >>>>> "osdop_tmap_get": 0, >>>>> "osdop_call": 0, >>>>> "osdop_watch": 0, >>>>> "osdop_notify": 0, >>>>> "osdop_src_cmpxattr": 0, >>>>> "osdop_pgls": 0, >>>>> "osdop_pgls_filter": 0, >>>>> "osdop_other": 0, >>>>> "linger_active": 0, >>>>> "linger_send": 0, >>>>> "linger_resend": 0, >>>>> "linger_ping": 0, >>>>> "poolop_active": 0, >>>>> "poolop_send": 0, >>>>> "poolop_resend": 0, >>>>> "poolstat_active": 0, >>>>> "poolstat_send": 0, >>>>> "poolstat_resend": 0, >>>>> "statfs_active": 0, >>>>> "statfs_send": 0, >>>>> "statfs_resend": 0, >>>>> "command_active": 0, >>>>> "command_send": 0, >>>>> "command_resend": 0, >>>>> "map_epoch": 105913, >>>>> "map_full": 0, >>>>> "map_inc": 828, >>>>> "osd_sessions": 0, >>>>> "osd_session_open": 0, >>>>> "osd_session_close": 0, >>>>> "osd_laggy": 0, >>>>> "omap_wr": 0, >>>>> "omap_rd": 0, >>>>> "omap_del": 0 >>>>> }, >>>>> "osd": { >>>>> "op_wip": 0, >>>>> "op": 16758102, >>>>> "op_in_bytes": 238398820586, >>>>> "op_out_bytes": 165484999463, >>>>> "op_latency": { >>>>> "avgcount": 16758102, >>>>> "sum": 38242.481640842, >>>>> "avgtime": 0.002282029 >>>>> }, >>>>> "op_process_latency": { >>>>> "avgcount": 16758102, >>>>> "sum": 28644.906310687, >>>>> "avgtime": 0.001709316 >>>>> }, >>>>> "op_prepare_latency": { >>>>> "avgcount": 16761367, >>>>> "sum": 3489.856599934, >>>>> "avgtime": 0.000208208 >>>>> }, >>>>> "op_r": 6188565, >>>>> "op_r_out_bytes": 165484999463, >>>>> "op_r_latency": { >>>>> "avgcount": 6188565, >>>>> "sum": 4507.365756792, >>>>> "avgtime": 0.000728337 >>>>> }, >>>>> "op_r_process_latency": { >>>>> "avgcount": 6188565, >>>>> "sum": 942.363063429, >>>>> "avgtime": 0.000152274 >>>>> }, >>>>> "op_r_prepare_latency": { >>>>> "avgcount": 6188644, >>>>> "sum": 982.866710389, >>>>> "avgtime": 0.000158817 >>>>> }, >>>>> "op_w": 10546037, >>>>> "op_w_in_bytes": 238334329494, >>>>> "op_w_latency": { >>>>> "avgcount": 10546037, >>>>> "sum": 33160.719998316, >>>>> "avgtime": 0.003144377 >>>>> }, >>>>> "op_w_process_latency": { >>>>> "avgcount": 10546037, >>>>> "sum": 27668.702029030, >>>>> "avgtime": 0.002623611 
>>>>> }, >>>>> "op_w_prepare_latency": { >>>>> "avgcount": 10548652, >>>>> "sum": 2499.688609173, >>>>> "avgtime": 0.000236967 >>>>> }, >>>>> "op_rw": 23500, >>>>> "op_rw_in_bytes": 64491092, >>>>> "op_rw_out_bytes": 0, >>>>> "op_rw_latency": { >>>>> "avgcount": 23500, >>>>> "sum": 574.395885734, >>>>> "avgtime": 0.024442378 >>>>> }, >>>>> "op_rw_process_latency": { >>>>> "avgcount": 23500, >>>>> "sum": 33.841218228, >>>>> "avgtime": 0.001440051 >>>>> }, >>>>> "op_rw_prepare_latency": { >>>>> "avgcount": 24071, >>>>> "sum": 7.301280372, >>>>> "avgtime": 0.000303322 >>>>> }, >>>>> "op_before_queue_op_lat": { >>>>> "avgcount": 57892986, >>>>> "sum": 1502.117718889, >>>>> "avgtime": 0.000025946 >>>>> }, >>>>> "op_before_dequeue_op_lat": { >>>>> "avgcount": 58091683, >>>>> "sum": 45194.453254037, >>>>> "avgtime": 0.000777984 >>>>> }, >>>>> "subop": 19784758, >>>>> "subop_in_bytes": 547174969754, >>>>> "subop_latency": { >>>>> "avgcount": 19784758, >>>>> "sum": 13019.714424060, >>>>> "avgtime": 0.000658067 >>>>> }, >>>>> "subop_w": 19784758, >>>>> "subop_w_in_bytes": 547174969754, >>>>> "subop_w_latency": { >>>>> "avgcount": 19784758, >>>>> "sum": 13019.714424060, >>>>> "avgtime": 0.000658067 >>>>> }, >>>>> "subop_pull": 0, >>>>> "subop_pull_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "subop_push": 0, >>>>> "subop_push_in_bytes": 0, >>>>> "subop_push_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "pull": 0, >>>>> "push": 2003, >>>>> "push_out_bytes": 5560009728, >>>>> "recovery_ops": 1940, >>>>> "loadavg": 118, >>>>> "buffer_bytes": 0, >>>>> "history_alloc_Mbytes": 0, >>>>> "history_alloc_num": 0, >>>>> "cached_crc": 0, >>>>> "cached_crc_adjusted": 0, >>>>> "missed_crc": 0, >>>>> "numpg": 243, >>>>> "numpg_primary": 82, >>>>> "numpg_replica": 161, >>>>> "numpg_stray": 0, >>>>> "numpg_removing": 0, >>>>> "heartbeat_to_peers": 10, >>>>> "map_messages": 7013, >>>>> "map_message_epochs": 7143, >>>>> "map_message_epoch_dups": 6315, >>>>> "messages_delayed_for_map": 0, >>>>> "osd_map_cache_hit": 203309, >>>>> "osd_map_cache_miss": 33, >>>>> "osd_map_cache_miss_low": 0, >>>>> "osd_map_cache_miss_low_avg": { >>>>> "avgcount": 0, >>>>> "sum": 0 >>>>> }, >>>>> "osd_map_bl_cache_hit": 47012, >>>>> "osd_map_bl_cache_miss": 1681, >>>>> "stat_bytes": 6401248198656, >>>>> "stat_bytes_used": 3777979072512, >>>>> "stat_bytes_avail": 2623269126144, >>>>> "copyfrom": 0, >>>>> "tier_promote": 0, >>>>> "tier_flush": 0, >>>>> "tier_flush_fail": 0, >>>>> "tier_try_flush": 0, >>>>> "tier_try_flush_fail": 0, >>>>> "tier_evict": 0, >>>>> "tier_whiteout": 1631, >>>>> "tier_dirty": 22360, >>>>> "tier_clean": 0, >>>>> "tier_delay": 0, >>>>> "tier_proxy_read": 0, >>>>> "tier_proxy_write": 0, >>>>> "agent_wake": 0, >>>>> "agent_skip": 0, >>>>> "agent_flush": 0, >>>>> "agent_evict": 0, >>>>> "object_ctx_cache_hit": 16311156, >>>>> "object_ctx_cache_total": 17426393, >>>>> "op_cache_hit": 0, >>>>> "osd_tier_flush_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "osd_tier_promote_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "osd_tier_r_lat": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "osd_pg_info": 30483113, >>>>> "osd_pg_fastinfo": 29619885, >>>>> "osd_pg_biginfo": 81703 >>>>> }, >>>>> "recoverystate_perf": { >>>>> "initial_latency": { >>>>> "avgcount": 243, 
>>>>> "sum": 6.869296500, >>>>> "avgtime": 0.028268709 >>>>> }, >>>>> "started_latency": { >>>>> "avgcount": 1125, >>>>> "sum": 13551384.917335850, >>>>> "avgtime": 12045.675482076 >>>>> }, >>>>> "reset_latency": { >>>>> "avgcount": 1368, >>>>> "sum": 1101.727799040, >>>>> "avgtime": 0.805356578 >>>>> }, >>>>> "start_latency": { >>>>> "avgcount": 1368, >>>>> "sum": 0.002014799, >>>>> "avgtime": 0.000001472 >>>>> }, >>>>> "primary_latency": { >>>>> "avgcount": 507, >>>>> "sum": 4575560.638823428, >>>>> "avgtime": 9024.774435549 >>>>> }, >>>>> "peering_latency": { >>>>> "avgcount": 550, >>>>> "sum": 499.372283616, >>>>> "avgtime": 0.907949606 >>>>> }, >>>>> "backfilling_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "waitremotebackfillreserved_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "waitlocalbackfillreserved_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "notbackfilling_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "repnotrecovering_latency": { >>>>> "avgcount": 1009, >>>>> "sum": 8975301.082274411, >>>>> "avgtime": 8895.243887288 >>>>> }, >>>>> "repwaitrecoveryreserved_latency": { >>>>> "avgcount": 420, >>>>> "sum": 99.846056520, >>>>> "avgtime": 0.237728706 >>>>> }, >>>>> "repwaitbackfillreserved_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "reprecovering_latency": { >>>>> "avgcount": 420, >>>>> "sum": 241.682764382, >>>>> "avgtime": 0.575435153 >>>>> }, >>>>> "activating_latency": { >>>>> "avgcount": 507, >>>>> "sum": 16.893347339, >>>>> "avgtime": 0.033320211 >>>>> }, >>>>> "waitlocalrecoveryreserved_latency": { >>>>> "avgcount": 199, >>>>> "sum": 672.335512769, >>>>> "avgtime": 3.378570415 >>>>> }, >>>>> "waitremoterecoveryreserved_latency": { >>>>> "avgcount": 199, >>>>> "sum": 213.536439363, >>>>> "avgtime": 1.073047433 >>>>> }, >>>>> "recovering_latency": { >>>>> "avgcount": 199, >>>>> "sum": 79.007696479, >>>>> "avgtime": 0.397023600 >>>>> }, >>>>> "recovered_latency": { >>>>> "avgcount": 507, >>>>> "sum": 14.000732748, >>>>> "avgtime": 0.027614857 >>>>> }, >>>>> "clean_latency": { >>>>> "avgcount": 395, >>>>> "sum": 4574325.900371083, >>>>> "avgtime": 11580.571899673 >>>>> }, >>>>> "active_latency": { >>>>> "avgcount": 425, >>>>> "sum": 4575107.630123680, >>>>> "avgtime": 10764.959129702 >>>>> }, >>>>> "replicaactive_latency": { >>>>> "avgcount": 589, >>>>> "sum": 8975184.499049954, >>>>> "avgtime": 15238.004242869 >>>>> }, >>>>> "stray_latency": { >>>>> "avgcount": 818, >>>>> "sum": 800.729455666, >>>>> "avgtime": 0.978886865 >>>>> }, >>>>> "getinfo_latency": { >>>>> "avgcount": 550, >>>>> "sum": 15.085667048, >>>>> "avgtime": 0.027428485 >>>>> }, >>>>> "getlog_latency": { >>>>> "avgcount": 546, >>>>> "sum": 3.482175693, >>>>> "avgtime": 0.006377611 >>>>> }, >>>>> "waitactingchange_latency": { >>>>> "avgcount": 39, >>>>> "sum": 35.444551284, >>>>> "avgtime": 0.908834648 >>>>> }, >>>>> "incomplete_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "down_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "getmissing_latency": { >>>>> "avgcount": 507, >>>>> "sum": 6.702129624, >>>>> "avgtime": 0.013219190 >>>>> }, >>>>> "waitupthru_latency": { >>>>> "avgcount": 507, >>>>> "sum": 474.098261727, 
>>>>> "avgtime": 0.935105052 >>>>> }, >>>>> "notrecovering_latency": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> }, >>>>> "rocksdb": { >>>>> "get": 28320977, >>>>> "submit_transaction": 30484924, >>>>> "submit_transaction_sync": 26371957, >>>>> "get_latency": { >>>>> "avgcount": 28320977, >>>>> "sum": 325.900908733, >>>>> "avgtime": 0.000011507 >>>>> }, >>>>> "submit_latency": { >>>>> "avgcount": 30484924, >>>>> "sum": 1835.888692371, >>>>> "avgtime": 0.000060222 >>>>> }, >>>>> "submit_sync_latency": { >>>>> "avgcount": 26371957, >>>>> "sum": 1431.555230628, >>>>> "avgtime": 0.000054283 >>>>> }, >>>>> "compact": 0, >>>>> "compact_range": 0, >>>>> "compact_queue_merge": 0, >>>>> "compact_queue_len": 0, >>>>> "rocksdb_write_wal_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "rocksdb_write_memtable_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "rocksdb_write_delay_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> }, >>>>> "rocksdb_write_pre_and_post_time": { >>>>> "avgcount": 0, >>>>> "sum": 0.000000000, >>>>> "avgtime": 0.000000000 >>>>> } >>>>> } >>>>> } >>>>> >>>>> ----- Mail original ----- >>>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>> À: "aderumier" <aderumier@xxxxxxxxx> >>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>> Envoyé: Mardi 5 Février 2019 18:56:51 >>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>> >>>>> On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: >>>>>>>> but I don't see l_bluestore_fragmentation counter. >>>>>>>> (but I have bluestore_fragmentation_micros) >>>>>> ok, this is the same >>>>>> >>>>>> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", >>>>>> "How fragmented bluestore free space is (free extents / max possible number of free extents) * 1000"); >>>>>> >>>>>> >>>>>> Here a graph on last month, with bluestore_fragmentation_micros and latency, >>>>>> >>>>>> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png >>>>> hmm, so fragmentation grows eventually and drops on OSD restarts, isn't >>>>> it? The same for other OSDs? >>>>> >>>>> This proves some issue with the allocator - generally fragmentation >>>>> might grow but it shouldn't reset on restart. Looks like some intervals >>>>> aren't properly merged in run-time. >>>>> >>>>> On the other side I'm not completely sure that latency degradation is >>>>> caused by that - fragmentation growth is relatively small - I don't see >>>>> how this might impact performance that high. >>>>> >>>>> Wondering if you have OSD mempool monitoring (dump_mempools command >>>>> output on admin socket) reports? Do you have any historic data? >>>>> >>>>> If not may I have current output and say a couple more samples with >>>>> 8-12 hours interval? >>>>> >>>>> >>>>> Wrt to backporting bitmap allocator to mimic - we haven't had such plans >>>>> before that but I'll discuss this at BlueStore meeting shortly. 
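For reference, every *_lat entry in the perf dump above is a cumulative pair (sum in seconds, avgcount), and the avgtime shown is simply sum/avgcount since OSD start or the last counter reset; e.g. op_w_latency is 33160.72 / 10546037, about 0.00314 s, which matches the printed avgtime. A minimal sketch that recomputes a few of these from a saved dump and converts bluestore_fragmentation_micros using the counter description quoted above; the file name and the choice of counters are only examples:

#!/usr/bin/env python
# Minimal sketch: recompute average latencies (sum / avgcount) from a saved
# "ceph daemon osd.N perf dump" and show the fragmentation metric.
import json

with open("osd.4.perf.json") as f:      # hypothetical file name
    perf = json.load(f)

for name in ("op_latency", "op_w_latency", "op_w_process_latency", "subop_w_latency"):
    c = perf["osd"][name]
    if c["avgcount"]:
        print("%-24s %.6f s" % (name, c["sum"] / c["avgcount"]))

# Per the description quoted above, bluestore_fragmentation_micros is
# (free extents / max possible number of free extents) * 1000,
# so a value of 67 reads as roughly 6.7%.
frag = perf["bluestore"]["bluestore_fragmentation_micros"]
print("fragmentation: %.1f%%" % (frag / 10.0))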
>>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Igor >>>>> >>>>>> ----- Mail original ----- >>>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> >>>>>> À: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>>> Cc: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>> Envoyé: Lundi 4 Février 2019 16:04:38 >>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>> >>>>>> Thanks Igor, >>>>>> >>>>>>>> Could you please collect BlueStore performance counters right after OSD >>>>>>>> startup and once you get high latency. >>>>>>>> >>>>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. >>>>>> I'm already monitoring with >>>>>> "ceph daemon osd.x perf dump ", (I have 2months history will all counters) >>>>>> >>>>>> but I don't see l_bluestore_fragmentation counter. >>>>>> >>>>>> (but I have bluestore_fragmentation_micros) >>>>>> >>>>>> >>>>>>>> Also if you're able to rebuild the code I can probably make a simple >>>>>>>> patch to track latency and some other internal allocator's paramter to >>>>>>>> make sure it's degraded and learn more details. >>>>>> Sorry, It's a critical production cluster, I can't test on it :( >>>>>> But I have a test cluster, maybe I can try to put some load on it, and try to reproduce. >>>>>> >>>>>> >>>>>> >>>>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus >>>>>>>> and try the difference... >>>>>> Any plan to backport it to mimic ? (But I can wait for Nautilus) >>>>>> perf results of new bitmap allocator seem very promising from what I've seen in PR. >>>>>> >>>>>> >>>>>> >>>>>> ----- Mail original ----- >>>>>> De: "Igor Fedotov" <ifedotov@xxxxxxx> >>>>>> À: "Alexandre Derumier" <aderumier@xxxxxxxxx>, "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx>, "Mark Nelson" <mnelson@xxxxxxxxxx> >>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>> Envoyé: Lundi 4 Février 2019 15:51:30 >>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>> >>>>>> Hi Alexandre, >>>>>> >>>>>> looks like a bug in StupidAllocator. >>>>>> >>>>>> Could you please collect BlueStore performance counters right after OSD >>>>>> startup and once you get high latency. >>>>>> >>>>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. >>>>>> >>>>>> Also if you're able to rebuild the code I can probably make a simple >>>>>> patch to track latency and some other internal allocator's paramter to >>>>>> make sure it's degraded and learn more details. >>>>>> >>>>>> >>>>>> More vigorous fix would be to backport bitmap allocator from Nautilus >>>>>> and try the difference... >>>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Igor >>>>>> >>>>>> >>>>>> On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: >>>>>>> Hi again, >>>>>>> >>>>>>> I speak too fast, the problem has occured again, so it's not tcmalloc cache size related. 
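As a rough sketch of the periodic collection Igor asks for (counters right after startup, then again once the latency has degraded), a loop like the one below can run on the OSD host, assuming the admin socket is reachable; the osd id, the 10-minute interval and the output path are arbitrary examples:

#!/usr/bin/env python
# Minimal sketch: poll one OSD's admin socket and append the fragmentation
# counter plus the cumulative bluestore commit latency to a CSV file.
import json
import subprocess
import time

OSD = "osd.0"                          # example osd id
OUT = "/var/tmp/%s_frag.csv" % OSD     # example output path

while True:
    raw = subprocess.check_output(["ceph", "daemon", OSD, "perf", "dump"])
    perf = json.loads(raw.decode("utf-8"))
    frag = perf["bluestore"]["bluestore_fragmentation_micros"]
    commit = perf["bluestore"]["commit_lat"]
    # commit_lat is cumulative since OSD start (or the last perf reset), so the
    # average drifts slowly; deltas between rows give the per-interval figure.
    avg = commit["sum"] / commit["avgcount"] if commit["avgcount"] else 0.0
    with open(OUT, "a") as f:
        f.write("%d,%d,%.6f\n" % (time.time(), frag, avg))
    time.sleep(600)                    # example: one sample every 10 minutes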
>>>>>>> >>>>>>> >>>>>>> I have notice something using a simple "perf top", >>>>>>> >>>>>>> each time I have this problem (I have seen exactly 4 times the same behaviour), >>>>>>> >>>>>>> when latency is bad, perf top give me : >>>>>>> >>>>>>> StupidAllocator::_aligned_len >>>>>>> and >>>>>>> btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempoo >>>>>>> l::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long >>>>>>> const, unsigned long>*>::increment_slow() >>>>>>> >>>>>>> (around 10-20% time for both) >>>>>>> >>>>>>> >>>>>>> when latency is good, I don't see them at all. >>>>>>> >>>>>>> >>>>>>> I have used the Mark wallclock profiler, here the results: >>>>>>> >>>>>>> http://odisoweb1.odiso.net/gdbpmp-ok.txt >>>>>>> >>>>>>> http://odisoweb1.odiso.net/gdbpmp-bad.txt >>>>>>> >>>>>>> >>>>>>> here an extract of the thread with btree::btree_iterator && StupidAllocator::_aligned_len >>>>>>> >>>>>>> >>>>>>> + 100.00% clone >>>>>>> + 100.00% start_thread >>>>>>> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() >>>>>>> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) >>>>>>> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) >>>>>>> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) >>>>>>> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&) >>>>>>> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&) >>>>>>> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr<OpRequest>) >>>>>>> | | + 68.00% ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>) >>>>>>> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>) >>>>>>> | | + 67.00% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>) >>>>>>> | | | + 67.00% BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*) >>>>>>> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*) >>>>>>> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) >>>>>>> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int) >>>>>>> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*) >>>>>>> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*) >>>>>>> | | | | | | + 64.00% StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned long*, unsigned int*) >>>>>>> | | | | | | + 34.00% 
btree::btree_iterator<btree::btree_node<btree::btree_map_params<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >, std::pair<unsigned long const, unsigned long>&, std::pair<unsigned long const, unsigned long>*>::increment_slow() >>>>>>> | | | | | | + 26.00% StupidAllocator::_aligned_len(interval_set<unsigned long, btree::btree_map<unsigned long, unsigned long, std::less<unsigned long>, mempool::pool_allocator<(mempool::pool_index_t)1, std::pair<unsigned long const, unsigned long> >, 256> >::iterator, unsigned long) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ----- Mail original ----- >>>>>>> De: "Alexandre Derumier" <aderumier@xxxxxxxxx> >>>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>> Envoyé: Lundi 4 Février 2019 09:38:11 >>>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> some news: >>>>>>> >>>>>>> I have tried with different transparent hugepage values (madvise, never) : no change >>>>>>> >>>>>>> I have tried to increase bluestore_cache_size_ssd to 8G: no change >>>>>>> >>>>>>> I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to be sure) >>>>>>> >>>>>>> >>>>>>> Note that this behaviour seem to happen really faster (< 2 days) on my big nvme drives (6TB), >>>>>>> my others clusters user 1,6TB ssd. >>>>>>> >>>>>>> Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by osd), but I'll try this week with 2osd by nvme, to see if it's helping. >>>>>>> >>>>>>> >>>>>>> BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 2.26 (which have also thread cache) ? >>>>>>> >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> Alexandre >>>>>>> >>>>>>> >>>>>>> ----- Mail original ----- >>>>>>> De: "aderumier" <aderumier@xxxxxxxxx> >>>>>>> À: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>> Envoyé: Mercredi 30 Janvier 2019 19:58:15 >>>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>>> >>>>>>>>> Thanks. Is there any reason you monitor op_w_latency but not >>>>>>>>> op_r_latency but instead op_latency? >>>>>>>>> >>>>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? >>>>>>> I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). >>>>>>> >>>>>>> I just don't see latency difference on reads. (or they are very very small vs the write latency increase) >>>>>>> >>>>>>> >>>>>>> >>>>>>> ----- Mail original ----- >>>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>> À: "aderumier" <aderumier@xxxxxxxxx> >>>>>>> Cc: "Sage Weil" <sage@xxxxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>> Envoyé: Mercredi 30 Janvier 2019 19:50:20 >>>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: >>>>>>>> Hi Stefan, >>>>>>>> >>>>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc >>>>>>>>>> like suggested. This report makes me a little nervous about my change. 
>>>>>>>> Well,I'm really not sure that it's a tcmalloc bug. >>>>>>>> maybe bluestore related (don't have filestore anymore to compare) >>>>>>>> I need to compare with bigger latencies >>>>>>>> >>>>>>>> here an example, when all osd at 20-50ms before restart, then after restart (at 21:15), 1ms >>>>>>>> http://odisoweb1.odiso.net/latencybad.png >>>>>>>> >>>>>>>> I observe the latency in my guest vm too, on disks iowait. >>>>>>>> >>>>>>>> http://odisoweb1.odiso.net/latencybadvm.png >>>>>>>> >>>>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which >>>>>>>>>> exact values out of the daemon do you use for bluestore? >>>>>>>> here my influxdb queries: >>>>>>>> >>>>>>>> It take op_latency.sum/op_latency.avgcount on last second. >>>>>>>> >>>>>>>> >>>>>>>> SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) >>>>>>>> >>>>>>>> >>>>>>>> SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) >>>>>>>> >>>>>>>> >>>>>>>> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) >>>>>>> Thanks. Is there any reason you monitor op_w_latency but not >>>>>>> op_r_latency but instead op_latency? >>>>>>> >>>>>>> Also why do you monitor op_w_process_latency? but not op_r_process_latency? >>>>>>> >>>>>>> greets, >>>>>>> Stefan >>>>>>> >>>>>>>> ----- Mail original ----- >>>>>>>> De: "Stefan Priebe, Profihost AG" <s.priebe@xxxxxxxxxxxx> >>>>>>>> À: "aderumier" <aderumier@xxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx> >>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>>> Envoyé: Mercredi 30 Janvier 2019 08:45:33 >>>>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> here some new results, >>>>>>>>> different osd/ different cluster >>>>>>>>> >>>>>>>>> before osd restart latency was between 2-5ms >>>>>>>>> after osd restart is around 1-1.5ms >>>>>>>>> >>>>>>>>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) >>>>>>>>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) >>>>>>>>> http://odisoweb1.odiso.net/cephperf2/diff.txt >>>>>>>>> >>>>>>>>> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm wrong. >>>>>>>>> (I'm using tcmalloc 2.5-2.2) >>>>>>>> currently i'm in the process of switching back from jemalloc to tcmalloc >>>>>>>> like suggested. This report makes me a little nervous about my change. >>>>>>>> >>>>>>>> Also i'm currently only monitoring latency for filestore osds. Which >>>>>>>> exact values out of the daemon do you use for bluestore? >>>>>>>> >>>>>>>> I would like to check if i see the same behaviour. 
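For reference, the non_negative_derivative queries above boil down to taking two consecutive samples of a counter and dividing the growth of .sum by the growth of .avgcount, which gives the average latency over that interval rather than since OSD start. A minimal sketch of the same calculation from two saved perf dumps; the sample file names are hypothetical:

#!/usr/bin/env python
# Minimal sketch of what the InfluxDB queries compute: average op_w_latency
# over the interval between two perf dumps, i.e. delta(sum) / delta(avgcount).
import json

def counter(path, name):
    with open(path) as f:
        return json.load(f)["osd"][name]

prev = counter("perf_t0.json", "op_w_latency")   # hypothetical sample files
curr = counter("perf_t1.json", "op_w_latency")

d_sum = curr["sum"] - prev["sum"]
d_count = curr["avgcount"] - prev["avgcount"]
if d_count > 0:
    print("op_w_latency over the interval: %.3f ms" % (1000.0 * d_sum / d_count))
else:
    print("no write ops completed in this interval")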
>>>>>>>> >>>>>>>> Greets, >>>>>>>> Stefan >>>>>>>> >>>>>>>>> ----- Mail original ----- >>>>>>>>> De: "Sage Weil" <sage@xxxxxxxxxxxx> >>>>>>>>> À: "aderumier" <aderumier@xxxxxxxxx> >>>>>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> >>>>>>>>> Envoyé: Vendredi 25 Janvier 2019 10:49:02 >>>>>>>>> Objet: Re: ceph osd commit latency increase over time, until restart >>>>>>>>> >>>>>>>>> Can you capture a perf top or perf record to see where teh CPU time is >>>>>>>>> going on one of the OSDs wth a high latency? >>>>>>>>> >>>>>>>>> Thanks! >>>>>>>>> sage >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I have a strange behaviour of my osd, on multiple clusters, >>>>>>>>>> >>>>>>>>>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, >>>>>>>>>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd export-diff/snapshotdelete each day for backup >>>>>>>>>> >>>>>>>>>> When the osd are refreshly started, the commit latency is between 0,5-1ms. >>>>>>>>>> >>>>>>>>>> But overtime, this latency increase slowly (maybe around 1ms by day), until reaching crazy >>>>>>>>>> values like 20-200ms. >>>>>>>>>> >>>>>>>>>> Some example graphs: >>>>>>>>>> >>>>>>>>>> http://odisoweb1.odiso.net/osdlatency1.png >>>>>>>>>> http://odisoweb1.odiso.net/osdlatency2.png >>>>>>>>>> >>>>>>>>>> All osds have this behaviour, in all clusters. >>>>>>>>>> >>>>>>>>>> The latency of physical disks is ok. (Clusters are far to be full loaded) >>>>>>>>>> >>>>>>>>>> And if I restart the osd, the latency come back to 0,5-1ms. >>>>>>>>>> >>>>>>>>>> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory bug ? >>>>>>>>>> >>>>>>>>>> Any Hints for counters/logs to check ? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> >>>>>>>>>> Alexandre >>>>>>>>>> >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> ceph-users mailing list >>>>>>>>> ceph-users@xxxxxxxxxxxxxx >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>>> >>>>> >>>> >>>> >>> >>> >>> _______________________________________________ >>> ceph-users mailing list >>> ceph-users@xxxxxxxxxxxxxx >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
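As a rough sketch of the capture Sage asks for above (a perf record of a high-latency OSD, to see where the CPU time goes), assuming perf is installed and this is run as root on the OSD host; the pid must be the slow ceph-osd process and the 30-second duration is arbitrary:

#!/usr/bin/env python
# Minimal sketch: record a 30 s call-graph profile of one ceph-osd process
# with perf and print the report.
import subprocess
import sys

pid = sys.argv[1]                      # pid of the high-latency ceph-osd
subprocess.check_call(["perf", "record", "-g", "-p", pid, "--", "sleep", "30"])
report = subprocess.check_output(["perf", "report", "--stdio"])
# In the bad case described in this thread, symbols such as
# StupidAllocator::_aligned_len and btree_iterator<...>::increment_slow
# are the ones that show up near the top of this report.
sys.stdout.write(report.decode("utf-8", "replace"))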