Hi,

Currently I'm using telegraf + influxdb for monitoring.

Note that this bug seems to occur only on writes; I don't see a latency increase on reads.

The counters are op_latency, op_w_latency and op_w_process_latency:

SELECT non_negative_derivative(first("op_latency.sum"), 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)

SELECT non_negative_derivative(first("op_w_latency.sum"), 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)

SELECT non_negative_derivative(first("op_w_process_latency.sum"), 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous)

The dashboard is here:

https://grafana.com/dashboards/7995

----- Original Message -----
From: "Marc Roos" <M.Roos@xxxxxxxxxxxxxxxxx>
To: "aderumier" <aderumier@xxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Sunday, 27 January 2019 12:11:42
Subject: RE: ceph osd commit latency increase over time, until restart

Hi Alexandre,

I was curious whether I have a similar issue. Which value are you monitoring? I have quite a lot to choose from.
Bluestore.commitLat
Bluestore.kvLat
Bluestore.readLat
Bluestore.readOnodeMetaLat
Bluestore.readWaitAioLat
Bluestore.stateAioWaitLat
Bluestore.stateDoneLat
Bluestore.stateIoDoneLat
Bluestore.submitLat
Bluestore.throttleLat
Osd.opBeforeDequeueOpLat
Osd.opRProcessLatency
Osd.opWProcessLatency
Osd.subopLatency
Osd.subopWLatency
Rocksdb.getLatency
Rocksdb.submitLatency
Rocksdb.submitSyncLatency
RecoverystatePerf.repnotrecoveringLatency
RecoverystatePerf.waitupthruLatency
Osd.opRwPrepareLatency
RecoverystatePerf.primaryLatency
RecoverystatePerf.replicaactiveLatency
RecoverystatePerf.startedLatency
RecoverystatePerf.getlogLatency
RecoverystatePerf.initialLatency
RecoverystatePerf.recoveringLatency
ThrottleBluestoreThrottleBytes.wait
RecoverystatePerf.waitremoterecoveryreservedLatency

-----Original Message-----
From: Alexandre DERUMIER [mailto:aderumier@xxxxxxxxx]
Sent: Friday, 25 January 2019 17:40
To: Sage Weil
Cc: ceph-users; ceph-devel
Subject: Re: ceph osd commit latency increase over time, until restart

Also, here is the result of "perf diff 1mslatency.perfdata 3mslatency.perfdata":

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt

----- Original Message -----
From: "aderumier" <aderumier@xxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, 25 January 2019 17:32:02
Subject: Re: ceph osd commit latency increase over time, until restart

Hi again,

I was able to run perf on it today.

Before the restart, commit latency was between 3-5ms;
after the restart at 17:11, latency is around 1ms.

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png

Here are some perf reports:

with 3ms latency:
-----------------
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt

with 1ms latency
-----------------
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt

I'll retry next week, trying to get a bigger latency difference.

Alexandre

----- Original Message -----
From: "aderumier" <aderumier@xxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, 25 January 2019 11:06:51
Subject: Re: ceph osd commit latency increase over time, until restart

>>Can you capture a perf top or perf record to see where the CPU time is
>>going on one of the OSDs with a high latency?

Yes, sure. I'll do it next week and send the result to the mailing list.

Thanks Sage!

----- Original Message -----
From: "Sage Weil" <sage@xxxxxxxxxxxx>
To: "aderumier" <aderumier@xxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, 25 January 2019 10:49:02
Subject: Re: ceph osd commit latency increase over time, until restart

Can you capture a perf top or perf record to see where the CPU time is
going on one of the OSDs with a high latency?

Thanks!
sage

On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:

>
> Hi,
>
> I'm seeing strange behaviour on my OSDs, on multiple clusters.
>
> All clusters are running mimic 13.2.1, bluestore, with ssd or nvme
> drives. The workload is rbd only, with qemu-kvm vms running with librbd +
> snapshot/rbd export-diff/snapshotdelete each day for backup.
>
> When the OSDs are freshly started, the commit latency is between 0.5-1ms.
>
> But over time, this latency increases slowly (maybe around 1ms per day),
> until reaching crazy values like 20-200ms.
>
> Some example graphs:
>
> http://odisoweb1.odiso.net/osdlatency1.png
> http://odisoweb1.odiso.net/osdlatency2.png
>
> All OSDs have this behaviour, in all clusters.
>
> The latency of the physical disks is ok. (The clusters are far from
> fully loaded.)
>
> And if I restart the OSD, the latency comes back to 0.5-1ms.
>
> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug?
>
> Any hints for counters/logs to check?
>
>
> Regards,
>
> Alexandre
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
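
For readers following the InfluxDB queries at the top of the thread: each one divides the change in a counter's "sum" by the change in its "avgcount", which gives the average latency per operation over each interval (the 1s unit cancels out in the ratio). Below is a minimal Python sketch of that arithmetic. It is not the telegraf or InfluxDB implementation; the function name, the sample layout and the sample values are hypothetical, and it assumes "sum" is cumulative latency in seconds and "avgcount" is the cumulative number of operations.

```python
from typing import List, Optional, Tuple

# One sample of a cumulative latency counter:
# (timestamp in seconds, cumulative "sum" in seconds, cumulative "avgcount")
Sample = Tuple[float, float, int]


def interval_latency(samples: List[Sample]) -> List[Tuple[float, Optional[float]]]:
    """Average latency per operation between consecutive samples.

    Mirrors derivative(sum) / derivative(avgcount): the time delta appears
    in both derivatives, so it cancels and only the two deltas matter.
    """
    results = []
    for (_, sum0, cnt0), (t1, sum1, cnt1) in zip(samples, samples[1:]):
        d_sum = sum1 - sum0
        d_cnt = cnt1 - cnt0
        if d_sum < 0 or d_cnt < 0:
            # Counters went backwards (for example after an OSD restart).
            # A non-negative derivative drops such points, so skip them.
            continue
        if d_cnt == 0:
            # No operations completed in this interval: latency undefined.
            results.append((t1, None))
            continue
        results.append((t1, d_sum / d_cnt))
    return results


# Hypothetical samples taken every 10 seconds; the last one simulates a
# counter reset after a restart.
samples = [
    (0.0, 0.0, 0),
    (10.0, 1.0, 1000),   # 1.0 s spread over 1000 ops
    (20.0, 4.0, 2000),   # 3.0 s spread over 1000 ops
    (30.0, 0.5, 500),    # counters reset, interval is skipped
]

for ts, latency in interval_latency(samples):
    print(ts, latency)
```

Computing the ratio over deltas rather than over the raw cumulative values matters for this thread: a lifetime average would hide a slow upward drift, whereas the per-interval value shows the current latency.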