Also, here is the result of "perf diff 1mslatency.perfdata 3mslatency.perfdata":

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt

----- Original message -----
From: "aderumier" <aderumier@xxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, 25 January 2019 17:32:02
Subject: Re: ceph osd commit latency increase over time, until restart

Hi again,

I was able to run perf on it today.

Before the restart, commit latency was between 3 and 5ms.
After the restart at 17:11, latency is around 1ms.

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png

Here are some perf reports:

with 3ms latency:
-----------------
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt

with 1ms latency:
-----------------
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt

I'll retry next week and try to get a bigger latency difference.

Alexandre

----- Original message -----
From: "aderumier" <aderumier@xxxxxxxxx>
To: "Sage Weil" <sage@xxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, 25 January 2019 11:06:51
Subject: Re: ceph osd commit latency increase over time, until restart

>>Can you capture a perf top or perf record to see where the CPU time is
>>going on one of the OSDs with a high latency?

Yes, sure. I'll do it next week and send the results to the mailing list.

Thanks Sage!

----- Original message -----
From: "Sage Weil" <sage@xxxxxxxxxxxx>
To: "aderumier" <aderumier@xxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Sent: Friday, 25 January 2019 10:49:02
Subject: Re: ceph osd commit latency increase over time, until restart

Can you capture a perf top or perf record to see where the CPU time is
going on one of the OSDs with a high latency?

Thanks!
sage

On Fri, 25 Jan 2019, Alexandre DERUMIER wrote:

> Hi,
>
> I am seeing strange behaviour on my OSDs, on multiple clusters.
>
> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives.
> The workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup.
>
> When the OSDs are freshly started, the commit latency is between 0.5 and 1ms.
>
> But over time, this latency increases slowly (maybe around 1ms per day), until it reaches crazy
> values like 20-200ms.
>
> Some example graphs:
>
> http://odisoweb1.odiso.net/osdlatency1.png
> http://odisoweb1.odiso.net/osdlatency2.png
>
> All OSDs show this behaviour, in all clusters.
>
> The latency of the physical disks is OK. (The clusters are far from fully loaded.)
>
> And if I restart the OSD, the latency comes back to 0.5-1ms.
>
> That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug?
>
> Any hints for counters/logs to check?
>
>
> Regards,
>
> Alexandre

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
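
To pick which OSD to profile with perf, as requested earlier in the thread, one option is to rank OSDs by their reported commit latency. The following is only a minimal sketch, not something posted by the participants. It assumes the JSON printed by `ceph osd perf -f json` contains an `osd_perf_infos` list with `id` and `perf_stats.commit_latency_ms` fields, possibly nested under `osdstats` depending on the release; check the actual output on your cluster. The function name, the 3 ms threshold, and the sample data are hypothetical.

```python
import json


def high_commit_latency_osds(perf_json, threshold_ms=3.0):
    """Return (osd_id, commit_latency_ms) pairs above threshold_ms, worst first."""
    data = json.loads(perf_json)
    # Assumption: some releases nest the list under "osdstats", others do not.
    infos = data.get("osdstats", data).get("osd_perf_infos", [])
    flagged = [
        (info["id"], info["perf_stats"]["commit_latency_ms"])
        for info in infos
        if info["perf_stats"]["commit_latency_ms"] > threshold_ms
    ]
    # Sort so the OSD with the highest commit latency comes first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample shaped like the assumed "ceph osd perf -f json" output.
sample = json.dumps({
    "osd_perf_infos": [
        {"id": 1, "perf_stats": {"commit_latency_ms": 1, "apply_latency_ms": 1}},
        {"id": 3, "perf_stats": {"commit_latency_ms": 4, "apply_latency_ms": 4}},
        {"id": 7, "perf_stats": {"commit_latency_ms": 20, "apply_latency_ms": 20}},
    ]
})

for osd_id, latency_ms in high_commit_latency_osds(sample):
    print(f"osd.{osd_id}: commit latency {latency_ms} ms")
```

In practice the string passed in would be the captured output of `ceph osd perf -f json` rather than the sample above, and the first OSD listed would be a reasonable candidate for a perf record run.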