Re: ceph osd commit latency increase over time, until restart

Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> · Wed, 30 Jan 2019 08:45:33 +0100

Hi,

On 30.01.19 at 08:33, Alexandre DERUMIER wrote:
> Hi,
> 
> here are some new results,
> from a different OSD on a different cluster.
> 
> Before the OSD restart, latency was between 2 and 5 ms;
> after the restart it is around 1-1.5 ms.
> 
> http://odisoweb1.odiso.net/cephperf2/bad.txt  (2-5ms)
> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
> http://odisoweb1.odiso.net/cephperf2/diff.txt
> 
> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong.
> (I'm using tcmalloc 2.5-2.2)

I'm currently in the process of switching back from jemalloc to tcmalloc,
as suggested. This report makes me a little nervous about my change.

Also, I'm currently only monitoring latency for filestore OSDs. Which
exact values from the daemon do you use for bluestore?

I would like to check whether I see the same behaviour.
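
For reference, here is a minimal sketch of how such a value could be sampled from the OSD admin socket. It assumes that `ceph daemon osd.<id> perf dump` returns JSON in which latency counters are cumulative `avgcount`/`sum` pairs with `sum` in seconds, and that the bluestore section exposes a counter named `commit_lat`. That counter name is an assumption and should be checked against the actual dump on your version. The helper names `perf_dump` and `interval_latency_ms` are hypothetical.

```python
import json
import subprocess


def perf_dump(osd_id):
    """Run `ceph daemon osd.<id> perf dump` on the OSD host and parse its JSON."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"]
    )
    return json.loads(out)


def interval_latency_ms(before, after, section="bluestore", counter="commit_lat"):
    """Average latency in ms between two samples of a cumulative counter.

    The section/counter defaults are assumptions; adjust them to whatever
    the perf dump of your OSD actually contains.
    """
    b = before[section][counter]
    a = after[section][counter]
    ops = a["avgcount"] - b["avgcount"]
    if ops <= 0:
        # No operations in the interval, or the OSD restarted and the
        # counters were reset: no meaningful average can be computed.
        return None
    return (a["sum"] - b["sum"]) / ops * 1000.0
```

Taking two dumps some seconds apart and passing them to `interval_latency_ms` gives the average over just that interval, which is what a graph over time needs. The `avgtime` field alone would average over the whole lifetime of the OSD process and hide a slow drift.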

Greets,
Stefan

> 
> ----- Original Message -----
> From: "Sage Weil" <sage@xxxxxxxxxxxx>
> To: "aderumier" <aderumier@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Friday, 25 January 2019 10:49:02
> Subject: Re: ceph osd commit latency increase over time, until restart
> 
> Can you capture a perf top or perf record to see where the CPU time is 
> going on one of the OSDs with a high latency? 
> 
> Thanks! 
> sage 
> 
> 
> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
> 
>>
>> Hi, 
>>
>> I'm seeing strange behaviour from my OSDs, on multiple clusters. 
>>
>> All clusters are running mimic 13.2.1 with bluestore, on SSD or NVMe drives. 
>> The workload is rbd only, with qemu-kvm VMs running with librbd, plus snapshot / rbd export-diff / snapshot delete each day for backup. 
>>
>> When the OSDs are freshly started, the commit latency is between 0.5 and 1 ms. 
>>
>> But over time this latency increases slowly (maybe around 1 ms per day), until it reaches crazy 
>> values like 20-200 ms. 
>>
>> Some example graphs: 
>>
>> http://odisoweb1.odiso.net/osdlatency1.png 
>> http://odisoweb1.odiso.net/osdlatency2.png 
>>
>> All OSDs show this behaviour, in all clusters. 
>>
>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>
>> And if I restart the OSD, the latency comes back to 0.5-1 ms. 
>>
>> This reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory bug? 
>>
>> Any hints for counters/logs to check? 
>>
>>
>> Regards, 
>>
>> Alexandre 
>>
>>
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 