Re: ceph osd commit latency increase over time, until restart

>>If it does, probably only by accident. :) The autotuner in master is 
>>pretty dumb and mostly just grows/shrinks the caches based on the 
>>default ratios but accounts for the memory needed for rocksdb 
>>indexes/filters. It will try to keep the total OSD memory consumption 
>>below the specified limit. It doesn't do anything smart like monitor 
>>whether or not large caches may introduce more latency than small 
>>caches. It actually adds a small amount of additional overhead in the 
>>mempool thread to perform the calculations. If you had a static 
>>workload and tuned the bluestore cache size and ratios perfectly it 
>>would only add extra (albeit fairly minimal with the default settings) 
>>computational cost.

Ok, thanks for the explanation!
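If I understand correctly, the knob to watch is then the total memory target rather than the individual cache sizes. A minimal ceph.conf sketch, assuming the option keeps the osd_memory_target name from master once I'm on 13.2.5:

    [osd]
    # total memory target per OSD daemon; the autotuner
    # grows/shrinks the bluestore caches to stay under it
    osd_memory_target = 4294967296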



>>If perf isn't showing anything conclusive, you might try my wallclock 
>>profiler: http://github.com/markhpc/gdbpmp 

I'll try, thanks
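From a quick look at the README, I suppose the invocation is something like this (correct me if the flags are wrong):

    # attach to a high-latency OSD, collect samples, then dump the call graph
    ./gdbpmp.py -p <pid of the slow ceph-osd> -n 1000 -o osd.gdbpmp
    ./gdbpmp.py -i osd.gdbpmp -t 1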


>>Some other things to watch out for are CPUs switching C states 

For the CPU, C-states are disabled and the CPU always runs at max frequency
(intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1)
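To double-check at runtime that no deeper idle states are left, something like this should do (paths assuming a kernel with the cpuidle sysfs):

    # list the idle states the kernel actually exposes (expect only POLL/C1)
    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    # current frequencies should all sit at max
    grep 'cpu MHz' /proc/cpuinfo | sort -u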


>>and the effect of having transparent huge pages enabled (though I'd be 
>>more concerned about this in terms of memory usage). 

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
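So THP is on 'always' here. If it turns out to matter, switching it at runtime should just be (affecting new allocations, I believe):

    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled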


(also, the server has only 1 socket, so no NUMA problem)
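For completeness, that can be verified with e.g.:

    # should report a single node: NUMA node(s): 1
    lscpu | grep -i numa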

----- Original Message -----
From: "Mark Nelson" <mnelson@xxxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Wednesday 30 January 2019 18:08:08
Subject: Re: ceph osd commit latency increase over time, until restart

On 1/30/19 7:45 AM, Alexandre DERUMIER wrote: 
>>> I don't see any smoking gun here... :/ 
> I need to run tests to compare when latency goes very high, but I need to wait a few more days/weeks. 
> 
> 
>>> The main difference between a warm OSD and a cold one is that on startup 
>>> the bluestore cache is empty. You might try setting the bluestore cache 
>>> size to something much smaller and see if that has an effect on the CPU 
>>> utilization? 
> I will try to test. I also wonder if the new auto memory tuning from Mark could help too? 
> (I'm still on mimic 13.2.1, planning to update to 13.2.5 next month) 
> 
> also, could I check some bluestore-related counters? (onodes, rocksdb, bluestore cache...) 


If it does, probably only by accident. :) The autotuner in master is 
pretty dumb and mostly just grows/shrinks the caches based on the 
default ratios but accounts for the memory needed for rocksdb 
indexes/filters. It will try to keep the total OSD memory consumption 
below the specified limit. It doesn't do anything smart like monitor 
whether or not large caches may introduce more latency than small 
caches. It actually adds a small amount of additional overhead in the 
mempool thread to perform the calculations. If you had a static 
workload and tuned the bluestore cache size and ratios perfectly it 
would only add extra (albeit fairly minimal with the default settings) 
computational cost. 


If perf isn't showing anything conclusive, you might try my wallclock 
profiler: http://github.com/markhpc/gdbpmp 


Some other things to watch out for are CPUs switching C states and the 
effect of having transparent huge pages enabled (though I'd be more 
concerned about this in terms of memory usage). 


Mark 


> 
>>> Note that this doesn't necessarily mean that's what you want. Maybe the 
>>> reason why the CPU utilization is higher is because the cache is warm and 
>>> the OSD is serving more requests per second... 
> Well, currently, the server is really quiet 
> 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util 
> nvme0n1 2,00 515,00 48,00 1182,00 304,00 11216,00 18,73 0,01 0,00 0,00 0,00 0,01 1,20 
> 
> %Cpu(s): 1,5 us, 1,0 sy, 0,0 ni, 97,2 id, 0,2 wa, 0,0 hi, 0,1 si, 0,0 st 
> 
> And this is only with writes, not reads 
> 
> 
> 
> ----- Original Message ----- 
> From: "Sage Weil" <sage@xxxxxxxxxxxx> 
> To: "aderumier" <aderumier@xxxxxxxxx> 
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
> Sent: Wednesday 30 January 2019 14:33:23 
> Subject: Re: ceph osd commit latency increase over time, until restart 
> 
> On Wed, 30 Jan 2019, Alexandre DERUMIER wrote: 
>> Hi, 
>> 
>> here are some new results, 
>> from a different osd / different cluster 
>> 
>> before the osd restart, latency was between 2-5ms 
>> after the osd restart, it is around 1-1.5ms 
>> 
>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
> I don't see any smoking gun here... :/ 
> 
> The main difference between a warm OSD and a cold one is that on startup 
> the bluestore cache is empty. You might try setting the bluestore cache 
> size to something much smaller and see if that has an effect on the CPU 
> utilization? 
> 
> Note that this doesn't necessarily mean that's what you want. Maybe the 
> reason why the CPU utilization is higher is because the cache is warm and 
> the OSD is serving more requests per second... 
> 
> sage 
> 
> 
> 
>> From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm wrong. 
>> 
>> (I'm using tcmalloc 2.5-2.2) 
>> 
>> 
>> ----- Original Message ----- 
>> From: "Sage Weil" <sage@xxxxxxxxxxxx> 
>> To: "aderumier" <aderumier@xxxxxxxxx> 
>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
>> Sent: Friday 25 January 2019 10:49:02 
>> Subject: Re: ceph osd commit latency increase over time, until restart 
>> 
>> Can you capture a perf top or perf record to see where the CPU time is 
>> going on one of the OSDs with a high latency? 
>> 
>> Thanks! 
>> sage 
>> 
>> 
>> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
>> 
>>> Hi, 
>>> 
>>> I'm seeing a strange behaviour of my OSDs, on multiple clusters. 
>>> 
>>> All clusters are running mimic 13.2.1, bluestore, with SSD or NVMe drives; 
>>> the workload is rbd only, with qemu-kvm VMs running with librbd + snapshot/rbd export-diff/snapshot delete each day for backup. 
>>> 
>>> When the OSDs are freshly started, the commit latency is between 0.5-1ms. 
>>> 
>>> But over time, this latency increases slowly (maybe around 1ms per day), until reaching crazy 
>>> values like 20-200ms. 
>>> 
>>> Some example graphs: 
>>> 
>>> http://odisoweb1.odiso.net/osdlatency1.png 
>>> http://odisoweb1.odiso.net/osdlatency2.png 
>>> 
>>> All osds have this behaviour, in all clusters. 
>>> 
>>> The latency of the physical disks is OK. (The clusters are far from fully loaded.) 
>>> 
>>> And if I restart the OSD, the latency comes back to 0.5-1ms. 
>>> 
>>> That reminds me of the old tcmalloc bug, but maybe it could be a bluestore memory bug? 
>>> 
>>> Any hints for counters/logs to check? 
>>> 
>>> 
>>> Regards, 
>>> 
>>> Alexandre 
>>> 
>>> 
>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



