Re: Proposition - latency histogram

Sage Weil <sage@xxxxxxxxxxxx> · Mon, 28 Nov 2016 16:51:38 +0000 (UTC)

On Mon, 28 Nov 2016, Bartłomiej Święcki wrote:
> Hi,
> 
> Currently we can query an OSD for op latency, but it's given as an 
> average. An average may not give the best information in this case, 
> e.g. spikes can easily get hidden in it.
> 
> Instead of an average we could easily do a simple histogram: quantize 
> the latency into a predefined set of time intervals, keep a simple 
> performance counter for each of them, and increment one of them on each 
> op. Since those are per OSD, we could have pretty high resolution with 
> very little memory usage. The performance impact should be negligible, 
> since only one (two if split into read and write) of those counters 
> would be incremented per OSD op.
> 
> In addition we could also do this in 2D, with each counter matching a 
> given latency range and op size range. Having such a 2D table would show 
> the latency histogram, the request size histogram and combinations of 
> those (e.g. the latency histogram of ~4k ops only).
> 
> What do you think about this idea? I can prepare some code - a simple 
> proof of concept looks really straightforward to implement.
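
For illustration, the 1D and 2D counters described above could be sketched 
roughly as follows. This is only a minimal Python sketch, not Ceph code: 
the class name OpHistogram2D, its methods, and the bucket boundaries are 
all hypothetical, since the proposal does not specify any of them. Each 
boundary is treated as an inclusive upper bound, and one extra overflow 
bucket is kept on each axis.

```python
import bisect

# Hypothetical bucket upper bounds (inclusive); the proposal names none.
LATENCY_BOUNDS_US = [100, 1_000, 10_000, 100_000, 1_000_000]
SIZE_BOUNDS_BYTES = [4096, 65536, 1048576]


class OpHistogram2D:
    """2D table of counters indexed by (latency bucket, op size bucket)."""

    def __init__(self, latency_bounds, size_bounds):
        self.latency_bounds = list(latency_bounds)
        self.size_bounds = list(size_bounds)
        # One extra row and column catch values above the last bound.
        self.counts = [[0] * (len(self.size_bounds) + 1)
                       for _ in range(len(self.latency_bounds) + 1)]

    def record(self, latency_us, size_bytes):
        # Exactly one counter is incremented per op.
        row = bisect.bisect_left(self.latency_bounds, latency_us)
        col = bisect.bisect_left(self.size_bounds, size_bytes)
        self.counts[row][col] += 1

    def latency_histogram(self):
        # Sum over all size buckets: the plain 1D latency histogram.
        return [sum(row) for row in self.counts]

    def size_histogram(self):
        # Sum over all latency buckets: the request size histogram.
        return [sum(col) for col in zip(*self.counts)]

    def latency_histogram_for_size(self, size_bytes):
        # Latency histogram restricted to the bucket holding size_bytes.
        col = bisect.bisect_left(self.size_bounds, size_bytes)
        return [row[col] for row in self.counts]
```

Recording an op is two binary searches over short lists plus a single 
increment, and the 1D histograms fall out of the 2D table as row and 
column sums, so nothing extra needs to be stored for them.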

This sounds like a great idea.  I think the main issue is that the data 
won't be easily exposed via the perfcounter interface... at least not in a 
way that generic tools can visualize.  Unless there is a standardish way 
to report histogram metrics?

sage