RE: performance latency analysis with LTTng

Hi Greg, thanks for reviewing my email and for the feedback, especially the high-level list. I knew the ~300us chunk of time in my data included serialization and deserialization, but I didn't realize it also includes the CRUSH map calculation. I don't think I'm hitting the message throttles; I have hit them when using a real backend (vs. memstore), but here I think the latency is low enough that I never get enough bytes in flight. I will double-check that.

The intent is to cover the I/O path from start to finish and drill into the top latency items. I think that by tracing the OID in librados and the tid in the OSD I can link the traces together and get the additional detail; a sketch of what I mean is below. I'll update with the details from your points 1-6.
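
For example, roughly the kind of tracepoint definitions I have in mind, in lttng-gen-tp style (the provider and event names here are illustrative placeholders, not tracepoints that exist in the Ceph tree): one event on the librados side keyed by object name, and one on the OSD side carrying both the object name and the tid, so a post-processing script can join the two streams.

TRACEPOINT_EVENT(
    latency_probe,
    librados_queue_aio_write,
    TP_ARGS(
        const char *, oid
    ),
    TP_FIELDS(
        ctf_string(oid, oid)
    )
)

TRACEPOINT_EVENT(
    latency_probe,
    osd_ms_fast_dispatch,
    TP_ARGS(
        const char *, oid,
        uint64_t, tid
    ),
    TP_FIELDS(
        ctf_string(oid, oid)
        ctf_integer(uint64_t, tid, tid)
    )
)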

Either way, it does seem like #1-8 represent about a quarter of the measured latency (roughly 300us of the ~1200us total) with an OSD using memstore.



Thanks,

Stephen

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: Thursday, December 11, 2014 11:37 AM
To: Blinick, Stephen L
Cc: Andreas Bluemle; Ceph Development
Subject: Re: performance latency analysis with LTTng

On Wed, Dec 10, 2014 at 10:54 PM, Blinick, Stephen L <stephen.l.blinick@xxxxxxxxx> wrote:
> Hi Andreas, thanks for providing this info last week. I've reviewed the data; the annotated spreadsheet is very helpful for interpreting the results. I do think we're seeing similar measurements, even though I'm characterizing a more simplified environment. The key differences are that I'm running with no replication (the pool has size=1) and the memstore backend. The client (smallio_bench) in this environment, with a single OSD and a queue depth of 1 (t=1), is seeing an average latency of about 1200us. My goal was to account for 100% of this latency via traces in order to determine where to focus (with other tools, profilers, etc.). So far I'm covering about 750us of the 1200us, missing the time it takes for the commit message to be dispatched to (and received by) the client.
>
> In order to do this, I had to run the client on the same node as the OSD, and Vishal Verma on my team added some additional traces to the librados code. In queue_aio_write I don't think there's a transaction ID yet, so we traced the OID in several places and then used a Perl script to correlate the events and process the results.
>
> Beyond that, I think the analysis data is very similar to yours; I still need to annotate it. I opted to report the time as "delta since the last trace" instead of time since the start of the transaction, so entries with higher elapsed times indicate that whatever ran between those two tracepoints took a long time, not necessarily the code at the tracepoint itself.
>
> I haven't yet accounted for the entire time seen by the client. From the time the aio_write is queued until ms_fast_dispatch, I have about 300us of time, which is quite large. I didn't trace the pipe reader/writer, and I see that for you that's ~90us, so that would cover about a third of it. Additionally, I would like to trace the dispatch of the commit ack (complete_aio_write), but at that point I don't have an OID or tid to correlate with.
>
> One question about the tracing code: I'm a bit concerned about using statically sized arrays on the stack to hold the timing data. The reason is that adding a tracepoint to a function requires updating the #define for the index, the array size, and the tracepoint itself (three spots). If one of those is missed, it could result in a stack overwrite. Is the reason for this to reduce the number of traces, or to make correlating times easier?
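>
> For what it's worth, a minimal sketch of the alternative I'd lean toward: an enum whose trailing count sizes the array, so adding a tracepoint only touches one spot. The names here are illustrative, not the actual Ceph tracing code.
>
> #include <cstdint>
> #include <ctime>
>
> // Hypothetical sketch, not the existing Ceph code: the trailing TP_COUNT
> // entry keeps the array size in sync with the indices automatically.
> enum trace_point {
>   TP_QUEUE_AIO_WRITE,
>   TP_MS_FAST_DISPATCH,
>   TP_COMMIT_SENT,
>   TP_COUNT              // always last; the array is sized from this
> };
>
> struct op_timing {
>   uint64_t stamp_ns[TP_COUNT];   // no separate #define for the size
> };
>
> static inline void mark(op_timing &t, trace_point tp) {
>   // record a monotonic timestamp for this point in the op's lifetime
>   timespec ts;
>   clock_gettime(CLOCK_MONOTONIC, &ts);
>   t.stamp_ns[tp] = uint64_t(ts.tv_sec) * 1000000000ULL + ts.tv_nsec;
> }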
>
> Does anyone have any ideas about the relatively high time spent between the client and the pipe reader in the OSD?

It sounds like you're not measuring any of the time between handing the write to the client library and the OSD finishing receiving it. I don't have any good intuition about which of those steps takes the most time, but some things that have to happen:
1) librados runs CRUSH calculation on the object
2) librados serializes write operation into a message
3) librados writes into kernel networking stack
4) networking stack does stuff to send
5) networking stack does stuff to receive
6) OSD messenger reads in message
7) OSD messenger deserializes message
8) OSD messenger fast dispatches message

So, lots of that could take up some time. Some of it is covered by throttlers that will block if there's too much data in-flight at any given moment.
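
For context, a byte-count throttle conceptually works something like the following minimal sketch (this is just an illustration of the blocking behavior, not Ceph's actual Throttle class): senders take capacity before sending and block once too many bytes are in flight, and completions return it.

#include <cstdint>
#include <mutex>
#include <condition_variable>

// Minimal illustrative byte-count throttle (not the Ceph implementation).
class ByteThrottle {
  std::mutex m;
  std::condition_variable cv;
  const uint64_t max_bytes;
  uint64_t in_flight = 0;
public:
  explicit ByteThrottle(uint64_t max) : max_bytes(max) {}

  // Called before sending: blocks while the in-flight byte count is too high.
  void get(uint64_t bytes) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return in_flight + bytes <= max_bytes; });
    in_flight += bytes;
  }

  // Called once the message has been consumed: frees capacity for senders.
  void put(uint64_t bytes) {
    std::lock_guard<std::mutex> l(m);
    in_flight -= bytes;
    cv.notify_all();
  }
};

With a queue depth of 1 there is only ever one request's worth of bytes outstanding, so a throttle like this only blocks if its limit is smaller than a single request.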
-Greg