performance latency analysis with LTTng

[trying a 2nd time without attachments]

This is a follow-on to the discussion in the 12/3 performance meeting.  I wanted to share what we've done so far with IO path analysis and LTTng, and I'd like to see if we can establish a general methodology we could use to go after different areas.

We've been characterizing the memstore IO path for a while, using VTune and other tools.  The idea with this tracing was to get a high-level a->b->c set of points and timings, and then drill down into the areas that are larger contributors to latency.  Looking from the point of view of the client (i.e. rados bench), there are still some gaps in the latency being totaled up by the trace-processing script.  Basically:


1)  Instrument the code, install it, and run the benchmark with LTTng tracing enabled (qd=1, single OSD, memstore, logging disabled, 4K writes):

	Total time run:         20.001899
	Total writes made:      16136
	Write size:             4096
	Bandwidth (MB/sec):     3.151 

	Stddev Bandwidth:       0.721435
	Max bandwidth (MB/sec): 3.39453
	Min bandwidth (MB/sec): 0
	Average Latency:        0.00123722
	Stddev Latency:         0.000103216
	Max latency:            0.002637
	Min latency:            0.000401

2)  Grab the LTTng trace (excerpt):

[09:34:23.661871204] (+0.000003133) cephtestnode1 ioctximpl:aio_write: { cpu_id = 12 }, { pthread_id = 140262776686656 }, { oid = "benchmark_data_cephtestnode1_37038_object16" }
[09:34:23.661873274] (+0.000002070) cephtestnode1 ioctximpl:queue_aio_write: { cpu_id = 12 }, { pthread_id = 140262776686656 }, { num = 0 }
.....
[09:34:23.662166430] (+0.000206888) cephtestnode1 osd:ms_fast_dispatch: { cpu_id = 19 }, { pthread_id = 139986434324224 }, { oid = "benchmark_data_cephtestnode1_37038_object16", type = 8, num = 4107, tid = 17, inc = 0 }
[09:34:23.662176713] (+0.000010283) cephtestnode1 oprequest:set_rmw_flags: { cpu_id = 19 }, { pthread_id = 139986434324224 }, { type = 8, num = 4107, tid = 17, inc = 0, flag = 0x4, old_rmw_flags = 0x0, new_rmw_flags = 0x4 }
[09:34:23.662193699] (+0.000016986) cephtestnode1 osd:enqueue_op: { cpu_id = 19 }, { pthread_id = 139986434324224 }, { oid = "benchmark_data_cephtestnode1_37038_object16" }
......
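
(As an aside: if the babeltrace Python bindings happen to be installed, the CTF trace can also be walked directly rather than parsing the text dump.  A minimal sketch follows, with the API usage from memory, so please double-check it against the babeltrace docs; the trace path is a placeholder.)

import babeltrace

col = babeltrace.TraceCollection()
# Point this at the session output directory babeltrace would normally be given.
col.add_traces_recursive('/path/to/lttng/trace', 'ctf')

for event in col.events:
    # event.timestamp is in nanoseconds; fields are accessed dict-style.
    oid = event['oid'] if 'oid' in event else None
    print(event.timestamp, event.name, oid)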

3) Process it with a Perl script, using the OID (traced in some spots) to track a single op.  When an op completes (i.e. hits the last tracepoint we have instrumented), we grab the timestamps of all the preceding tracepoints and produce the following info, which shows the thread ID and the delta in time (microseconds) since the previous trace (a rough Python sketch of this bookkeeping follows the example).  Example:
	Op Transition list for benchmark_data_cephtestnode1_37038_object53, 18 states
	  - [ ThrID   us ]
	  - [ 86656 +0000] Start-> ioctximpl:aio_write:
	  - [ 86656 +0010]     -> ioctximpl:queue_aio_write:
	  - [ 24224 +0300]     -> osd:ms_fast_dispatch:
	  - [ 24224 +0011]     -> oprequest:set_rmw_flags:
	  - [ 24224 +0017]     -> osd:enqueue_op:
	  - [ 24224 +0006]     -> osd:shard_index_enqueue:
	  - [ 24224 +0013]     -> pg:queue_op:
	  - [ 94688 +0068]     -> osd:opwq_process_start:
	  - [ 94688 +0005]     -> osd:dequeue_op:
	  - [ 94688 +0003]     -> mark_flagpt:reached_pg
	  - [ 94688 +0049]     -> mark_flagpt:started
	  - [ 94688 +0005]     -> mark_flagpt:started_2
	  - [ 94688 +0005]     -> osd:prepare_tx_enter:
	  - [ 94688 +0003]     -> osd:do_osd_op_pre:
	  - [ 94688 +0002]     -> osd:do_osd_op_pre_write:
	  - [ 94688 +0015]     -> osd:do_osd_op_post:
	  - [ 94688 +0026]     -> osd:prepare_tx_exit:
	  - [ 94688 +0188]     -> osd:opwq_process_finish:
	  === > Total: 726 us
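
For anyone who wants to try the same thing without the Perl script, here is a rough Python sketch of that bookkeeping.  It parses babeltrace's one-event-per-line text output (format as in the excerpt in step 2); the regexes, the oid/pthread_id field names, and the use of osd:opwq_process_finish as the terminal tracepoint are taken from the examples above, so treat it as an illustration rather than the actual script.  Events that don't carry an oid field (e.g. oprequest:set_rmw_flags) would need extra correlation, which is glossed over here.

#!/usr/bin/env python
# Sketch of the step-3 bookkeeping: group tracepoints by oid and, when the
# last instrumented point fires, print the per-transition deltas in us.
import re
import sys
from collections import defaultdict

# e.g. [09:34:23.661871204] (+0.000003133) cephtestnode1 osd:enqueue_op: { ... }
LINE_RE = re.compile(
    r'\[(\d+):(\d+):(\d+)\.(\d{9})\]\s+\([^)]*\)\s+\S+\s+(\S+):\s+(.*)')
OID_RE = re.compile(r'oid = "([^"]+)"')
TID_RE = re.compile(r'pthread_id = (\d+)')

LAST_EVENT = 'osd:opwq_process_finish'   # last tracepoint we instrument

def parse(line):
    """Return (timestamp_ns, event_name, oid, pthread_id) or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    h, mi, s = int(m.group(1)), int(m.group(2)), int(m.group(3))
    ts_ns = (h * 3600 + mi * 60 + s) * 10**9 + int(m.group(4))
    oid = OID_RE.search(m.group(6))
    tid = TID_RE.search(m.group(6))
    return (ts_ns, m.group(5),
            oid.group(1) if oid else None,
            tid.group(1) if tid else None)

def main(trace_txt):
    ops = defaultdict(list)              # oid -> [(ts_ns, event, tid), ...]
    for line in open(trace_txt):
        rec = parse(line)
        if rec is None or rec[2] is None:
            continue                     # no oid: would need tid correlation
        ts, name, oid, tid = rec
        ops[oid].append((ts, name, tid))
        if name == LAST_EVENT:           # op finished: dump its transitions
            states = sorted(ops.pop(oid))
            print('Op transition list for %s, %d states' % (oid, len(states)))
            prev = states[0][0]
            for ts, name, tid in states:
                print('  - [ %s +%04d ]  -> %s'
                      % ((tid or '?')[-5:], (ts - prev) // 1000, name))
                prev = ts
            print('  ===> Total: %d us' % ((states[-1][0] - states[0][0]) // 1000))

if __name__ == '__main__':
    main(sys.argv[1])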


In order to see a trace of the entire run (vs. just looking at one op or an average across thousands), I generated a CSV with all of the ops as well and put them in a graph.
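
In case it's useful, the per-run CSV can be produced with a few more lines like the following.  The column choice is mine, not necessarily what my script emits, and the summary print is just a quick sanity check against the rados bench numbers.

import csv

def write_op_csv(rows, path='op_latencies.csv'):
    # rows: (completion_ts_ns, oid, total_us) tuples, one appended each time
    # the terminal tracepoint fires in the loop sketched above (so they are
    # already in completion order).
    with open(path, 'w') as f:
        w = csv.writer(f)
        w.writerow(['completion_time_s', 'oid', 'total_us'])
        for ts_ns, oid, total_us in rows:
            w.writerow([ts_ns / 1e9, oid, total_us])
    # Quick cross-check against rados bench: average latency and rough IOPS.
    if rows:
        span_s = (rows[-1][0] - rows[0][0]) / 1e9
        avg_us = sum(r[2] for r in rows) / float(len(rows))
        print('%d ops, avg %.0f us, ~%.0f IOPS'
              % (len(rows), avg_us, len(rows) / span_s if span_s else 0))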

The total time tracked by the script is about 750us, while rados bench reports an average latency of about 1.2ms.  So there are still some missing pieces; I believe a fair amount of the gap is in the commit being seen by the client.  The IOPS measured by the script (counting the 'process_finish' traces) match closely with what rados bench is reporting.

We have a lot of stuff turned off by using a single replica and memstore, but it seems the latency tracked here will still contribute in fuller configurations.  Any thoughts on how to better track a single op back to the client thread would be welcome, as would any other approaches.  Thanks!

I've attached some files with more detail, but I've had trouble with attachments in the past, so I'm not sure they'll go through.

Thanks,

Stephen
--



