> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: 19 June 2015 13:44
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Ceph EC pool performance benchmarking, high latencies.
>
> On 06/19/2015 07:28 AM, MATHIAS, Bryn (Bryn) wrote:
> > Hi All,
> >
> > I am currently benchmarking Ceph to work out the correct read/write model for getting the optimal cluster throughput and latency.
> >
> > For the moment I am writing 4 MB objects with randomised names to an EC 4+1 pool, using the rados Python interface.
> >
> > Load generation is happening on external machines.
> >
> > Write generation is characterised by the number of IOContexts and the number of simultaneous async writes on those contexts.
> > With one machine, a number of IOContext threads, and 50 simultaneous writes per context, I achieve the following over 300 seconds:
> >
> > Percentile 5 = 0.133775639534
> > Percentile 10 = 0.178686833382
> > Percentile 15 = 0.180827605724
> > Percentile 20 = 0.185487747192
> > Percentile 25 = 0.229317903519
> > Percentile 30 = 0.23066740036
> > Percentile 35 = 0.232764816284
> > Percentile 40 = 0.278827047348
> > Percentile 45 = 0.280579996109
> > Percentile 50 = 0.283169865608
> > Percentile 55 = 0.329843044281
> > Percentile 60 = 0.332481050491
> > Percentile 65 = 0.380337607861
> > Percentile 70 = 0.428911447525
> > Percentile 75 = 0.438932359219
> > Percentile 80 = 0.530071306229
> > Percentile 85 = 0.597331762314
> > Percentile 90 = 0.735066819191
> > Percentile 95 = 1.08006491661
> > Percentile 100 = 11.7352428436
> > Max latency = 11.7352428436, min = 0.0499050617218, mean = 0.43913059745
> > Total objects written = 24552 in 302.979903936 s, giving 81.0350775118/s (324.140310047 MB/s)
> >
> > From two load generators on separate machines I achieve:
> >
> > Percentile 5 = 0.228541088104
> > Percentile 10 = 0.23213224411
> > Percentile 15 = 0.279508590698
> > Percentile 20 = 0.28137254715
> > Percentile 25 = 0.328829288483
> > Percentile 30 = 0.330499911308
> > Percentile 35 = 0.334045898914
> > Percentile 40 = 0.380131435394
> > Percentile 45 = 0.382810294628
> > Percentile 50 = 0.430188417435
> > Percentile 55 = 0.43399245739
> > Percentile 60 = 0.48120136261
> > Percentile 65 = 0.530511438847
> > Percentile 70 = 0.580485081673
> > Percentile 75 = 0.631661534309
> > Percentile 80 = 0.728989124298
> > Percentile 85 = 0.830820584297
> > Percentile 90 = 1.03238985538
> > Percentile 95 = 1.62925363779
> > Percentile 100 = 32.5414278507
> > Max latency = 32.5414278507, min = 0.0375339984894, mean = 0.863403101415
> > Total objects written = 12714 in 325.92741394 s, giving 39.0086855422/s (156.034742169 MB/s)
> >
> > Percentile 5 = 0.229072237015
> > Percentile 10 = 0.247376871109
> > Percentile 15 = 0.280901908875
> > Percentile 20 = 0.329082489014
> > Percentile 25 = 0.331234931946
> > Percentile 30 = 0.379406833649
> > Percentile 35 = 0.381390666962
> > Percentile 40 = 0.429595994949
> > Percentile 45 = 0.43164896965
> > Percentile 50 = 0.480262041092
> > Percentile 55 = 0.529169607162
> > Percentile 60 = 0.533170747757
> > Percentile 65 = 0.582635164261
> > Percentile 70 = 0.634325170517
> > Percentile 75 = 0.72939991951
> > Percentile 80 = 0.829002094269
> > Percentile 85 = 0.931713819504
> > Percentile 90 = 1.18014221191
> > Percentile 95 = 2.08048944473
> > Percentile 100 = 31.1357450485
> > Max latency = 31.1357450485, min = 0.0553231239319, mean = 1.03054529335
> > Total objects written = 10769 in 328.515608788 s, giving 32.7807863978/s (131.123145591 MB/s)
> >
> > Total = 278 MB/s
> >
> > The combined test has much higher latencies and less than half the throughput per box.
> >
> > If I scale this up to 5 nodes all generating load, I see the throughput drop to ~50 MB/s and latencies of up to 60 seconds.
> >
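
As a concrete reference for the write pattern described above, here is a minimal sketch using the python-rados bindings: one IOContext, up to 50 aio_write_full operations in flight, 4 MB objects with randomised names, and per-object latency recorded so percentiles like those quoted can be reproduced. This is not the poster's actual script; the pool name, conffile path and runtime are placeholders, and depending on your Ceph release you may prefer the oncomplete callback over onsafe.

#!/usr/bin/env python
# Sketch of the load-generation pattern: keep IN_FLIGHT aio_write_full
# operations outstanding on one IOContext and record per-object latency.
import os
import time
import threading
import uuid

import rados

POOL = 'ecpool'                 # placeholder pool name
CONF = '/etc/ceph/ceph.conf'    # placeholder conffile
OBJ_SIZE = 4 * 1024 * 1024      # 4 MB objects
IN_FLIGHT = 50                  # simultaneous writes per context
RUNTIME = 300                   # seconds

def main():
    data = os.urandom(OBJ_SIZE)
    latencies = []
    lock = threading.Lock()
    slots = threading.Semaphore(IN_FLIGHT)

    cluster = rados.Rados(conffile=CONF)
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    def on_safe(start):
        # librados calls this from its own thread once the write is committed
        def cb(completion):
            with lock:
                latencies.append(time.time() - start)
            slots.release()
        return cb

    t0 = time.time()
    submitted = 0
    while time.time() - t0 < RUNTIME:
        slots.acquire()                    # cap the number of writes in flight
        name = uuid.uuid4().hex            # randomised object name
        ioctx.aio_write_full(name, data, onsafe=on_safe(time.time()))
        submitted += 1

    for _ in range(IN_FLIGHT):             # drain the outstanding writes
        slots.acquire()
    elapsed = time.time() - t0

    ioctx.close()
    cluster.shutdown()

    latencies.sort()
    for pct in range(5, 101, 5):
        idx = min(len(latencies) - 1, int(len(latencies) * pct / 100.0))
        print('Percentile %d = %s' % (pct, latencies[idx]))
    print('Total objects written = %d in %.2f s (%.1f MB/s)' %
          (submitted, elapsed, submitted * OBJ_SIZE / (1024.0 * 1024.0) / elapsed))

if __name__ == '__main__':
    main()

The drain loop at the end simply re-acquires every slot, so the percentiles also include the writes that were still in flight when the timer expired.
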
> > An example slow write from dump_historic_ops is:
> >
> >     "description": "osd_op(client.1892123.0:1525 \/c18\/vx1907\/kDDb\/180\/4935.ts [] 6.f4d68aae ack+ondisk+write+known_if_redirected e523)",
> >     "initiated_at": "2015-06-19 12:37:54.698848",
> >     "age": 578.438516,
> >     "duration": 38.399151,
> >     "type_data": [
> >         "commit sent; apply or cleanup",
> >         {
> >             "client": "client.1892123",
> >             "tid": 1525
> >         },
> >         [
> >             { "time": "2015-06-19 12:37:54.698848", "event": "initiated" },
> >             { "time": "2015-06-19 12:37:54.856361", "event": "reached_pg" },
> >             { "time": "2015-06-19 12:37:55.095731", "event": "started" },
> >             { "time": "2015-06-19 12:37:55.103645", "event": "started" },
> >             { "time": "2015-06-19 12:37:55.104125", "event": "commit_queued_for_journal_write" },
> >             { "time": "2015-06-19 12:37:55.104900", "event": "write_thread_in_journal_buffer" },
> >             { "time": "2015-06-19 12:37:55.106112", "event": "journaled_completion_queued" },
> >             { "time": "2015-06-19 12:37:55.107065", "event": "sub_op_committed" },
> >             { "time": "2015-06-19 12:37:55.117510", "event": "commit_sent" },
> >             { "time": "2015-06-19 12:37:55.187676", "event": "sub_op_applied" },
> >             { "time": "2015-06-19 12:38:33.097998", "event": "done" }
> >         ]
> >     ]
> > }
> > ]
> > }
> >
> > There is a very large wait time between 'sub_op_applied' and 'done'.
> >
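
To see where the time is going for ops like the one above, something along the following lines can pull dump_historic_ops from an OSD's admin socket and report the worst step per op. This is only a sketch: it must run on the host the OSD lives on, and it assumes the hammer-era JSON layout shown above (the event list as the third element of "type_data", and a numeric "duration"); newer releases nest the events differently and have renamed the key that holds the op list.

#!/usr/bin/env python
# Pull dump_historic_ops from a local OSD admin socket and show, per op,
# which step between two events took the longest (for example the long
# sub_op_applied -> done gap above).  Pass the OSD id as the argument.
import json
import subprocess
import sys
from datetime import datetime

def ts(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

def main(osd_id):
    raw = subprocess.check_output(
        ['ceph', 'daemon', 'osd.%s' % osd_id, 'dump_historic_ops'])
    doc = json.loads(raw)
    # the key holding the op list has changed name across releases
    ops = doc.get('Ops') or doc.get('ops') or []
    for op in ops:
        events = op['type_data'][2]        # hammer-era layout, as shown above
        gaps = [((ts(b['time']) - ts(a['time'])).total_seconds(),
                 a['event'], b['event'])
                for a, b in zip(events, events[1:])]
        if not gaps:
            continue
        worst = max(gaps)
        print('%7.3fs total, worst step %s -> %s (%.3fs): %s' % (
            op['duration'], worst[1], worst[2], worst[0],
            op['description'][:80]))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else '0')
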
> > My Ceph cluster is made up of 5 machines with 13 OSDs each, and 10Gb links between the Ceph nodes on both the public and private networks.
> > There are 5 load-generation blades connected with a 20Gb aggregate connection.
>
> When you say 20Gb/s aggregate, does that mean you are using some kind of bonding config?
>
> > Watching dstat on these machines whilst the tests are running shows they aren't being highly stressed.
> >
> > I have managed to push a lot more through the system, but not when worrying about latency; to do this I tend to run one IOContext and continuously fire aio_write_full operations on it.
> >
> > Does anyone have any suggestions for how to get better performance out of my cluster?

Just out of interest, do any of your journals or disks look like they are getting maxed out? Your latency breakdown seems to indicate that the bulk of requests are being serviced in reasonable time, but around 5% (or less) are taking excessively long for some reason. I'm wondering if something is causing a backlog on the journals.

Regarding your query about the delay between the 'applied' and 'done' events, I wonder if this is just showing the time that the data was sitting in the journal waiting to be flushed? Once the write has been committed to the journal, the write is complete from the client's perspective.

> Regarding latency, there are certainly some dark corners in the ceph code itself, but this seems like something else to me.
>
> One thing that's bitten us a couple of times is the behavior of the network during all-to-all communication. With erasure coding it's especially important because you have more machines involved in every operation and your latency is only as fast as the slowest connection.
>
> It might be worth running some all-to-all network communication tests. We've talked about writing something like this right into ceph (or at least into cbt).
>
> Here's a script you might want to hack up. This is probably what I'll use as a base for integrating network and disk tests into cbt. The iperf part is what you'd primarily be interested in. Be careful with those fio tests, they can be destructive!
>
> #!/bin/bash
> archive_dir=$1
> t=120
> tmp_dir="/tmp/iperf_tests"
> start_host=129
> end_host=160
> hosts="foo[$start_host-$end_host].domain.com"
> fio_size="4M"
> fio_rw="randread"
>
> mkdir -p $archive_dir
> pdsh -R ssh -w $hosts "killall -9 iperf"
> pdsh -R ssh -w $hosts "killall -9 fio"
> pdsh -R ssh -w $hosts "iperf -s" &
> pdsh -R ssh -w $hosts "rm -rf $tmp_dir"
> pdsh -R ssh -w $hosts "mkdir -p $tmp_dir"
>
> case "$2" in
>
> "fio")
>   mkdir -p $archive_dir/fio_only_${fio_size}${fio_rw}
>   pdsh -R ssh -w $hosts "mkdir -p $tmp_dir/fio_only"
>   pdsh -R ssh -w $hosts -f 64 "sudo fio --rw=$fio_rw --ioengine=libaio --numjobs=1 --direct=1 --runtime=$t --bs=$fio_size --iodepth=16 --name=/dev/sdb --name=/dev/sdc --name=/dev/sdd --output $tmp_dir/fio_only/output"
>   rpdcp -R ssh -w $hosts $tmp_dir/fio_only/output* $archive_dir/fio_only_${fio_size}${fio_rw}
>   ;;
>
> "iperf")
>   mkdir -p $archive_dir/iperf_only
>   pdsh -R ssh -w $hosts "mkdir -p $tmp_dir/iperf_only"
>
>   for val in $(eval echo {$start_host..$end_host}); do
>     pdsh -R ssh -w $hosts -f 64 "iperf -c foo${val}.domain.com -f m -t $t -P 4 "'>'" $tmp_dir/iperf_only/"'`hostname -s`'"_to_foo${val}.out" &
>   done
>   sleep $t
>   rpdcp -R ssh -w $hosts $tmp_dir/iperf_only/*.out $archive_dir/iperf_only
>   ;;
> esac
>
> > Bryn
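
Once the iperf portion of that script has run, something like the sketch below can summarise the per-pair logs it leaves in the archive directory and highlight unusually slow links. It assumes iperf2's human-readable "-f m" client output (a [SUM] line when -P is greater than 1) and the <src>_to_<dst>.out naming used in the script; adjust the pattern if your iperf version prints something different.

#!/usr/bin/env python
# Summarise the per-pair iperf client logs collected by the script above
# (for example under $archive_dir/iperf_only) and print the slowest pairs.
import glob
import os
import re
import sys

BW_RE = re.compile(r'([\d.]+)\s+Mbits/sec')

def bandwidth(path):
    """Return the last reported aggregate bandwidth in one iperf log."""
    last_any = last_sum = None
    with open(path) as f:
        for line in f:
            m = BW_RE.search(line)
            if not m:
                continue
            last_any = float(m.group(1))
            if '[SUM]' in line:
                last_sum = float(m.group(1))
    return last_sum if last_sum is not None else last_any

def main(results_dir):
    pairs = []
    for path in glob.glob(os.path.join(results_dir, '*_to_*.out')):
        base = os.path.splitext(os.path.basename(path))[0]
        src, dst = base.rsplit('_to_', 1)
        bw = bandwidth(path)
        if bw is not None:
            pairs.append((bw, src, dst))
    pairs.sort()
    print('Slowest host pairs first:')
    for bw, src, dst in pairs[:10]:
        print('  %s -> %s : %.0f Mbits/sec' % (src, dst, bw))

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else '.')
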