Re: Ceph EC pool performance benchmarking, high latencies.


 



On 06/19/2015 07:28 AM, MATHIAS, Bryn (Bryn) wrote:
Hi All,

I am currently benchmarking Ceph to work out the correct read/write model to get the optimal cluster throughput and latency.

For the moment I am writing 4 MB objects with randomised names to an EC 4+1 pool using the rados python interface.

Load generation is happening on external machines.

Write generation is characterised by the number of IOContexts and the number of simultaneous async writes on those contexts.
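
For reference, the write pattern described here looks roughly like the sketch below using the python-rados bindings. This is only a minimal, untested sketch, not the actual benchmark code: the pool name, object count and object size are placeholders, and per-operation latency is recorded in the completion callback.

#!/usr/bin/env python
# Minimal sketch: one IOContext, a bounded window of in-flight
# aio_write_full calls, per-op latency recorded in the completion
# callback.  POOL, NUM_OBJECTS and OBJ_SIZE are placeholders.
import random
import string
import threading
import time

import rados

POOL = 'ecpool'                  # assumed EC 4+1 pool name
OBJ_SIZE = 4 * 1024 * 1024       # 4 MB objects
NUM_OBJECTS = 1000
IN_FLIGHT = 50                   # simultaneous writes per IOContext

payload = b'\0' * OBJ_SIZE
latencies = []
lock = threading.Lock()
window = threading.Semaphore(IN_FLIGHT)

def rand_name():
    return ''.join(random.choice(string.ascii_letters) for _ in range(16))

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

completions = []
start = time.time()
for _ in range(NUM_OBJECTS):
    window.acquire()                           # cap simultaneous writes
    t0 = time.time()

    def on_complete(completion, t0=t0):        # runs in a librados callback thread
        with lock:
            latencies.append(time.time() - t0)
        window.release()

    completions.append(
        ioctx.aio_write_full(rand_name(), payload, oncomplete=on_complete))

for c in completions:
    c.wait_for_complete()                      # drain any remaining writes
elapsed = time.time() - start

latencies.sort()
for p in range(5, 101, 5):
    idx = min(len(latencies) * p // 100, len(latencies) - 1)
    print('Percentile %d = %s' % (p, latencies[idx]))
print('Total objects written = %d in %.1fs (%.1f MB/s)'
      % (NUM_OBJECTS, elapsed, NUM_OBJECTS * OBJ_SIZE / elapsed / 1e6))

ioctx.close()
cluster.shutdown()
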
With one machine, IOContext threads and 50 simultaneous writes per context, I achieve the following over 300 seconds:

Percentile 5 = 0.133775639534
Percentile 10 = 0.178686833382
Percentile 15 = 0.180827605724
Percentile 20 = 0.185487747192
Percentile 25 = 0.229317903519
Percentile 30 = 0.23066740036
Percentile 35 = 0.232764816284
Percentile 40 = 0.278827047348
Percentile 45 = 0.280579996109
Percentile 50 = 0.283169865608
Percentile 55 = 0.329843044281
Percentile 60 = 0.332481050491
Percentile 65 = 0.380337607861
Percentile 70 = 0.428911447525
Percentile 75 = 0.438932359219
Percentile 80 = 0.530071306229
Percentile 85 = 0.597331762314
Percentile 90 = 0.735066819191
Percentile 95 = 1.08006491661
Percentile 100 = 11.7352428436
Max latencies = 11.7352428436, Min = 0.0499050617218, mean = 0.43913059745
Total objects written = 24552 in time 302.979903936s gives 81.0350775118/s (324.140310047 MB/s)



From two load generators on separate machines I achieve:


Percentile 5 = 0.228541088104
Percentile 10 = 0.23213224411
Percentile 15 = 0.279508590698
Percentile 20 = 0.28137254715
Percentile 25 = 0.328829288483
Percentile 30 = 0.330499911308
Percentile 35 = 0.334045898914
Percentile 40 = 0.380131435394
Percentile 45 = 0.382810294628
Percentile 50 = 0.430188417435
Percentile 55 = 0.43399245739
Percentile 60 = 0.48120136261
Percentile 65 = 0.530511438847
Percentile 70 = 0.580485081673
Percentile 75 = 0.631661534309
Percentile 80 = 0.728989124298
Percentile 85 = 0.830820584297
Percentile 90 = 1.03238985538
Percentile 95 = 1.62925363779
Percentile 100 = 32.5414278507
Max latencies = 32.5414278507, Min = 0.0375339984894, mean = 0.863403101415
Total objects written = 12714 in time 325.92741394s gives 39.0086855422/s (156.034742169 MB/s)


Percentile 5 = 0.229072237015
Percentile 10 = 0.247376871109
Percentile 15 = 0.280901908875
Percentile 20 = 0.329082489014
Percentile 25 = 0.331234931946
Percentile 30 = 0.379406833649
Percentile 35 = 0.381390666962
Percentile 40 = 0.429595994949
Percentile 45 = 0.43164896965
Percentile 50 = 0.480262041092
Percentile 55 = 0.529169607162
Percentile 60 = 0.533170747757
Percentile 65 = 0.582635164261
Percentile 70 = 0.634325170517
Percentile 75 = 0.72939991951
Percentile 80 = 0.829002094269
Percentile 85 = 0.931713819504
Percentile 90 = 1.18014221191
Percentile 95 = 2.08048944473
Percentile 100 = 31.1357450485
Max latencies = 31.1357450485, Min = 0.0553231239319, mean = 1.03054529335
Total objects written = 10769 in time 328.515608788s gives 32.7807863978/s (131.123145591 MB/s)

Total = ~287 MB/s


The combined test shows much higher latencies and less than half the throughput per box.

If I scale this up to 5 nodes all generating load, I see the throughput drop to ~50 MB/s and latencies of up to 60 seconds.


An example slow write from dump_historic_ops is:

             "description": "osd_op(client.1892123.0:1525 \/c18\/vx1907\/kDDb\/180\/4935.ts [] 6.f4d68aae ack+ondisk+write+known_if_redirected e523)",
             "initiated_at": "2015-06-19 12:37:54.698848",
             "age": 578.438516,
             "duration": 38.399151,
             "type_data": [
                 "commit sent; apply or cleanup",
                 {
                     "client": "client.1892123",
                     "tid": 1525
                 },
                 [
                     {
                         "time": "2015-06-19 12:37:54.698848",
                         "event": "initiated"
                     },
                     {
                         "time": "2015-06-19 12:37:54.856361",
                         "event": "reached_pg"
                     },
                     {
                         "time": "2015-06-19 12:37:55.095731",
                         "event": "started"
                     },
                     {
                         "time": "2015-06-19 12:37:55.103645",
                         "event": "started"
                     },
                     {
                         "time": "2015-06-19 12:37:55.104125",
                         "event": "commit_queued_for_journal_write"
                     },
                     {
                         "time": "2015-06-19 12:37:55.104900",
                         "event": "write_thread_in_journal_buffer"
                     },
                     {
                         "time": "2015-06-19 12:37:55.106112",
                         "event": "journaled_completion_queued"
                     },
                     {
                         "time": "2015-06-19 12:37:55.107065",
                         "event": "sub_op_committed"
                     },
                     {
                         "time": "2015-06-19 12:37:55.117510",
                         "event": "commit_sent"
                     },
                     {
                         "time": "2015-06-19 12:37:55.187676",
                         "event": "sub_op_applied"
                     },
                     {
                         "time": "2015-06-19 12:38:33.097998",
                         "event": "done"
                     }
                 ]
             ]
         }
     ]
}

There is a very large wait time between 'sub_op_applied' and 'done'.
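
For what it's worth, a small helper along the lines of the sketch below makes those gaps easy to spot: it pulls dump_historic_ops from an OSD admin socket and prints the delta between consecutive events. This is a rough sketch only; it assumes the list-style type_data layout quoted above, the top-level key name varies between releases, and the OSD id is a placeholder (it must be run on the host carrying that OSD).

#!/usr/bin/env python
# Rough helper: dump historic ops from one OSD and show where each op
# spent its time.  Assumes the list-style "type_data" layout quoted above;
# the top-level key ("ops" vs "Ops") differs between releases, so try both.
import json
import subprocess
from datetime import datetime

OSD_ID = 0   # placeholder -- run on the host carrying this OSD

def parse_ts(ts):
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')

raw = subprocess.check_output(
    ['ceph', 'daemon', 'osd.%d' % OSD_ID, 'dump_historic_ops'])
dump = json.loads(raw)

for op in dump.get('ops', dump.get('Ops', [])):
    print('%s  duration=%.3fs' % (op['description'], op['duration']))
    prev = None
    for ev in op['type_data'][2]:          # [flag point, client info, events]
        t = parse_ts(ev['time'])
        delta = (t - prev).total_seconds() if prev else 0.0
        print('  %-40s +%.3fs' % (ev['event'], delta))
        prev = t
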


My Ceph cluster is made up of 5 machines with 13 OSDs each, with 10Gb links between the Ceph nodes on both the public and private networks.
There are 5 load generation blades connected with a 20Gb aggregate connection.

When you say 20Gb/s aggregate, does that mean you are using some kind of bonding config?

Watching dstat on these machines whilst the tests are running shows they aren’t being highly stressed.

I have managed to push a lot more through the system, but not when worrying about latency; to do this I tend to run one IOContext and continuously fire aio_write_full operations on it.


Does anyone have any suggestion for how to get better performance out of my cluster?

Regarding latency, there are certainly some dark corners in the Ceph code itself, but this seems like something else to me.

One thing that's bitten us a couple of times is the behavior of the network during all-to-all communication. With erasure coding it's especially important because you have more machines involved in every operation and your latency is only as fast as the slowest connection. It might be worth running some all-to-all network communication tests. We've talked about writing something like this right into Ceph (or at least into cbt).

Here's a script you might want to hack up. This is probably what I'll use as a base for integrating network and disk tests into cbt. The iperf part is what you'd primarily be interested in. Be careful with those fio tests; they can be destructive!

#!/bin/bash
archive_dir=$1
t=120
tmp_dir="/tmp/iperf_tests"
start_host=129
end_host=160
hosts="foo[$start_host-$end_host].domain.com"
fio_size="4M"
fio_rw="randread"

mkdir -p $archive_dir
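# kill any stale iperf/fio runs, start an iperf server on every host, and reset the temp dir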
pdsh -R ssh -w $hosts "killall -9 iperf"
pdsh -R ssh -w $hosts "killall -9 fio"
pdsh -R ssh -w $hosts "iperf -s" &
pdsh -R ssh -w $hosts "rm -rf $tmp_dir"
pdsh -R ssh -w $hosts "mkdir -p $tmp_dir"

case "$2" in

"fio")
mkdir -p $archive_dir/fio_only_${fio_size}${fio_rw}
pdsh -R ssh -w $hosts "mkdir -p $tmp_dir/fio_only"
# run fio against the raw devices on every host, then collect each host's output file
pdsh -R ssh -w $hosts -f 64 "sudo fio --rw=$fio_rw --ioengine=libaio --numjobs=1 --direct=1 --runtime=$t --bs=$fio_size --iodepth=16 --name=/dev/sdb --name=/dev/sdc --name=/dev/sdd --output $tmp_dir/fio_only/output"
rpdcp -R ssh -w $hosts $tmp_dir/fio_only/output* $archive_dir/fio_only_${fio_size}${fio_rw}
;;

"iperf")
mkdir -p $archive_dir/iperf_only
pdsh -R ssh -w $hosts "mkdir -p $tmp_dir/iperf_only"

for val in $(eval echo {$start_host..$end_host})
do
pdsh -R ssh -w $hosts -f 64 "iperf -c foo${val}.domain.com -f m -t $t -P 4 "'>'" $tmp_dir/iperf_only/"'`hostname -s`'"_to_foo${val}.out" &
done
sleep $t
rpdcp -R ssh -w $hosts $tmp_dir/iperf_only/*.out $archive_dir/iperf_only
;;
esac





Bryn
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




