Hi All,

I am currently benchmarking Ceph to work out the right read/write model for optimal cluster throughput and latency. For the moment I am writing 4 MB objects with randomised names to an EC 4+1 pool using the rados Python interface. Load generation happens on external machines. The write pattern is characterised by the number of IOContexts and the number of simultaneous async writes on each of those contexts (a rough sketch of the write loop is included after the results below).

With one machine running IOContext threads with 50 simultaneous writes per context, over a ~300 second run I get:

Percentile 5 = 0.133775639534
Percentile 10 = 0.178686833382
Percentile 15 = 0.180827605724
Percentile 20 = 0.185487747192
Percentile 25 = 0.229317903519
Percentile 30 = 0.23066740036
Percentile 35 = 0.232764816284
Percentile 40 = 0.278827047348
Percentile 45 = 0.280579996109
Percentile 50 = 0.283169865608
Percentile 55 = 0.329843044281
Percentile 60 = 0.332481050491
Percentile 65 = 0.380337607861
Percentile 70 = 0.428911447525
Percentile 75 = 0.438932359219
Percentile 80 = 0.530071306229
Percentile 85 = 0.597331762314
Percentile 90 = 0.735066819191
Percentile 95 = 1.08006491661
Percentile 100 = 11.7352428436
Max latency = 11.7352428436, min = 0.0499050617218, mean = 0.43913059745
Total objects written = 24552 in 302.979903936 s, giving 81.0350775118/s (324.140310047 MB/s)

From two load generators on separate machines I get:

Generator 1:
Percentile 5 = 0.228541088104
Percentile 10 = 0.23213224411
Percentile 15 = 0.279508590698
Percentile 20 = 0.28137254715
Percentile 25 = 0.328829288483
Percentile 30 = 0.330499911308
Percentile 35 = 0.334045898914
Percentile 40 = 0.380131435394
Percentile 45 = 0.382810294628
Percentile 50 = 0.430188417435
Percentile 55 = 0.43399245739
Percentile 60 = 0.48120136261
Percentile 65 = 0.530511438847
Percentile 70 = 0.580485081673
Percentile 75 = 0.631661534309
Percentile 80 = 0.728989124298
Percentile 85 = 0.830820584297
Percentile 90 = 1.03238985538
Percentile 95 = 1.62925363779
Percentile 100 = 32.5414278507
Max latency = 32.5414278507, min = 0.0375339984894, mean = 0.863403101415
Total objects written = 12714 in 325.92741394 s, giving 39.0086855422/s (156.034742169 MB/s)

Generator 2:
Percentile 5 = 0.229072237015
Percentile 10 = 0.247376871109
Percentile 15 = 0.280901908875
Percentile 20 = 0.329082489014
Percentile 25 = 0.331234931946
Percentile 30 = 0.379406833649
Percentile 35 = 0.381390666962
Percentile 40 = 0.429595994949
Percentile 45 = 0.43164896965
Percentile 50 = 0.480262041092
Percentile 55 = 0.529169607162
Percentile 60 = 0.533170747757
Percentile 65 = 0.582635164261
Percentile 70 = 0.634325170517
Percentile 75 = 0.72939991951
Percentile 80 = 0.829002094269
Percentile 85 = 0.931713819504
Percentile 90 = 1.18014221191
Percentile 95 = 2.08048944473
Percentile 100 = 31.1357450485
Max latency = 31.1357450485, min = 0.0553231239319, mean = 1.03054529335
Total objects written = 10769 in 328.515608788 s, giving 32.7807863978/s (131.123145591 MB/s)

Combined total = ~287 MB/s

The combined test shows much higher latencies and less than half the throughput per box. If I scale this up to 5 nodes all generating load, throughput drops to ~50 MB/s and latencies climb to as much as 60 seconds.
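For reference, the write loop is roughly the shape below. This is a simplified sketch rather than the exact script: the pool name, ceph.conf path, thread count and run time are illustrative placeholders, but it shows the model described above of several IOContexts each keeping a fixed number of aio_write_full() operations in flight and recording per-object completion latency.

#!/usr/bin/env python
# Simplified sketch of the load generator (not the exact benchmark script).
# Assumptions: pool name 'ecpool_4_1', default ceph.conf location, and the
# thread / in-flight counts below are placeholders.
import rados
import threading
import time
import uuid

POOL = 'ecpool_4_1'               # placeholder pool name
NUM_IOCTX = 4                     # IOContexts per load generator (placeholder)
IN_FLIGHT = 50                    # simultaneous async writes per context
DATA = b'\0' * (4 * 1024 * 1024)  # 4 MB object payload
RUNTIME = 300.0                   # seconds

latencies = []
lat_lock = threading.Lock()

def writer(cluster):
    # One IOContext keeping up to IN_FLIGHT aio_write_full() ops outstanding.
    ioctx = cluster.open_ioctx(POOL)
    slots = threading.Semaphore(IN_FLIGHT)
    end = time.time() + RUNTIME

    def make_cb(start):
        # Completion callback: record latency and free an in-flight slot.
        # oncomplete fires on ack; pass onsafe= instead to time on-disk commit.
        def cb(completion):
            with lat_lock:
                latencies.append(time.time() - start)
            slots.release()
        return cb

    while time.time() < end:
        slots.acquire()  # blocks once IN_FLIGHT writes are outstanding
        ioctx.aio_write_full(str(uuid.uuid4()),  # randomised object name
                             DATA,
                             oncomplete=make_cb(time.time()))

    # Drain the remaining in-flight writes before closing the context.
    for _ in range(IN_FLIGHT):
        slots.acquire()
    ioctx.close()

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
threads = [threading.Thread(target=writer, args=(cluster,))
           for _ in range(NUM_IOCTX)]
for t in threads:
    t.start()
for t in threads:
    t.join()
cluster.shutdown()

# Crude percentile report over the collected completion latencies.
latencies.sort()
for p in range(5, 101, 5):
    idx = min(len(latencies) - 1, int(len(latencies) * p / 100.0))
    print('Percentile %d = %s' % (p, latencies[idx]))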
An example slow write from dump_historic_ops is:

{
    "description": "osd_op(client.1892123.0:1525 /c18/vx1907/kDDb/180/4935.ts [] 6.f4d68aae ack+ondisk+write+known_if_redirected e523)",
    "initiated_at": "2015-06-19 12:37:54.698848",
    "age": 578.438516,
    "duration": 38.399151,
    "type_data": [
        "commit sent; apply or cleanup",
        {
            "client": "client.1892123",
            "tid": 1525
        },
        [
            { "time": "2015-06-19 12:37:54.698848", "event": "initiated" },
            { "time": "2015-06-19 12:37:54.856361", "event": "reached_pg" },
            { "time": "2015-06-19 12:37:55.095731", "event": "started" },
            { "time": "2015-06-19 12:37:55.103645", "event": "started" },
            { "time": "2015-06-19 12:37:55.104125", "event": "commit_queued_for_journal_write" },
            { "time": "2015-06-19 12:37:55.104900", "event": "write_thread_in_journal_buffer" },
            { "time": "2015-06-19 12:37:55.106112", "event": "journaled_completion_queued" },
            { "time": "2015-06-19 12:37:55.107065", "event": "sub_op_committed" },
            { "time": "2015-06-19 12:37:55.117510", "event": "commit_sent" },
            { "time": "2015-06-19 12:37:55.187676", "event": "sub_op_applied" },
            { "time": "2015-06-19 12:38:33.097998", "event": "done" }
        ]
    ]
}

There is a very large wait (roughly 38 seconds) between 'sub_op_applied' and 'done'.

My Ceph cluster is made up of 5 machines with 13 OSDs each, with 10Gb links between the Ceph nodes on both the public and private networks. The 5 load-generation blades are connected via a 20Gb aggregate link. Watching dstat on these machines while the tests are running shows they aren't being highly stressed.

I have managed to push a lot more through the system, but not when worrying about latency; to do that I tend to run a single IOContext and continuously fire aio_write_full operations on it.

Does anyone have any suggestions for how to get better performance out of my cluster?

Bryn
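P.S. For anyone poking at the same thing: the gap above stands out immediately if you walk the event list in 'type_data' and print the time spent before each event. A rough sketch follows; it expects the JSON for a single op (as pasted above) on stdin, so the top-level dump_historic_ops wrapper needs unwrapping first.

#!/usr/bin/env python
# Rough helper sketch: given one op from dump_historic_ops (the object
# pasted above), print the time elapsed before each event so the slow
# step stands out. The event list is the last element of "type_data".
import json
import sys
from datetime import datetime

def parse_ts(ts):
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')

def print_event_gaps(op):
    events = op['type_data'][-1]
    prev = None
    for ev in events:
        t = parse_ts(ev['time'])
        if prev is None:
            print('%10s  %s' % ('', ev['event']))
        else:
            print('%9.3fs  %s' % ((t - prev).total_seconds(), ev['event']))
        prev = t

if __name__ == '__main__':
    print_event_gaps(json.load(sys.stdin))

For the op above this prints ~0.07s before 'sub_op_applied' and ~37.9s before 'done', with every other step well under a second.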