Hi All,

I am currently benchmarking Ceph to work out the right read/write model for optimal cluster throughput and latency. For the moment I am writing 4 MB objects with randomised names to an EC 4+1 pool using the rados Python interface. Load generation happens on external machines. The write pattern is characterised by the number of IOContexts and the number of simultaneous async writes on each of those contexts (a rough sketch of the write loop is included after the results below).

With one machine running IOContext threads with 50 simultaneous writes per context, over a ~300 second run I get:

Percentile 5 = 0.133775639534
Percentile 10 = 0.178686833382
Percentile 15 = 0.180827605724
Percentile 20 = 0.185487747192
Percentile 25 = 0.229317903519
Percentile 30 = 0.23066740036
Percentile 35 = 0.232764816284
Percentile 40 = 0.278827047348
Percentile 45 = 0.280579996109
Percentile 50 = 0.283169865608
Percentile 55 = 0.329843044281
Percentile 60 = 0.332481050491
Percentile 65 = 0.380337607861
Percentile 70 = 0.428911447525
Percentile 75 = 0.438932359219
Percentile 80 = 0.530071306229
Percentile 85 = 0.597331762314
Percentile 90 = 0.735066819191
Percentile 95 = 1.08006491661
Percentile 100 = 11.7352428436
Max latency = 11.7352428436, min = 0.0499050617218, mean = 0.43913059745
Total objects written = 24552 in 302.979903936 s, giving 81.0350775118/s (324.140310047 MB/s)

From two load generators on separate machines I get:

Generator 1:
Percentile 5 = 0.228541088104
Percentile 10 = 0.23213224411
Percentile 15 = 0.279508590698
Percentile 20 = 0.28137254715
Percentile 25 = 0.328829288483
Percentile 30 = 0.330499911308
Percentile 35 = 0.334045898914
Percentile 40 = 0.380131435394
Percentile 45 = 0.382810294628
Percentile 50 = 0.430188417435
Percentile 55 = 0.43399245739
Percentile 60 = 0.48120136261
Percentile 65 = 0.530511438847
Percentile 70 = 0.580485081673
Percentile 75 = 0.631661534309
Percentile 80 = 0.728989124298
Percentile 85 = 0.830820584297
Percentile 90 = 1.03238985538
Percentile 95 = 1.62925363779
Percentile 100 = 32.5414278507
Max latency = 32.5414278507, min = 0.0375339984894, mean = 0.863403101415
Total objects written = 12714 in 325.92741394 s, giving 39.0086855422/s (156.034742169 MB/s)

Generator 2:
Percentile 5 = 0.229072237015
Percentile 10 = 0.247376871109
Percentile 15 = 0.280901908875
Percentile 20 = 0.329082489014
Percentile 25 = 0.331234931946
Percentile 30 = 0.379406833649
Percentile 35 = 0.381390666962
Percentile 40 = 0.429595994949
Percentile 45 = 0.43164896965
Percentile 50 = 0.480262041092
Percentile 55 = 0.529169607162
Percentile 60 = 0.533170747757
Percentile 65 = 0.582635164261
Percentile 70 = 0.634325170517
Percentile 75 = 0.72939991951
Percentile 80 = 0.829002094269
Percentile 85 = 0.931713819504
Percentile 90 = 1.18014221191
Percentile 95 = 2.08048944473
Percentile 100 = 31.1357450485
Max latency = 31.1357450485, min = 0.0553231239319, mean = 1.03054529335
Total objects written = 10769 in 328.515608788 s, giving 32.7807863978/s (131.123145591 MB/s)

Combined total = ~287 MB/s

The combined test shows much higher latencies and less than half the throughput per box. If I scale this up to 5 nodes all generating load, throughput drops to ~50 MB/s and latencies climb to as much as 60 seconds.
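For reference, the write loop is roughly the shape below. This is a simplified sketch rather than the exact script: the pool name, ceph.conf path, thread count and run time are illustrative placeholders, but it shows the model described above of several IOContexts each keeping a fixed number of aio_write_full() operations in flight and recording per-object completion latency.

#!/usr/bin/env python
# Simplified sketch of the load generator (not the exact benchmark script).
# Assumptions: pool name 'ecpool_4_1', default ceph.conf location, and the
# thread / in-flight counts below are placeholders.
import rados
import threading
import time
import uuid

POOL = 'ecpool_4_1'               # placeholder pool name
NUM_IOCTX = 4                     # IOContexts per load generator (placeholder)
IN_FLIGHT = 50                    # simultaneous async writes per context
DATA = b'\0' * (4 * 1024 * 1024)  # 4 MB object payload
RUNTIME = 300.0                   # seconds

latencies = []
lat_lock = threading.Lock()

def writer(cluster):
    # One IOContext keeping up to IN_FLIGHT aio_write_full() ops outstanding.
    ioctx = cluster.open_ioctx(POOL)
    slots = threading.Semaphore(IN_FLIGHT)
    end = time.time() + RUNTIME

    def make_cb(start):
        # Completion callback: record latency and free an in-flight slot.
        # oncomplete fires on ack; pass onsafe= instead to time on-disk commit.
        def cb(completion):
            with lat_lock:
                latencies.append(time.time() - start)
            slots.release()
        return cb

    while time.time() < end:
        slots.acquire()  # blocks once IN_FLIGHT writes are outstanding
        ioctx.aio_write_full(str(uuid.uuid4()),  # randomised object name
                             DATA,
                             oncomplete=make_cb(time.time()))

    # Drain the remaining in-flight writes before closing the context.
    for _ in range(IN_FLIGHT):
        slots.acquire()
    ioctx.close()

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
threads = [threading.Thread(target=writer, args=(cluster,))
           for _ in range(NUM_IOCTX)]
for t in threads:
    t.start()
for t in threads:
    t.join()
cluster.shutdown()

# Crude percentile report over the collected completion latencies.
latencies.sort()
for p in range(5, 101, 5):
    idx = min(len(latencies) - 1, int(len(latencies) * p / 100.0))
    print('Percentile %d = %s' % (p, latencies[idx]))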
An example slow write from dump_historic_ops is:

{
    "description": "osd_op(client.1892123.0:1525 /c18/vx1907/kDDb/180/4935.ts [] 6.f4d68aae ack+ondisk+write+known_if_redirected e523)",
    "initiated_at": "2015-06-19 12:37:54.698848",
    "age": 578.438516,
    "duration": 38.399151,
    "type_data": [
        "commit sent; apply or cleanup",
        {
            "client": "client.1892123",
            "tid": 1525
        },
        [
            { "time": "2015-06-19 12:37:54.698848", "event": "initiated" },
            { "time": "2015-06-19 12:37:54.856361", "event": "reached_pg" },
            { "time": "2015-06-19 12:37:55.095731", "event": "started" },
            { "time": "2015-06-19 12:37:55.103645", "event": "started" },
            { "time": "2015-06-19 12:37:55.104125", "event": "commit_queued_for_journal_write" },
            { "time": "2015-06-19 12:37:55.104900", "event": "write_thread_in_journal_buffer" },
            { "time": "2015-06-19 12:37:55.106112", "event": "journaled_completion_queued" },
            { "time": "2015-06-19 12:37:55.107065", "event": "sub_op_committed" },
            { "time": "2015-06-19 12:37:55.117510", "event": "commit_sent" },
            { "time": "2015-06-19 12:37:55.187676", "event": "sub_op_applied" },
            { "time": "2015-06-19 12:38:33.097998", "event": "done" }
        ]
    ]
}

There is a very large wait (roughly 38 seconds) between 'sub_op_applied' and 'done'.

My Ceph cluster is made up of 5 machines with 13 OSDs each, with 10Gb links between the Ceph nodes on both the public and private networks. The 5 load-generation blades are connected via a 20Gb aggregate link. Watching dstat on these machines while the tests are running shows they aren't being highly stressed.

I have managed to push a lot more through the system, but not when worrying about latency; to do that I tend to run a single IOContext and continuously fire aio_write_full operations on it.

Does anyone have any suggestions for how to get better performance out of my cluster?

Bryn
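P.S. For anyone poking at the same thing: the gap above stands out immediately if you walk the event list in 'type_data' and print the time spent before each event. A rough sketch follows; it expects the JSON for a single op (as pasted above) on stdin, so the top-level dump_historic_ops wrapper needs unwrapping first.

#!/usr/bin/env python
# Rough helper sketch: given one op from dump_historic_ops (the object
# pasted above), print the time elapsed before each event so the slow
# step stands out. The event list is the last element of "type_data".
import json
import sys
from datetime import datetime

def parse_ts(ts):
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f')

def print_event_gaps(op):
    events = op['type_data'][-1]
    prev = None
    for ev in events:
        t = parse_ts(ev['time'])
        if prev is None:
            print('%10s  %s' % ('', ev['event']))
        else:
            print('%9.3fs  %s' % ((t - prev).total_seconds(), ev['event']))
        prev = t

if __name__ == '__main__':
    print_event_gaps(json.load(sys.stdin))

For the op above this prints ~0.07s before 'sub_op_applied' and ~37.9s before 'done', with every other step well under a second.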