Replication strategy, write throughput

Andreas Gerstmayr <andreas.gerstmayr@xxxxxxxxx> · Fri, 4 Nov 2016 17:10:31 +0100

Hello,

I'd like to understand how replication works.
The paper [1] describes several replication strategies, and
according to a (somewhat old) mailing list post [2], primary-copy is used.
Therefore the primary OSD waits until the object is persisted and then
updates all replicas in parallel.

Current cluster setup:
Ceph jewel 10.2.3
6 storage nodes
24 HDDs each, journal on same disk [3]
Frontend network: 10 Gbit/s
Backend network: 2 x 10 Gbit/s bonded with layer3+4 hashing [4]
CephFS with striping: 1M stripe unit, 10 stripe count, 10M object size

My assumption was that it should make no difference whether I write
with replication 2 or 3, because each storage node can accept 10 Gbit/s
of traffic from the frontend network and simultaneously send 10 Gbit/s
of traffic to each of two other storage nodes.
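
To make this assumption explicit, here is a rough sketch of the network-only
upper bound. It uses only the figures from the setup above and assumes that
both bonded backend links are fully usable and that the primary forwards one
copy per additional replica; the function name is just for illustration.

```python
# Network-only upper bound on aggregate client write throughput.
# Assumptions: decimal units, both bonded backend links fully usable,
# and the primary OSD forwards (replicas - 1) copies over the backend.
GBIT_TO_MB = 1000 / 8        # 1 Gbit/s = 125 MB/s

NODES = 6
FRONTEND_GBIT = 10           # per node, traffic from clients
BACKEND_GBIT = 2 * 10        # per node, bonded backend links


def network_bound_mb_s(replicas: int) -> float:
    """Return the aggregate bound in MB/s for a given replication factor."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    copies = replicas - 1
    per_node_gbit = FRONTEND_GBIT
    if copies > 0:
        # Backend egress has to carry every forwarded copy.
        per_node_gbit = min(FRONTEND_GBIT, BACKEND_GBIT / copies)
    return per_node_gbit * NODES * GBIT_TO_MB


if __name__ == "__main__":
    for r in (1, 2, 3):
        print(f"replication {r}: {network_bound_mb_s(r):.0f} MB/s")
```

Under these assumptions the bound is the same for replication 1, 2 and 3,
which is why I expected no difference between them.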

Disk write capacity shouldn't be a problem either:
200 MB/s throughput * 6 nodes * 24 disks / 2 (journal) / 3 replicas = 4800 MB/s.
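
The same estimate for each replication factor, as a small sketch. The
numbers are the ones from the formula above; the function name is
hypothetical, and the factor 2 models the journal sharing the data disk.

```python
# Disk-only upper bound on aggregate client write throughput.
DISK_MB_S = 200          # per-disk sequential write rate measured with fio [3]
NODES = 6
DISKS_PER_NODE = 24
JOURNAL_FACTOR = 2       # each write hits the disk twice (journal + data)


def disk_bound_mb_s(replicas: int) -> float:
    """Return the aggregate bound in MB/s for a given replication factor."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    raw = DISK_MB_S * NODES * DISKS_PER_NODE
    return raw / JOURNAL_FACTOR / replicas


if __name__ == "__main__":
    for r in (1, 2, 3):
        print(f"replication {r}: {disk_bound_mb_s(r):.0f} MB/s")
```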

Results with 7 clients:
Replication 1: 5695.33 MB/s
Replication 2: 3337.09 MB/s
Replication 3: 1898.17 MB/s

Replication 2 is a bit more than 1/2 of Replication 1 (about 59%), and
Replication 3 is almost exactly 1/3 of Replication 1.
Any hints as to what the bottleneck is in this case?
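
For reference, the ratios can be checked with a few lines of Python. This is
only arithmetic on the three measurements above; the helper names are made up
for illustration, and "total written" simply means client throughput times
the replica count, ignoring journal writes.

```python
# Measured aggregate client throughput in MB/s, keyed by replication factor.
MEASURED = {1: 5695.33, 2: 3337.09, 3: 1898.17}


def ratio_to_baseline(replicas: int) -> float:
    """Throughput relative to the replication 1 result."""
    return MEASURED[replicas] / MEASURED[1]


def total_written_mb_s(replicas: int) -> float:
    """Data written across all replicas per second (journal not counted)."""
    return MEASURED[replicas] * replicas


if __name__ == "__main__":
    for r in MEASURED:
        print(
            f"replication {r}: {ratio_to_baseline(r):.3f} of replication 1, "
            f"{total_written_mb_s(r):.2f} MB/s written in total"
        )
```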

[1] http://ceph.com/papers/weil-rados-pdsw07.pdf
[2] http://www.spinics.net/lists/ceph-devel/msg02420.html
[3] Test with fio --name=job --ioengine=libaio --rw=write
--blocksize=1M --size=30G --direct=1 --sync=1 --iodepth=128
--filename=/dev/sdw gives about 200 MB/s (test for journal writes)
[4] Test with iperf3: one storage node connecting to two other nodes via
their backend IPs gives 10 Gbit/s throughput for each connection

Thanks,
Andreas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com