Hello,

On Fri, 4 Nov 2016 17:10:31 +0100 Andreas Gerstmayr wrote:

> Hello,
>
> I'd like to understand how replication works.
> In the paper [1] several replication strategies are described, and
> according to a (bit old) mailing list post [2] primary-copy is used.
> Therefore the primary OSD waits until the object is persisted and then
> updates all replicas in parallel.
>
Given that nobody who actually knows the answers to this has piped up, I
shall give it a go.

The "persisted" above means "written to the journal", and the parallel
bit is quite the mystery to me as well.
As in, it's clearly not sequential, but one replica write most likely gets
kicked off before the other and I for one am uncertain how much delay
that entails.
But it's unlikely to be the culprit, see below.

> Current cluster setup:
> Ceph jewel 10.2.3
> 6 storage nodes
> 24 HDDs each, journal on same disk [3]
>
Unrelated to the topic at hand, putting the journal on the same HDD is a
recipe for pain.
Your fio result against a single HDD is fine (quite good actually), but
it ignores a lot of the complexity of real (Ceph) life.

> Frontend network: 10 Gbit/s
> Backend network: 2 x 10 Gbit/s bonded with layer3+4 hashing [4]
>
Bandwidth is one thing, latency another.
With some setups, and some switches in particular, LACP can add
noticeable latency.

> CephFS with striping: 1M stripe unit, 10 stripe count, 10M object size
>
I know nothing about CephFS and how this striping relates to actual RADOS
striping, but having smaller stripe units will usually be better for
latency, not bandwidth.

> My assumption was that there should be no difference whether I write
> to replication 2 or 3, because each storage node can accept 10 Gbit/s
> traffic from the frontend network and send 10 Gbit/s traffic
> simultaneously to two other storage nodes.
>
Theory, meet reality.

The biggest impact when increasing the replication size is latency (even
when going from 2 to 3, thus my question marks above about how parallel
these things really are).
So while that will affect your large sequential writes somewhat, it's not
what's causing your results.

> Disk write capacity shouldn't be a problem either:
> 200 MB/s throughput * 6 nodes * 24 disks / 2 (journal) / 3 replicas
> = 4800 MB/s.
>
This is where your expectations and estimations start to fall apart.
The above is overly optimistic and simplistic. It assumes:

1. Single, sequential writes over the whole cluster (the biggest flaw).
2. Sequential writes within the OSD (they're not; they are 4MB, or in
   your case maybe 1MB, individual file writes).
3. No overhead from FS journaling and the resulting double writes and
   seeks.

I personally feel happy when individual HDD-based OSDs WITH SSD journals
can maintain about 60MB/s writes.
Looking at your results below, that number seems to be around 40MB/s for
you in the single-replication test and thus should be the baseline for
your expectations.

As always, I recommend running atop or the like on all your nodes with a
short (5s or less) interval in huge terminals when doing such tests.
You'll see how busy your individual OSDs are and that some are going to
be near or at their maximum capacity.

With that in mind, remember that Ceph doesn't guarantee equal
distribution of anything and that the slowest OSD will determine the
speed of the whole thing (eventually).
So even with 144 OSDs you'll likely have some OSDs that are busier
(receiving primary and secondary writes at the same time) than others,
slowing things down for everyone.
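To make that visible, something along these lines in a terminal per
storage node while the test runs (just a sketch, assuming the sysstat
package is installed for iostat; adjust intervals and tools to taste):

  # per-disk utilization and MB/s written, refreshed every 5 seconds
  iostat -xm 5

  # or atop with a 5 second interval, then watch the DSK lines
  atop 5

If some OSD disks sit at or near 100% busy while others are mostly idle,
that's the uneven distribution doing its thing.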
Again, IF you're testing against CephFS and IF the striping does what I
think it does, you're creating more I/O and thus busier OSDs than a
default setup would.

>
> Results with 7 clients:
>
Details please.
As in, what test (fio/dd?) did you use (exact command line), against what
(CephFS, mounted how?).

> Replication 1: 5695.33 MB/s
> Replication 2: 3337.09 MB/s
> Replication 3: 1898.17 MB/s
>
> Replication 2 is about 1/2 of Replication 1, and Replication 3 is
> exactly 1/3 of Replication 1.
> Any hints what the bottleneck is in this case?
>
As hopefully made clear above, primarily your OSD bandwidth (rough
numbers in the P.S. at the bottom).

Christian

>
> [1] http://ceph.com/papers/weil-rados-pdsw07.pdf
> [2] http://www.spinics.net/lists/ceph-devel/msg02420.html
> [3] Test with fio --name=job --ioengine=libaio --rw=write
> --blocksize=1M --size=30G --direct=1 --sync=1 --iodepth=128
> --filename=/dev/sdw gives about 200 MB/s (test for journal writes)
> [4] Test with iperf3, 1 storage node connecting to 2 other nodes on
> their backend IPs gives 10 Gbit/s throughput for each connection
>
>
> Thanks,
> Andreas

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
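P.S.: To put rough numbers on the "OSD bandwidth" point (back-of-the-
envelope only, assuming writes spread evenly over all 144 OSDs, which
they won't be exactly):

  5695 MB/s (replication 1) / 144 OSDs ~ 40 MB/s effective per OSD
  40 MB/s x 144 OSDs / 2 replicas      ~ 2880 MB/s (you measured 3337)
  40 MB/s x 144 OSDs / 3 replicas      ~ 1920 MB/s (you measured 1898)

The aggregate OSD write capacity is a fixed pie and every additional
replica takes another slice of it, which is why you see the near-perfect
1/2 and 1/3 scaling.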