Re: Comparison of 3 replication models on Pech OSD cluster

Hi Roman,

It's always really interesting to read your messages :) Maybe you'll join our Telegram chat @ceph_ru? One of your colleagues is there :)

Client-based replication is of course the fastest, but the problem is that it's unclear how to provide consistency with it. Maybe it's possible with some restrictions, but... I don't think it's possible in Ceph :)

By the way, have you tested their Crimson OSD? Is it any faster than the current implementation? (regarding iodepth=1, fsync=1 latency)

Hi all,

I would like to share a comparison of 3 replication models based on a
Pech OSD [1] cluster, which supports the bare minimum needed to
replicate transactions from OSD to OSD and keeps all data mutations in
memory (memstore).

My goal was to compare the "primary-copy", "chain" and "client-based"
replication models and answer the question of how each model affects
network performance.

For this estimation I chose to implement my own bare-minimum OSD
(laborious, but worth it) whose design is similar to Crimson OSD, but
whose core is based on sources from the kernel libceph implementation
(i.e. messenger, osdmap, mon_client, etc.) and is thus written in pure C.

-- What Pech OSD supports and what does not --

Comparing the network response under different replication scenarios
does not require fail-over (we assume that during testing storing data
in memory never fails, hosts never crash, etc.), so to ease development
Pech OSD does not support peering and fail-over in the current state of
the code. Object modifications are replicated on each mutation, yes,
but the cluster is not able to return to a consistent state after an
error.

Pech OSD supports RBD images, so an image can be accessed from
userspace librbd or mapped via the kernel RBD block device. That is
the bare minimum I need to run FIO loads and test network behavior.

-- What I test --

Originally my goal was to compare performance under the same loads but
using different replication models: "client-based", "primary-copy" and
"chain". I want to see what numbers the different models can deliver in
terms of network bandwidth, latency and IOPS (and when we compare
replication models, the network is the only factor that impacts the
overall performance).

Shortly about replication models:

"client-based" - client itself is responsible for sending requests to
   replicas. To test this model OSD client code was modified on
   userspace [2] and kernel [3] sides.  Pros: savings on network hops
   which reduces latency. Cons: complications in replication algorithm
   when PG is not healthy, complications in replication algorithm when
   there is a concurrent access to the same object from divers
   clients, client network should be fat enough.

"chain" - client sends write request to primary, primary forwards to
   the next secondary, and so on. Final ACK from last replica in chain
   reaches primary or client directly. Pros: each OSD sends a request
   only once, which reduces load on network for particular node and
   spreads load. Overall bandwidth should increase. Cons: sequential
   requests processing, which should impacts latency.

"primary-copy" - default and the only one model for Ceph: client
   accesses primary replica, primary replica fans out data to
   secondaries. Pros: already implemented.  Cons: higher latency
   comparing to "client-based", lower bandwidth comparing to "chain".

What is said above is the theory that motivated me to prove or
disprove it with numbers on a real cluster.
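
To make the hop-count part of that theory concrete, here is a toy
model (my own sketch, nothing from the Pech sources) that simply
counts the sequential network legs a single write sees under each
model:

  /* Toy model: sequential network legs per write for each replication
   * model, replication factor R. Parallel sends count as one leg. */
  #include <stdio.h>

  int main(void)
  {
      int R = 3; /* replication factor used in the tests below */

      /* client fans out to all R replicas in parallel: request + ACK */
      int client_based = 2;

      /* client -> primary, primary fans out to the R-1 secondaries in
       * parallel, ACKs travel the same path back */
      int primary_copy = 4;

      /* client -> primary -> ... -> last replica (R legs), the final
       * ACK goes from the last replica to the client directly */
      int chain = R + 1;

      printf("R=%d: client-based=%d primary-copy=%d chain=%d legs\n",
             R, client_based, primary_copy, chain);
      return 0;
  }

Note that for the replication factor 3 used in the tests,
"primary-copy" and "chain" end up with the same sequential path length
(4 legs), so close numbers between the two should not be a surprise;
the chain path only gets longer for bigger replication factors.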

-- How I test --

I have a cluster at my disposal: 5 hosts with a 100Gbit/s network for
the OSDs and 8 client hosts with a 25Gbit/s network.

Each OSD host has 24 CPUs, so for obvious reasons each host runs 24
OSDs, i.e. (24x5) 120 Pech OSDs for the whole cluster setup.

There is one fully declustered pool with 1024 PGs (I want to spread
the load as much as possible). The pool is created with a 3x
replication factor.

Each client starts 16 FIO jobs doing random writes to 16 RBD images
(userspace RBD client) with various block sizes, i.e. one FIO job per
image and 128 (16x8) jobs in total. Each client host runs an FIO
server; all data from all servers is aggregated by the FIO client and
stored in JSON format. There is a convenient Python script [4] which
generates and runs the FIO jobs, parses the JSON results and outputs
them as a human-readable pretty table.

Major FIO options:

ioengine=rbd
clientname=admin
pool=rbd

rw=randwrite
size=256m

time_based=1
runtime=10
ramp_time=10

iodepth=32
numjobs=1
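
For reference, one complete generated job might look roughly as
follows (a sketch: the image name is hypothetical, the real jobs come
from the script [4]):

  [img-01]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=img-01
  rw=randwrite
  bs=4k
  size=256m
  time_based=1
  runtime=10
  ramp_time=10
  iodepth=32
  numjobs=1

Each client host runs "fio --server", and the driving side submits the
jobs with "fio --client=<host> <jobfile> --output-format=json".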

During all the tests I collected almost 1GB of JSON results, plenty
for a good analysis.

-- Results --

First I would like to compare "primary-copy" and "chain" on Pech OSD:

120OSDS/pech/primary-copy

         write/iops    write/bw   write/clat_ns/mean
    4k     365.89 K   1.40 GB/s             11.11 ms
    8k     330.51 K   2.52 GB/s             12.22 ms
   16k     274.06 K   4.19 GB/s             14.79 ms
   32k     204.36 K   6.25 GB/s             19.95 ms
   64k     141.78 K   8.68 GB/s             28.54 ms
  128k      70.42 K   8.64 GB/s             58.99 ms
  256k      37.75 K   9.30 GB/s            109.75 ms
  512k      17.46 K   8.67 GB/s            216.53 ms
    1m       8.56 K   8.65 GB/s            474.94 ms


120OSDS/pech/chain

         write/iops    write/bw   write/clat_ns/mean
    4k     380.29 K   1.45 GB/s             10.72 ms
    8k     339.10 K   2.59 GB/s             11.99 ms
   16k     280.28 K   4.28 GB/s             14.34 ms
   32k     206.84 K   6.32 GB/s             19.64 ms
   64k     131.57 K   8.05 GB/s             30.54 ms
  128k      74.78 K   9.18 GB/s             54.25 ms
  256k      39.82 K   9.81 GB/s            103.27 ms
  512k      18.47 K   9.17 GB/s            213.78 ms
    1m       8.98 K   9.08 GB/s            461.12 ms


There is a slight bandwidth increase for the "chain" model, but I
would rather attribute it to noise. Other runs with a similar
configuration show much the same results: a minor bandwidth
improvement, but nothing solid.

The "client-based" results are much more interesting:

120OSDS/pech/client-based

         write/iops    write/bw   write/clat_ns/mean
    4k     534.08 K   2.04 GB/s              7.62 ms
    8k     471.78 K   3.60 GB/s              8.64 ms
   16k     367.12 K   5.61 GB/s             11.11 ms
   32k     242.56 K   7.41 GB/s             16.82 ms
   64k     124.54 K   7.63 GB/s             32.98 ms
  128k      62.45 K   7.67 GB/s             66.71 ms
  256k      31.10 K   7.69 GB/s            135.36 ms
  512k      15.41 K   7.71 GB/s            282.41 ms
    1m       7.63 K   7.82 GB/s            567.63 ms

Small blocks show a significant improvement: almost 40%, from 380K
IOPS to 534K IOPS, with latency dropping accordingly. Starting from
the 64k block size the 25Gbit/s client network is saturated
("client-based" replication means the client is responsible for
sending the data to all replicas, so with a 3x replication factor each
byte has to be sent 3 times from each client host; having ~8GB/s over
8 clients we estimate each client sends ~1GB/s of payload, which with
the 3x replication factor becomes ~3GB/s on the wire, and that is
exactly the ~24Gbit/s of the client network).
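
The same back-of-envelope estimate as a tiny program (the 7.63 GB/s
input is the aggregate write bandwidth from the large-block rows of
the table above):

  /* Back-of-envelope check of client network saturation. */
  #include <stdio.h>

  int main(void)
  {
      double aggregate_gbs = 7.63; /* aggregate write bw, GB/s */
      int clients = 8;
      int replicas = 3;

      double payload = aggregate_gbs / clients; /* per client, GB/s */
      double wire    = payload * replicas;      /* per client, GB/s */

      printf("payload %.2f GB/s, wire %.2f GB/s = %.1f Gbit/s\n",
             payload, wire, wire * 8.0);
      return 0;
  }

It prints ~2.9 GB/s on the wire, i.e. ~23Gbit/s per client: close
enough to the 25Gbit/s links to call them saturated.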

What is important to keep in mind about the Pech OSD design is that
each OSD process has only 1 OS thread, so when a request is received
and its handler is executed, no preemption happens and no other
requests can be handled in parallel (unless a special scheduling
routine is called, which it is not, at least in the current state of
the code). So the various PGs on a particular Pech OSD are handled
sequentially, as the sketch below illustrates.
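
A minimal illustration of that run-to-completion model (a toy sketch,
not the actual Pech event loop):

  /* One thread, one loop: each request handler runs to completion
   * before the next event is even looked at, so requests for
   * different PGs on the same OSD are serialized. */
  #include <stdio.h>

  struct request { int pg; int op; };

  static void handle_request(const struct request *req)
  {
      /* replicate, mutate memstore, send ACK - nothing here can be
       * preempted by another request on this OSD */
      printf("PG %d: op %d handled to completion\n", req->pg, req->op);
  }

  int main(void)
  {
      struct request queue[] = { {1, 100}, {7, 101}, {1, 102} };

      for (size_t i = 0; i < sizeof(queue) / sizeof(queue[0]); i++)
          handle_request(&queue[i]);
      return 0;
  }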

The design is highly CPU bound, thus one simple trick can be made to
increase bandwidth: pinning each OSD to a CPU. Since we have 24 OSDs
and 24 CPUs, CPU affinity is easy to apply, for example as sketched
below:
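
A sketch of how the pinning could be done with the standard Linux API
(not necessarily how Pech does it; "taskset -c <cpu>" from a shell
works just as well):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      /* assume the OSD id maps 1:1 onto a CPU number */
      int cpu = argc > 1 ? atoi(argv[1]) : 0;
      cpu_set_t set;

      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      if (sched_setaffinity(0, sizeof(set), &set)) {
          perror("sched_setaffinity");
          return 1;
      }
      printf("pinned to CPU %d\n", cpu);
      /* ... start the OSD here ... */
      return 0;
  }

With affinity applied the "primary-copy" run looks as follows: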

120OSDS-AFF/pech/primary-copy

         write/iops     write/bw   write/clat_ns/mean
    4k     324.15 K    1.24 GB/s             12.35 ms
    8k     293.52 K    2.24 GB/s             13.43 ms
   16k     235.53 K    3.60 GB/s             16.46 ms
   32k     187.31 K    5.73 GB/s             20.77 ms
   64k     170.60 K   10.43 GB/s             23.10 ms
  128k      92.54 K   11.33 GB/s             34.48 ms
  256k      47.69 K   11.73 GB/s             97.32 ms
  512k      18.52 K    9.19 GB/s            252.26 ms
    1m       9.20 K    9.28 GB/s            507.33 ms

Bandwidth looks better for bigger blocks.

In conclusion about the replication models: I did not notice any
significant difference between "primary-copy" and "chain". Perhaps it
makes sense to play with the replication factor.

In its turn, "client-based" replication can be very promising for
loads in homogeneous networks where there is no concurrent access to
images. A simple example is a cluster with compute and storage nodes
in a private network, where VMs access their own images. For such
setups latency is a factor that plays a huge role.

--
Roman

[1] https://github.com/rouming/pech
[2] https://github.com/rouming/ceph/tree/pech-osd
[3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication
[4] https://github.com/rouming/pech/blob/master/scripts/fio-runner.py
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx