Hi Roman,
It's always really interesting to read your messages :) maybe you'll
join our telegram chat @ceph_ru? One of your colleagues is there :)
Client-based replication is of course the fastest, but the problem is
that it's unclear how to provide consistency with it. Maybe it's
possible with some restrictions, but... I don't think it's possible in
Ceph :)
By the way, have you tested their Crimson OSD? Is it any faster than
the current implementation (in terms of iodepth=1 fsync=1 latency)?
Hi all,
I would like to share a comparison of 3 replication models based on a
Pech OSD [1] cluster, which supports the bare minimum needed to replicate
transactions from OSD to OSD and keeps all data mutations in memory
(memstore).
My goal was to compare the "primary-copy", "chain" and "client-based"
replication models and answer the question: how does each model affect
network performance?
For this estimation I chose to implement my own OSD with the bare minimum
of features (laborious but worth it). Its design is similar to Crimson OSD,
but its core is based on sources from the kernel libceph implementation
(i.e. messenger, osdmap, mon_client, etc.), so it is written in pure C.
-- What Pech OSD supports and what it does not --
Comparing the network response under different replication
scenarios does not require fail-over (we assume that during testing
storing data in memory never fails, hosts never crash, etc.), so to
ease development Pech OSD does not support peering and fail-over in the
current state of the code. Each object mutation is replicated, but the
cluster is not able to return to a consistent state after an error.
Pech OSD supports RBD images, so an image can be accessed from
userspace librbd or mapped as a kernel RBD block device. That is the bare
minimum I need to run FIO loads and test network behavior.
-- What I test --
Originally my goal was to compare performance under the same loads but
using different replication models: "client-based", "primary-copy" and
"chain". I want to see in numbers what the different models can bring in
terms of network bandwidth, latency and IOPS (when we compare
replication models, the network is the only factor that affects
overall performance).
Briefly, the replication models are:
"client-based" - the client itself is responsible for sending requests to
replicas. To test this model the OSD client code was modified on the
userspace [2] and kernel [3] sides. Pros: fewer network hops,
which reduces latency. Cons: complications in the replication algorithm
when a PG is not healthy, complications in the replication algorithm when
there is concurrent access to the same object from different
clients, and the client network has to be fat enough.
"chain" - the client sends a write request to the primary, the primary
forwards it to the next secondary, and so on. The final ACK from the last
replica in the chain reaches the primary or the client directly. Pros: each
OSD sends a request only once, which reduces the network load on a
particular node and spreads the load. Overall bandwidth should increase.
Cons: sequential request processing, which should impact latency.
"primary-copy" - the default and only model in Ceph: the client
accesses the primary replica, and the primary replica fans out data to
the secondaries. Pros: already implemented. Cons: higher latency
compared to "client-based", lower bandwidth compared to "chain".
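The trade-offs listed above can be put into a rough back-of-the-envelope
model. The sketch below is only an illustration, not Pech or Ceph code: it
counts one-way messages on the critical path of a single write and how many
copies of the payload each node has to send. The function names are made up
for this example, and it assumes that in "chain" the final ACK goes from the
last replica directly to the client.

```python
def critical_path_hops(model: str, replicas: int = 3) -> int:
    """One-way messages between sending a write and receiving its final ACK."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if model == "client-based":
        # client -> all replicas in parallel, replicas -> client
        return 2
    if model == "primary-copy":
        # client -> primary, primary -> secondaries (in parallel),
        # secondaries -> primary, primary -> client
        return 4 if replicas > 1 else 2
    if model == "chain":
        # client -> primary, (replicas - 1) forwards, last replica -> client
        # (assumption: the ACK does not travel back through the primary)
        return replicas + 1
    raise ValueError(f"unknown model: {model}")


def payload_copies_sent(model: str, replicas: int = 3) -> dict:
    """Copies of the payload sent by the client and by the busiest OSD."""
    if model == "client-based":
        return {"client": replicas, "osd": 0}
    if model == "primary-copy":
        return {"client": 1, "osd": replicas - 1}
    if model == "chain":
        return {"client": 1, "osd": 1 if replicas > 1 else 0}
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    for name in ("client-based", "primary-copy", "chain"):
        print(name, critical_path_hops(name), payload_copies_sent(name))
```

With 3 replicas this model gives "primary-copy" and "chain" the same number
of hops, while "client-based" halves the hops at the cost of tripling the
client's outgoing traffic.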
What is said above is the theory, which motivated me to prove or
disprove it with numbers on a real cluster.
-- How I test --
I have a cluster at my disposal with 5 OSD hosts on a 100 Gbit/s network
and 8 client hosts on a 25 Gbit/s network.
Each OSD host has 24 CPUs, so each host runs 24 OSDs (one per CPU),
which gives 120 (24x5) Pech OSDs for the whole cluster setup.
There is one fully declustered pool with 1024 PGs (I want to spread
the load as much as possible). The pool is created with a 3x replication
factor.
Each client starts 16 FIO jobs doing random writes to 16 RBD images
(userspace RBD client) with various block sizes, i.e. one FIO job per
image and 128 (16x8) jobs in total. Each client host runs a FIO server;
the data from all servers is aggregated by the FIO client and stored in
JSON format. There is a convenient Python script [4] which generates
and runs FIO jobs, parses the JSON results and outputs them as a
human-readable table.
Major FIO options:
ioengine=rbd
clientname=admin
pool=rbd
rw=randwrite
size=256m
time_based=1
runtime=10
ramp_time=10
iodepth=32
numjobs=1
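To show how the options above fit together with "one FIO job per image",
here is a minimal sketch that renders a FIO job file for one block size.
It is not the fio-runner.py script from [4]; the function name and the
image name prefix "fio-img-" are hypothetical, and only the option values
listed above are taken from the test setup.

```python
# Option values copied from the list above; everything else is illustrative.
GLOBAL_OPTIONS = {
    "ioengine": "rbd",
    "clientname": "admin",
    "pool": "rbd",
    "rw": "randwrite",
    "size": "256m",
    "time_based": "1",
    "runtime": "10",
    "ramp_time": "10",
    "iodepth": "32",
    "numjobs": "1",
}


def make_fio_job(block_size: str, images: int = 16,
                 image_prefix: str = "fio-img-") -> str:
    """Render a FIO job file with one job section per RBD image."""
    lines = ["[global]"]
    lines += [f"{key}={value}" for key, value in GLOBAL_OPTIONS.items()]
    lines.append(f"bs={block_size}")
    for index in range(images):
        # Hypothetical image names: fio-img-0 ... fio-img-15
        lines += ["", f"[job-{index}]", f"rbdname={image_prefix}{index}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_fio_job("4k"))
```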
During all tests I collected almost 1 GB of JSON results, which is
plenty for a good analysis.
-- Results --
First I would like to compare "primary-copy" and "chain"
on Pech OSD:
120OSDS/pech/primary-copy
bs      write/iops   write/bw     write/clat_ns/mean
4k      365.89 K     1.40 GB/s     11.11 ms
8k      330.51 K     2.52 GB/s     12.22 ms
16k     274.06 K     4.19 GB/s     14.79 ms
32k     204.36 K     6.25 GB/s     19.95 ms
64k     141.78 K     8.68 GB/s     28.54 ms
128k     70.42 K     8.64 GB/s     58.99 ms
256k     37.75 K     9.30 GB/s    109.75 ms
512k     17.46 K     8.67 GB/s    216.53 ms
1m        8.56 K     8.65 GB/s    474.94 ms
120OSDS/pech/chain
bs      write/iops   write/bw     write/clat_ns/mean
4k      380.29 K     1.45 GB/s     10.72 ms
8k      339.10 K     2.59 GB/s     11.99 ms
16k     280.28 K     4.28 GB/s     14.34 ms
32k     206.84 K     6.32 GB/s     19.64 ms
64k     131.57 K     8.05 GB/s     30.54 ms
128k     74.78 K     9.18 GB/s     54.25 ms
256k     39.82 K     9.81 GB/s    103.27 ms
512k     18.47 K     9.17 GB/s    213.78 ms
1m        8.98 K     9.08 GB/s    461.12 ms
There is a slight bandwidth increase for the "chain" model, but I would
rather treat it as noise. Other runs with a similar configuration show
nearly the same results: there is a minor bandwidth improvement, but
it is not consistent.
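To make "slight difference" concrete, the small sketch below computes the
relative bandwidth change of "chain" over "primary-copy" per block size.
The bandwidth values are copied from the two tables above; the helper name
is made up for this example.

```python
BLOCK_SIZES = ["4k", "8k", "16k", "32k", "64k", "128k", "256k", "512k", "1m"]

# write/bw in GB/s, copied from the two tables above
PRIMARY_COPY_BW = [1.40, 2.52, 4.19, 6.25, 8.68, 8.64, 9.30, 8.67, 8.65]
CHAIN_BW = [1.45, 2.59, 4.28, 6.32, 8.05, 9.18, 9.81, 9.17, 9.08]


def relative_diff_percent(base: float, other: float) -> float:
    """Change of `other` relative to `base`, in percent."""
    return (other - base) / base * 100.0


DIFFS = {
    bs: relative_diff_percent(pc, ch)
    for bs, pc, ch in zip(BLOCK_SIZES, PRIMARY_COPY_BW, CHAIN_BW)
}

if __name__ == "__main__":
    for bs, diff in DIFFS.items():
        print(f"{bs:>5}  {diff:+.1f}%")
```

Every block size stays within a few percent, and 64k even goes the other
way, which is consistent with reading the difference as noise.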
Client-based results are much more interesting:
120OSDS/pech/client-based
bs      write/iops   write/bw     write/clat_ns/mean
4k      534.08 K     2.04 GB/s      7.62 ms
8k      471.78 K     3.60 GB/s      8.64 ms
16k     367.12 K     5.61 GB/s     11.11 ms
32k     242.56 K     7.41 GB/s     16.82 ms
64k     124.54 K     7.63 GB/s     32.98 ms
128k     62.45 K     7.67 GB/s     66.71 ms
256k     31.10 K     7.69 GB/s    135.36 ms
512k     15.41 K     7.71 GB/s    282.41 ms
1m        7.63 K     7.82 GB/s    567.63 ms
Small blocks show a significant improvement: at 4k, IOPS grow by almost
40%, from 380k (the "chain" result) to 534k, and mean latency drops from
10.72 ms to 7.62 ms. Starting from the 64k block size the 25 Gbit/s client
network is saturated: with "client-based" replication the client is
responsible for sending the data to all replicas, so with a 3x replication
factor each byte is sent 3 times from each client host. With ~8 GB/s
across 8 clients, each client writes ~1 GB/s, which with 3x replication
is ~3 GB/s on the wire, i.e. ~24 Gbit/s of the client network.
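The estimate above can be written down as a small calculation. This is just
the arithmetic from the paragraph, with a hypothetical function name; it
ignores protocol overhead.

```python
def client_link_gbit(aggregate_gbytes_per_s: float, clients: int,
                     replicas: int) -> float:
    """Outgoing traffic per client host in Gbit/s for client-based replication."""
    per_client_gbytes = aggregate_gbytes_per_s / clients
    on_wire_gbytes = per_client_gbytes * replicas  # every byte is sent `replicas` times
    return on_wire_gbytes * 8  # bytes -> bits


if __name__ == "__main__":
    # ~8 GB/s aggregate, 8 client hosts, 3x replication
    print(client_link_gbit(8.0, 8, 3), "Gbit/s of a 25 Gbit/s link")
```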
What is important to keep in mind about the Pech OSD design is that each
OSD process has only 1 OS thread, so when a request is received and its
handler is executed, no preemption happens and no other requests can be
handled in parallel (unless a special scheduling routine is called, which
is not the case, at least in the current state of the code). So the
various PGs on a particular Pech OSD are handled sequentially.
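The run-to-completion behaviour described above can be illustrated with a
toy single-threaded scheduler. This is only an analogy written with Python
generators, not Pech code (Pech is written in C); all names here are
invented for the example. A handler that never reaches a scheduling point
finishes before the next one starts, and only an explicit yield lets
requests for different PGs interleave.

```python
from collections import deque


def handler(pg, log, yield_midway):
    """Toy request handler; `yield` stands in for a scheduling point."""
    log.append((pg, "start"))
    if yield_midway:
        yield  # explicit scheduling point
    log.append((pg, "end"))


def run_single_thread(tasks):
    """Round-robin over the handlers on one thread, without preemption."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)  # run until the next scheduling point or completion
        except StopIteration:
            continue
        ready.append(task)


def simulate(yield_midway):
    log = []
    run_single_thread([handler(pg, log, yield_midway) for pg in ("pg1", "pg2")])
    return log


if __name__ == "__main__":
    print(simulate(False))  # handlers run strictly one after another
    print(simulate(True))   # handlers interleave at the scheduling point
```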
The design is highly CPU bound, so one simple trick can increase
bandwidth: pinning each OSD to a CPU. Since we have 24 OSDs and 24
CPUs per host, CPU affinity is easy to apply:
120OSDS-AFF/pech/primary-copy
bs      write/iops   write/bw     write/clat_ns/mean
4k      324.15 K     1.24 GB/s     12.35 ms
8k      293.52 K     2.24 GB/s     13.43 ms
16k     235.53 K     3.60 GB/s     16.46 ms
32k     187.31 K     5.73 GB/s     20.77 ms
64k     170.60 K    10.43 GB/s     23.10 ms
128k     92.54 K    11.33 GB/s     34.48 ms
256k     47.69 K    11.73 GB/s     97.32 ms
512k     18.52 K     9.19 GB/s    252.26 ms
1m        9.20 K     9.28 GB/s    507.33 ms
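Bandwidth looks better for bigger blocks (64k and up, most visibly 64k to
256k), while small-block IOPS are somewhat lower than without pinning.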
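How the pinning was applied in these runs is not stated; one OSD per CPU
can be achieved with a tool like taskset or from inside the process. The
sketch below shows the in-process variant in Python for illustration only.
os.sched_setaffinity is a real call but is available on Linux only, and the
helper names and the OSD-index-to-CPU mapping are assumptions of this
example.

```python
import os


def cpu_for_osd(osd_index: int, cpus: list) -> int:
    """Map the N-th OSD on a host to a CPU, wrapping around if needed."""
    return cpus[osd_index % len(cpus)]


def pin_current_process(cpu: int) -> bool:
    """Pin the calling process to one CPU; returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # not Linux
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return True


if __name__ == "__main__":
    host_cpus = list(range(24))   # 24 CPUs per OSD host, as in the setup above
    local_osd_index = 5           # hypothetical: the 6th OSD on this host
    target = cpu_for_osd(local_osd_index, host_cpus)
    print("pinned" if pin_current_process(target) else "unsupported", target)
```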
In conclusion, about the replication models: I did not notice any
significant difference between "primary-copy" and "chain". Perhaps it
makes sense to play with the replication factor.
"Client-based" replication, in turn, can be very promising for loads
in homogeneous networks where there is no concurrent access to
images. A simple example is a cluster with compute and storage nodes in
a private network, where VMs access their own images. For such setups
latency plays a huge role.
--
Roman
[1] https://github.com/rouming/pech
[2] https://github.com/rouming/ceph/tree/pech-osd
[3] https://github.com/rouming/linux/tree/akpm--ceph-client-based-replication
[4] https://github.com/rouming/pech/blob/master/scripts/fio-runner.py
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx