Hi Vitali,
Sorry for the late response, I was on vacation.
On 2020-07-20 00:44, vitalif@xxxxxxxxxx wrote:
> Hi Roman,
> It's always really interesting to read your messages :) maybe you'll
> join our telegram chat @ceph_ru? One of your colleagues is there :)
Do you promise a lot of fun? ;)
> Client-based replication is of course the fastest,
With this comparison I answer the question: if it is the fastest, then by
how much. Because the obvious "of course it is the fastest" is, well, not
very well reasoned :)
> but the problem is that it's unclear how to provide consistency with
> it.
Here I tried to highlight client-based replication problems:
https://lists.ceph.io/hyperkitty/list/dev@xxxxxxx/thread/N46NR7NBHWBQL4B2ASU7Y2LMKZZPK3IX/
And yes, it comes with restrictions, but there are scenarios which do not
require the strong sequential consistency of log-based replication: e.g.
if you run 1 rbd client per 1 image, with a filesystem on top that does
journaling and enforces strong request ordering, why not rely on the
filesystem's recovery mechanisms? That is the question which has bothered
me for quite a while, and that is exactly the reason why I started the
Pech OSD: to find some answers.
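To make the client-based replication idea a bit more concrete, here is a
toy sketch (not Pech code, just an illustration, with local files standing
in for replica OSDs): the client itself fans the write out to all replicas
in parallel and completes it only after every replica has acked; ordering
and crash recovery are left to the layer above, e.g. a journaling
filesystem.

/* Toy client-based replication fan-out: local files stand in for the
 * replica OSDs; a real client would send the write over the network.
 * build: cc -pthread client_fanout.c */
#include <sys/types.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR_REPLICAS 3

struct replica_write {
	const char *path;   /* stand-in for a replica OSD address */
	const void *buf;    /* data to replicate */
	size_t      len;
	int         ret;    /* 0 on success, -1 on failure */
};

static void *replica_write_fn(void *arg)
{
	struct replica_write *w = arg;
	int fd = open(w->path, O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0 || write(fd, w->buf, w->len) != (ssize_t)w->len || fsync(fd))
		w->ret = -1;
	else
		w->ret = 0;
	if (fd >= 0)
		close(fd);
	return NULL;
}

int main(void)
{
	/* Local files stand in for the three replica OSDs. */
	const char *replicas[NR_REPLICAS] = { "osd0.img", "osd1.img", "osd2.img" };
	struct replica_write w[NR_REPLICAS];
	pthread_t t[NR_REPLICAS];
	char buf[4096];
	int i, failed = 0;

	memset(buf, 'A', sizeof(buf));

	/* Fan-out: the client issues the write to all replicas in parallel. */
	for (i = 0; i < NR_REPLICAS; i++) {
		w[i] = (struct replica_write){ replicas[i], buf, sizeof(buf) };
		pthread_create(&t[i], NULL, replica_write_fn, &w[i]);
	}
	/* The client write completes only when every replica has acked. */
	for (i = 0; i < NR_REPLICAS; i++) {
		pthread_join(t[i], NULL);
		failed |= (w[i].ret != 0);
	}
	printf("write %s\n", failed ? "FAILED" : "acked by all replicas");
	return failed;
}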
> Maybe it's
> possible with some restrictions, but... I don't think it's possible in
> Ceph :)
No, for sure not, not with RADOS strong consistency semantics. But why
should that stop you? Take what you need and cut what is useless to fit
your requirements.
> By the way, have you tested their Crimson OSD?
Yes, I did. But with the Crimson OSD everything went wrong: the first
problem I came across is that I was not able to reach the desired number
of 120 running OSDs: after various restarts the average number I got from
the monitor was ~50. I did not try to debug it and simply reduced the
number of OSDs to 35 (5 hosts, 7 OSDs on each) and reran all the tests,
so here are the results for all types of OSDs (for a fair comparison):
35OSDS/crimson/primary-copy
  bs      write/iops    write/bw       write/clat_ns/mean
  4k      3.50 K        14.30 MB/s     888.00 ms
  8k      4.97 K        40.37 MB/s     687.27 ms
  16k     4.90 K        79.63 MB/s     709.63 ms
  32k     4.48 K        145.50 MB/s    703.80 ms
  64k     4.46 K        290.72 MB/s    731.06 ms
  128k    4.38 K        570.44 MB/s    720.32 ms
  256k    4.13 K        1.05 GB/s      755.77 ms
  512k    2.56 K        1.32 GB/s      1.15 s
  1m      1.16 K        1.24 GB/s      2.22 s
35OSDS/ceph/primary-copy
  bs      write/iops    write/bw       write/clat_ns/mean
  4k      90.74 K       355.92 MB/s    44.51 ms
  8k      75.03 K       588.98 MB/s    53.45 ms
  16k     92.58 K       1.42 GB/s      42.10 ms
  32k     122.95 K      3.76 GB/s      33.27 ms
  64k     101.28 K      6.20 GB/s      39.45 ms
  128k    45.28 K       5.57 GB/s      83.26 ms
  256k    26.08 K       6.44 GB/s      138.16 ms
  512k    14.49 K       7.23 GB/s      254.74 ms
  1m      5.91 K        6.05 GB/s      588.28 ms
35OSDS/pech/primary-copy
  bs      write/iops    write/bw       write/clat_ns/mean
  4k      289.22 K      1.10 GB/s      14.94 ms
  8k      231.93 K      1.77 GB/s      15.94 ms
  16k     228.60 K      3.49 GB/s      17.28 ms
  32k     208.95 K      6.39 GB/s      19.08 ms
  64k     106.66 K      6.53 GB/s      37.69 ms
  128k    53.48 K       6.57 GB/s      73.03 ms
  256k    25.03 K       6.19 GB/s      139.59 ms
  512k    12.63 K       6.32 GB/s      302.50 ms
  1m      5.91 K        6.05 GB/s      650.03 ms
I did not notice anything strange in Crimson's logs and did not try to
debug further, so I do not know why the results are so bad in the Crimson
case.
> Is it any faster than
> current implementation? (regarding iodepth=1 fsync=1 latency)
My original goal was to test a real distributed load: many OSD hosts,
many client hosts (I was keen to see how Pech behaves). Your "latency"
load does not require a cluster setup and can be executed on localhost
with 3 OSDs (x3 replication), so here are the results:
"-o ms_crc_data=false -o debug_osd=0 -o debug_ms=0"
rbd.fio:
  rw=randwrite
  iodepth=1
  numjobs=1
  runtime=10
  size=256m
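For reference, a local 3-OSD cluster with the "-o" options above can be
brought up with vstart.sh from the build directory, roughly like this
(a sketch; the exact flags and env vars may differ between branches):

MON=1 MGR=1 OSD=3 MDS=0 ../src/vstart.sh -n -x -d \
    -o ms_crc_data=false -o debug_osd=0 -o debug_ms=0

And the full fio job file looks roughly like this (also a sketch: the rbd
ioengine bits, i.e. the client, pool and image names, are placeholders
rather than the exact values used; bs was swept from 4k to 1m between
runs):

[global]
# fio librbd engine; client/pool/image names below are placeholders
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
rw=randwrite
iodepth=1
numjobs=1
runtime=10
size=256m

[randwrite-test]
# bs was varied per run: 4k .. 1m
bs=4k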
/// crimson-osd
4k IOPS=101, BW=406KiB/s, Lat=9846.09usec
8k IOPS=100, BW=802KiB/s, Lat=9973.39usec
16k IOPS=99, BW=1599KiB/s, Lat=10000.17usec
32k IOPS=96, BW=3088KiB/s, Lat=10355.56usec
64k IOPS=591, BW=36.0MiB/s, Lat=1687.63usec
128k IOPS=508, BW=63.6MiB/s, Lat=1963.95usec
256k IOPS=379, BW=94.9MiB/s, Lat=2632.29usec
512k IOPS=338, BW=169MiB/s, Lat=2952.39usec
1m IOPS=166, BW=166MiB/s, Lat=6011.29usec
/// ceph-osd
4k IOPS=1908, BW=7634KiB/s, Lat=522.07usec
8k IOPS=1838, BW=14.4MiB/s, Lat=542.10usec
16k IOPS=1751, BW=27.4MiB/s, Lat=568.98usec
32k IOPS=2048, BW=64.0MiB/s, Lat=486.48usec
64k IOPS=1985, BW=124MiB/s, Lat=501.80usec
128k IOPS=1869, BW=234MiB/s, Lat=532.96usec
256k IOPS=1645, BW=411MiB/s, Lat=605.66usec
512k IOPS=1195, BW=598MiB/s, Lat=833.64usec
1m IOPS=704, BW=705MiB/s, Lat=1414.01usec
/// pech-osd
OSD=X; CEPH=~/devel/ceph-upstream
./pech-osd --mon_addrs 192.168.0.97:50001 --server_ip 0.0.0.0 \
    --name $OSD --fsid `cat $CEPH/build/dev/osd$OSD/fsid` \
    --class_dir $CEPH/build/lib --log_level 5 \
    --replication primary-copy --nocrc
4k IOPS=5618, BW=21.9MiB/s, Lat=176.48usec
8k IOPS=5654, BW=44.2MiB/s, Lat=175.26usec
16k IOPS=5504, BW=86.0MiB/s, Lat=180.20usec
32k IOPS=4976, BW=156MiB/s, Lat=199.37usec
64k IOPS=4334, BW=271MiB/s, Lat=229.09usec
128k IOPS=3397, BW=425MiB/s, Lat=292.52usec
256k IOPS=2392, BW=598MiB/s, Lat=416.12usec
512k IOPS=1505, BW=753MiB/s, Lat=661.25usec
1m IOPS=687, BW=688MiB/s, Lat=1446.60usec
The results should be treated carefully, since in practice these numbers
are almost certainly unreachable, but here you are right: the numbers give
a clear upper bound.
--
Roman
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx