RE: CEPH Erasure Encoding + OSD Scalability

Hi Loic et al, 
Dan pointed me to this:

http://sourceforge.net/p/snapraid/code/ci/master/tree/raid.c

It has a very straightforward API and a GPL license ...

The implementation seems more performant than the current Jerasure library, probably due to the use of SSSE3 extensions and slightly less flexibility ... maybe it is worth a plugin, or could even become "the" plugin? It also seems worth rewriting the XOR function I use with the SSE2 assembler XOR ...
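
Just to illustrate, the inner loop of such a rewrite could look like this (a minimal sketch, not snapraid's actual code: the function name is mine, and it assumes 16-byte aligned buffers and a length that is a multiple of 16):

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Sketch of an SSE2 XOR kernel computing d = a ^ b ^ c.
// Assumes all pointers are 16-byte aligned and len is a multiple of 16;
// a production version would need alignment checks and a scalar tail loop.
void xor3_sse2(const char* a, const char* b, const char* c,
               char* d, size_t len)
{
    for (size_t i = 0; i < len; i += 16) {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i vc = _mm_load_si128(reinterpret_cast<const __m128i*>(c + i));
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i),
                        _mm_xor_si128(_mm_xor_si128(va, vb), vc));
    }
}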

It also comes with a nice benchmark tool; here are the results on my 'standard' Xeon for a 4 MB block with 8 data disks + parity disks:

./snapraid -T
snapraid v5.0 by Andrea Mazzoleni, http://snapraid.sourceforge.net
Compiler gcc 4.8.1
CPU GenuineIntel, family 6, model 26, flags mmx sse2 ssse3 sse42
Memory is little-endian 64-bit
Support nanosecond timestamps with futimens()


Speed test using 8 buffers of 524288 bytes, for a total of 4096 KiB.
The reported value is the sustainable aggregate bandwidth of all data disks in MiB/s (not counting parity disks).

Memory write speed using the C memset() function:
  memset   15873

CRC used to check the content file integrity:
   table     857
   intel    6689

Hash used to check the data blocks integrity:
            best murmur3 spooky2
    hash spooky2    2987    6998

RAID functions used for computing the parity with 'sync':
            best    int8   int32   int64    sse2   sse2e   ssse3  ssse3e
    par1    sse2            6201   11080   19404
    par2   sse2e            1851    3462    9949   10359
    parz   sse2e            1134    2020    5157    5738
    par3  ssse3e     421                                    4766    5225
    par4  ssse3e     303                                    3449    3844
    par5  ssse3e     241                                    2750    2830
    par6  ssse3e     198                                    2189    2261

RAID functions used for recovering with 'fix':
            best    int8   ssse3
    rec1   ssse3     496    1029
    rec2   ssse3     208     477
    rec3   ssse3      51     261
    rec4   ssse3      33     170
    rec5   ssse3      22     112
    rec6   ssse3      16      86



________________________________________
From: Loic Dachary [loic@xxxxxxxxxxx]
Sent: 12 November 2013 19:06
To: Andreas Joachim Peters
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

On 12/11/2013 02:11, Andreas Joachim Peters wrote:
> Hi Loic,
>
> I am finally working on the benchmark tool, and I found a bunch of incorrect parameter checks which can make the whole thing SEGV.
>
> All the RAID-6 codes have restrictions on their parameters, but they are not correctly enforced for the Liberation & Blaum-Roth codes in the CEPH wrapper class ... see this text from the PDF:
>
> "Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number  of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF (2 w ). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.
>
> With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:
>
> • With Liberation coding, w must be a prime number [Pla08b].
> • With Blaum-Roth coding, w + 1 must be a prime number [BR99].
> • With Liber8tion coding, w must equal 8 [Pla08a].
>
> ...
>
> Will you add these fixes?

Nice catch. I created and assigned it to myself: http://tracker.ceph.com/issues/6754
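
For reference, the enforcement described in the quoted text could look roughly like this (a minimal sketch, not the actual Ceph wrapper code; is_prime and the technique strings are illustrative):

#include <cerrno>
#include <string>

// Sketch of parameter validation for the Minimal Density RAID-6 codes,
// following the restrictions quoted above: m == 2, k <= w, plus a
// per-code constraint on w.
static bool is_prime(int n) {
    if (n < 2) return false;
    for (int i = 2; i * i <= n; ++i)
        if (n % i == 0) return false;
    return true;
}

int check_minimal_density_params(const std::string& technique,
                                 int k, int m, int w) {
    if (m != 2) return -EINVAL;   // these codes require m == 2
    if (k > w)  return -EINVAL;   // k must be <= w
    if (technique == "liberation" && !is_prime(w))
        return -EINVAL;           // w must be prime [Pla08b]
    if (technique == "blaum_roth" && !is_prime(w + 1))
        return -EINVAL;           // w + 1 must be prime [BR99]
    if (technique == "liber8tion" && w != 8)
        return -EINVAL;           // w must equal 8 [Pla08a]
    return 0;
}
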
>
> The benchmark suite currently runs 308 different configurations for the two algorithms which make sense from a performance point of view, and produces this output:
>
>
> # -----------------------------------------------------------------
> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@xxxxxxx
> # Ram-Size=12614856704 Allocation-Size=100000000
> # -----------------------------------------------------------------
> # [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
> # [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038    [ms] size-overhead=40   [%]
> ..
> ..
> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604    [ms] size-overhead=16   [%]
> # -----------------------------------------------------------------
> # Erasure Code Performance Summary::
> # -----------------------------------------------------------------
> # RAM:                   12.61 GB
> # Allocation-Size        0.10 GB
> # -----------------------------------------------------------------
> # Byte Initialization:   29.35 MB/s
> # Memcpy:                5.41 GB/s
> # Triple-XOR:            4.38 GB/s
> # -----------------------------------------------------------------
> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
> # -----------------------------------------------------------------
> # .................................................................
> # Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
> # Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
> # Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
> # Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
> # Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
> # .................................................................
> # Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
> # Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
> # Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
> # Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
> # Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
> # .................................................................
> # Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
> # Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
> # Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
> # .................................................................
>
> It takes around 30 seconds on my box.


That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer contain benchmarks, as they are moved to a separate program. Correct?

> I will add a measurement of how the XOR and the top 3 algorithms scale with the number of cores, and make the object size configurable from the command line. Anything else?

It would be convenient to run this from a "workunit" (i.e. a script in ceph/qa/workunits/) so that it can later be run by teuthology integration tests. That could be used to detect regressions.
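
For the core-scaling measurement mentioned above, something like the following sketch could serve as a starting point (std::thread based; encode_chunk is a hypothetical stand-in for whichever encode call is being timed):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for the encode call being timed: a trivial
// XOR pass over a private buffer, just to keep the sketch self-contained.
void encode_chunk(size_t bytes) {
    std::vector<char> buf(bytes, 1);
    char acc = 0;
    for (char b : buf) acc ^= b;
    volatile char sink = acc;
    (void)sink;
}

// Time the same total workload split across 1..max_threads cores to see
// how the aggregate throughput scales with the number of cores.
void scaling_benchmark(size_t total_bytes, unsigned max_threads) {
    for (unsigned n = 1; n <= max_threads; ++n) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < n; ++t)
            workers.emplace_back(encode_chunk, total_bytes / n);
        for (auto& w : workers)
            w.join();
        std::chrono::duration<double> secs =
            std::chrono::steady_clock::now() - start;
        std::printf("threads=%u speed=%.3f GB/s\n",
                    n, total_bytes / secs.count() / 1e9);
    }
}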

> Shall I add the possibility to test a single user-specified configuration via command-line arguments?
>
I would need to play with it to comment usefully.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre



