Re: CEPH Erasure Encoding + OSD Scalability

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Mon, 09 Dec 2013 11:03:08 -0600

I will mention that this is a good tool if you want really detailed 
profiling or CPU counter data about what's going on.  Other tools that 
are more generic (i.e. ones that just read data from /proc, e.g. collectl, 
sar, etc.) may also be options.

Mark

On 12/09/2013 10:45 AM, Loic Dachary wrote:
Hi,

Mark Nelson suggested we use perf (linux-tools) for benchmarking. It looks like something that would indeed help: the benchmark program would only concern itself with doing some work according to the options, and performance data would be collected from the outside, using tools that are familiar to people doing benchmarking.

What do you think ?

Cheers

$ perf stat -e
   Error: switch `e' requires a value

  usage: perf stat [<options>] [<command>]

     -e, --event <event>   event selector. use 'perf list' to list available events
         --filter <filter>
                           event filter
     -i, --no-inherit      child tasks do not inherit counters
     -p, --pid <pid>       stat events on existing process id
     -t, --tid <tid>       stat events on existing thread id
     -a, --all-cpus        system-wide collection from all CPUs
     -g, --group           put the counters into a counter group
     -c, --scale           scale/normalize counters
     -v, --verbose         be more verbose (show counter open errors, etc)
     -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
     -n, --null            null run - dont start any counters
     -d, --detailed        detailed run - start a lot of events
     -S, --sync            call sync() before starting a run
     -B, --big-num         print large numbers with thousands' separators
     -C, --cpu <cpu>       list of cpus to monitor in system-wide
     -A, --no-aggr         disable CPU count aggregation
     -x, --field-separator <separator>
                           print counts with custom separator
     -G, --cgroup <name>   monitor event in cgroup name only
     -o, --output <file>   output file name
         --append          append to the output file
         --log-fd <n>      log output to fd, instead of stderr
         --pre <command>   command to run prior to the measured command
         --post <command>  command to run after to the measured command
     -I, --interval-print <n>
                           print counts at regular interval in ms (>= 100)
         --per-socket      aggregate counts per processor socket
         --per-core        aggregate counts per physical processor core
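
As a rough illustration of collecting the numbers "from the outside", here is a minimal Python sketch that only assembles a perf stat command line from options shown in the usage text above (-e, -r, -x, -o). The benchmark binary name ./ec_benchmark and the helper perf_stat_argv are hypothetical, and the event names cycles and instructions are assumed to be available on the machine (check with 'perf list').

```python
import shlex


def perf_stat_argv(command, events=None, repeat=1, output=None, separator=None):
    """Build an argv list for `perf stat` wrapping `command`.

    Only options listed in the usage text are used:
    -e (events), -r (repeat), -x (field separator), -o (output file).
    """
    argv = ["perf", "stat"]
    if events:
        # perf accepts a comma-separated list of event names after -e
        argv += ["-e", ",".join(events)]
    if repeat != 1:
        argv += ["-r", str(repeat)]
    if separator is not None:
        argv += ["-x", separator]
    if output is not None:
        argv += ["-o", output]
    # "--" separates perf's own options from the measured command
    argv += ["--"] + list(command)
    return argv


if __name__ == "__main__":
    # ./ec_benchmark is a hypothetical name for the benchmark program
    argv = perf_stat_argv(
        ["./ec_benchmark"],
        events=["cycles", "instructions"],
        repeat=5,
    )
    print(shlex.join(argv))
    # To actually run it (requires perf to be installed):
    #   import subprocess; subprocess.run(argv, check=True)
```

The benchmark program stays free of any timing code in this arrangement; repetition and averaging are left to perf via -r.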

On 12/11/2013 19:06, Loic Dachary wrote:
Hi Andreas,

On 12/11/2013 02:11, Andreas Joachim Peters wrote:
Hi Loic,

I am finally doing the benchmark tool and I found a bunch of wrong parameter checks which can make the whole thing SEGV.

All the RAID-6 codes have restrictions on their parameters, but these are not correctly enforced for the Liberation & Blaum-Roth codes in the CEPH wrapper class ... see the text from the PDF:

"Minimal Density RAID-6 codes are MDS codes based on binary matrices which satisfy a lower-bound on the number of non-zero entries. Unlike Cauchy coding, the bit-matrix elements do not correspond to elements in GF(2^w). Instead, the bit-matrix itself has the proper MDS property. Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy Reed-Solomon codes for the same parameters. Liberation coding, Liber8tion coding, and Blaum-Roth coding are three examples of this kind of coding that are supported in jerasure.

With each of these codes, m must be equal to two and k must be less than or equal to w. The value of w has restrictions based on the code:

• With Liberation coding, w must be a prime number [Pla08b].
• With Blaum-Roth coding, w + 1 must be a prime number [BR99].
• With Liber8tion coding, w must equal 8 [Pla08a].

...

Will you add these fixes?

Nice catch. I created an issue and assigned it to myself: http://tracker.ceph.com/issues/6754
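
The restrictions quoted from the PDF above can be written down as a small parameter check. This is only a sketch in Python of the rules as quoted (m equal to two, k at most w, and the per-code condition on w), not the actual check in the CEPH wrapper class. The function name check_raid6_parameters and the technique string "blaum_roth" are assumptions made for illustration.

```python
def is_prime(n):
    """Trial-division primality test, sufficient for small w."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def check_raid6_parameters(technique, k, m, w):
    """Return a list of violated constraints (empty if the parameters are valid).

    Rules taken from the quoted text:
      - m must be equal to two
      - k must be less than or equal to w
      - liberation: w must be prime
      - blaum_roth: w + 1 must be prime
      - liber8tion: w must equal 8
    """
    if technique not in ("liberation", "blaum_roth", "liber8tion"):
        raise ValueError("not a minimal density RAID-6 technique: %r" % technique)

    errors = []
    if m != 2:
        errors.append("m must be equal to 2")
    if k > w:
        errors.append("k must be less than or equal to w")
    if technique == "liberation" and not is_prime(w):
        errors.append("w must be a prime number")
    if technique == "blaum_roth" and not is_prime(w + 1):
        errors.append("w + 1 must be a prime number")
    if technique == "liber8tion" and w != 8:
        errors.append("w must equal 8")
    return errors


if __name__ == "__main__":
    print(check_raid6_parameters("liberation", k=7, m=2, w=7))
    print(check_raid6_parameters("blaum_roth", k=6, m=2, w=7))
```

Rejecting invalid combinations up front with an error message would be one way to avoid reaching the coding routines with parameters they cannot handle.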

The benchmark suite currently runs 308 different configurations for the 2 algorithms that make sense from a performance point of view, and produces this output:

# -----------------------------------------------------------------
# Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@xxxxxxx
# Ram-Size=12614856704 Allocation-Size=100000000
# -----------------------------------------------------------------
# [ -BENCH- ] [       ] technique=memcpy                                                            speed=5.408 [GB/s] latency=9.245 ms
# [ -BENCH- ] [       ] technique=d=a^b^c-xor                                                       speed=4.377 [GB/s] latency=17.136 ms
# [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000          speed=1.308 [GB/s] latency=038	[ms] size-overhead=40	[%]
..
..
# [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000          speed=0.083 [GB/s] latency=604	[ms] size-overhead=16	[%]
# -----------------------------------------------------------------
# Erasure Code Performance Summary::
# -----------------------------------------------------------------
# RAM:                   12.61 GB
# Allocation-Size        0.10 GB
# -----------------------------------------------------------------
# Byte Initialization:   29.35 MB/s
# Memcpy:                5.41 GB/s
# Triple-XOR:            4.38 GB/s
# -----------------------------------------------------------------
# Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
# Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
# Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
# -----------------------------------------------------------------
# .................................................................
# Top 1  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
# Top 2  RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
# Top 3  RAID6          2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
# Top 4  RAID6          2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
# Top 5  RAID6          2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
# .................................................................
# Top 1  Triple         0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
# Top 2  Triple         0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
# Top 3  Triple         0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
# Top 4  Triple         0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
# Top 5  Triple         0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
# .................................................................
# Top 1  Quadr.         0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
# Top 2  Quadr.         0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
# Top 3  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
# Top 4  Quadr.         0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
# Top 5  Quadr.         0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
# .................................................................

It takes around 30 seconds on my box.

That looks great :-) If I understand correctly, it means https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as they are moved to a separate program. Correct ?

I will add a measurement of how the XOR and the top 3 algorithms scale with the number of cores, and make the object size configurable from the command line. Anything else ?

It would be convenient to run this from a "workunit" ( i.e. a script in ceph/qa/workunits/ ) so that it can later be run by teuthology integration tests. That could be used to show regression.
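
To make the regression idea concrete, here is a minimal Python sketch of what such a script could do with the benchmark output shown earlier: extract the speed per technique from the "[ -BENCH- ]" lines and flag techniques that got slower than a saved baseline. The function names and the 10% tolerance are assumptions for illustration; nothing here is an existing teuthology or workunit API.

```python
import re

# Matches e.g. "... technique=memcpy        speed=5.408 [GB/s] ..."
BENCH_LINE = re.compile(r"technique=(\S+)\s+speed=([0-9.]+)\s+\[GB/s\]")


def parse_speeds(text):
    """Map each technique string to its reported speed in GB/s."""
    speeds = {}
    for line in text.splitlines():
        if "-BENCH-" not in line:
            continue  # skip headers and the summary section
        match = BENCH_LINE.search(line)
        if match:
            speeds[match.group(1)] = float(match.group(2))
    return speeds


def find_regressions(baseline, current, tolerance=0.10):
    """Return techniques whose speed dropped by more than `tolerance`.

    Techniques missing from either run are ignored.
    """
    slower = []
    for technique, old_speed in baseline.items():
        new_speed = current.get(technique)
        if new_speed is not None and new_speed < old_speed * (1.0 - tolerance):
            slower.append(technique)
    return sorted(slower)


if __name__ == "__main__":
    sample = (
        "# [ -BENCH- ] [       ] technique=memcpy      speed=5.408 [GB/s] latency=9.245 ms\n"
        "# [ -BENCH- ] [       ] technique=d=a^b^c-xor speed=4.377 [GB/s] latency=17.136 ms\n"
    )
    baseline = parse_speeds(sample)
    current = dict(baseline, memcpy=4.0)  # pretend memcpy got slower
    print(find_regressions(baseline, current))
```

A script along these lines could exit non-zero when the list is not empty, so that a test run fails visibly on a slowdown.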

Shall I add the possibility to test a single user-specified configuration via command line arguments?

I would need to play with it to comment usefully.

Cheers
