Maybe using http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html
is enough. fsbench looks overkill indeed.

/me exploring options ;-)

On 09/12/2013 17:45, Loic Dachary wrote:
> Hi,
>
> Mark Nelson suggested we use perf ( linux-tools ) for benchmarking. It
> looks like something that would help indeed : the benchmark program would
> only concern itself with doing some work according to the options and let
> performance be collected from the outside, using tools that are familiar
> to people doing benchmarking.
>
> What do you think ?
>
> Cheers
>
> $ perf stat -e
>  Error: switch `e' requires a value
>
>  usage: perf stat [<options>] [<command>]
>
>     -e, --event <event>   event selector. use 'perf list' to list available events
>         --filter <filter>
>                           event filter
>     -i, --no-inherit      child tasks do not inherit counters
>     -p, --pid <pid>       stat events on existing process id
>     -t, --tid <tid>       stat events on existing thread id
>     -a, --all-cpus        system-wide collection from all CPUs
>     -g, --group           put the counters into a counter group
>     -c, --scale           scale/normalize counters
>     -v, --verbose         be more verbose (show counter open errors, etc)
>     -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
>     -n, --null            null run - dont start any counters
>     -d, --detailed        detailed run - start a lot of events
>     -S, --sync            call sync() before starting a run
>     -B, --big-num         print large numbers with thousands' separators
>     -C, --cpu <cpu>       list of cpus to monitor in system-wide
>     -A, --no-aggr         disable CPU count aggregation
>     -x, --field-separator <separator>
>                           print counts with custom separator
>     -G, --cgroup <name>   monitor event in cgroup name only
>     -o, --output <file>   output file name
>         --append          append to the output file
>         --log-fd <n>      log output to fd, instead of stderr
>         --pre <command>   command to run prior to the measured command
>         --post <command>  command to run after to the measured command
>     -I, --interval-print <n>
>                           print counts at regular interval in ms (>= 100)
>         --per-socket      aggregate counts per processor socket
>         --per-core        aggregate counts per physical processor core
>
>
> On 12/11/2013 19:06, Loic Dachary wrote:
>> Hi Andreas,
>>
>> On 12/11/2013 02:11, Andreas Joachim Peters wrote:
>>> Hi Loic,
>>>
>>> I am finally doing the benchmark tool and I found a bunch of wrong
>>> parameter checks which can make the whole thing SEGV.
>>>
>>> All the RAID-6 codes have restrictions on the parameters but they are
>>> not correctly enforced for Liberation & Blaum-Roth codes in the CEPH
>>> wrapper class ... see text from PDF
>>>
>>> "Minimal Density RAID-6 codes are MDS codes based on binary matrices
>>> which satisfy a lower-bound on the number of non-zero entries. Unlike
>>> Cauchy coding, the bit-matrix elements do not correspond to elements in
>>> GF(2^w). Instead, the bit-matrix itself has the proper MDS property.
>>> Minimal Density RAID-6 codes perform faster than Reed-Solomon and Cauchy
>>> Reed-Solomon codes for the same parameters. Liberation coding,
>>> Liber8tion coding, and Blaum-Roth coding are three examples of this kind
>>> of coding that are supported in jerasure.
>>>
>>> With each of these codes, m must be equal to two and k must be less than
>>> or equal to w. The value of w has restrictions based on the code:
>>>
>>> • With Liberation coding, w must be a prime number [Pla08b].
>>> • With Blaum-Roth coding, w + 1 must be a prime number [BR99].
>>> • With Liber8tion coding, w must equal 8 [Pla08a].
>>>
>>> ...
>>>
>>> Will you add these fixes?
>>
>> Nice catch. I created and assigned to myself : http://tracker.ceph.com/issues/6754
>>
>>>
>>> For the benchmark suite it currently runs 308 different configurations
>>> for the 2 algorithms which make sense from the performance point of view
>>> and provides this output:
>>>
>>>
>>> # -----------------------------------------------------------------
>>> # Erasure Coding Benchmark - (C) CERN 2013 - Andreas.Joachim.Peters@xxxxxxx
>>> # Ram-Size=12614856704 Allocation-Size=100000000
>>> # -----------------------------------------------------------------
>>> # [ -BENCH- ] [       ] technique=memcpy speed=5.408 [GB/s] latency=9.245 ms
>>> # [ -BENCH- ] [       ] technique=d=a^b^c-xor speed=4.377 [GB/s] latency=17.136 ms
>>> # [ -BENCH- ] [001/304] technique=cauchy_good:k=05:m=2:w=8:lp=0:packet=00064:size=50000000 speed=1.308 [GB/s] latency=038 [ms] size-overhead=40 [%]
>>> ..
>>> ..
>>> # [ -BENCH- ] [304/304] technique=liberation:k=24:m=2:w=29:lp=2:packet=65536:size=50000000 speed=0.083 [GB/s] latency=604 [ms] size-overhead=16 [%]
>>> # -----------------------------------------------------------------
>>> # Erasure Code Performance Summary::
>>> # -----------------------------------------------------------------
>>> # RAM:                   12.61 GB
>>> # Allocation-Size        0.10 GB
>>> # -----------------------------------------------------------------
>>> # Byte Initialization:   29.35 MB/s
>>> # Memcpy:                5.41 GB/s
>>> # Triple-XOR:            4.38 GB/s
>>> # -----------------------------------------------------------------
>>> # Fastest RAID6          2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>>> # Fastest Triple Failure 0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Fastest Quadr. Failure 0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # -----------------------------------------------------------------
>>> # .................................................................
>>> # Top 1 RAID6            2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2 RAID6            2.72 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=16384:size=50000000
>>> # Top 3 RAID6            2.64 GB/s liber8tion:k=06:m=2:w=8:lp=0:packet=65536:size=50000000
>>> # Top 4 RAID6            2.60 GB/s liberation:k=07:m=2:w=7:lp=0:packet=16384:size=50000000
>>> # Top 5 RAID6            2.59 GB/s liberation:k=05:m=2:w=7:lp=0:packet=04096:size=50000000
>>> # .................................................................
>>> # Top 1 Triple           0.96 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2 Triple           0.94 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=16384:size=50000000
>>> # Top 3 Triple           0.93 GB/s cauchy_good:k=06:m=3:w=8:lp=0:packet=65536:size=50000000
>>> # Top 4 Triple           0.89 GB/s cauchy_good:k=07:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # Top 5 Triple           0.87 GB/s cauchy_good:k=05:m=3:w=8:lp=0:packet=04096:size=50000000
>>> # .................................................................
>>> # Top 1 Quadr.           0.66 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 2 Quadr.           0.65 GB/s cauchy_good:k=07:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 3 Quadr.           0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=16384:size=50000000
>>> # Top 4 Quadr.           0.64 GB/s cauchy_good:k=05:m=4:w=8:lp=0:packet=04096:size=50000000
>>> # Top 5 Quadr.           0.64 GB/s cauchy_good:k=06:m=4:w=8:lp=0:packet=65536:size=50000000
>>> # .................................................................
>>>
>>> It takes around 30 seconds on my box.
>>
>>
>> That looks great :-) If I understand correctly, it means
>> https://github.com/ceph/ceph/pull/740 will no longer have benchmarks as
>> they are moved to a separate program. Correct ?
>>
>>> I will add a measurement of how the XOR and the 3 top algorithms scale
>>> with the number of cores and make the object-size configurable from the
>>> command line. Anything else ?
>>
>> It would be convenient to run this from a "workunit" ( i.e. a script in
>> ceph/qa/workunits/ ) so that it can later be run by teuthology
>> integration tests. That could be used to show regression.
>>
>>> Shall I add the possibility to test a single user specified
>>> configuration via command line arguments?
>>>
>> I would need to play with it to comment usefully.
>>
>> Cheers
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
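
For reference, the parameter restrictions quoted above from the jerasure PDF (m must be 2, k must be at most w, plus a per-code rule on w) could be sketched roughly as follows. This is only an illustration in Python, not the code of the Ceph wrapper class: the helper names `is_prime` and `check_raid6_params` and the technique string "blaum_roth" are assumptions made for the example ("liberation" and "liber8tion" are spelled as in the benchmark output).

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, sufficient for small word sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def check_raid6_params(technique: str, k: int, m: int, w: int) -> list:
    """Return the list of violated restrictions (empty when all hold).

    Hypothetical helper: it encodes only the rules quoted from the PDF
    for the minimal density RAID-6 codes.
    """
    errors = []
    # Rules shared by all three minimal density RAID-6 codes.
    if m != 2:
        errors.append("m=%d must be equal to 2" % m)
    if k > w:
        errors.append("k=%d must be less than or equal to w=%d" % (k, w))
    # Per-code rule on the word size w.
    if technique == "liberation":
        if not is_prime(w):
            errors.append("w=%d must be a prime number" % w)
    elif technique == "blaum_roth":  # assumed spelling of the technique name
        if not is_prime(w + 1):
            errors.append("w+1=%d must be a prime number" % (w + 1))
    elif technique == "liber8tion":
        if w != 8:
            errors.append("w=%d must equal 8" % w)
    else:
        raise ValueError("not a minimal density RAID-6 technique: " + technique)
    return errors


if __name__ == "__main__":
    # Parameters of the last benchmark line: liberation, k=24, m=2, w=29.
    print(check_raid6_params("liberation", 24, 2, 29))
    # A combination that breaks two rules at once.
    print(check_raid6_params("liberation", 5, 3, 8))
```

Returning every violated rule rather than stopping at the first one makes it easier to report a complete error message to whoever passed the parameters.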