Hi Loic,

I will run a benchmark with the changed code tomorrow. I actually had to insert some of my realtime benchmark macros into your Jerasure code to see the different time fractions between the buffer preparation and encoding steps, but for your QA suite it is probably enough to get a total value after your fix. I will send you a program sampling the performance at different buffer sizes and encoding types.

I changed my code to use vector operations (128-bit XORs) and it gives another 10% gain.

I also want to try out whether it makes sense to do the CRC32C computation in-line in the encoding step, and compare it with the two-step procedure of first encoding all blocks, then computing CRC32C on all blocks.

Cheers Andreas.

________________________________________
From: Loic Dachary [loic@xxxxxxxxxxx]
Sent: 21 September 2013 17:11
To: Andreas Joachim Peters
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

It's probably too soon to be smart about reducing the number of copies, but you're right: this copy is not necessary. The following pull request gets rid of it:

https://github.com/ceph/ceph/pull/615

Cheers

On 20/09/2013 18:49, Loic Dachary wrote:
> Hi,
>
> This is a first attempt at avoiding unnecessary copy:
>
> https://github.com/dachary/ceph/blob/03445a5926cd073c11cd8693fb110729e40f35fa/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L66
>
> I'm not sure how it could be made more readable / terse with bufferlist iterators. Any kind of hint would be welcome :-)
>
> Cheers
>
> On 20/09/2013 17:36, Sage Weil wrote:
>> On Fri, 20 Sep 2013, Loic Dachary wrote:
>>> Hi Andreas,
>>>
>>> Great work on these benchmarks! It's definitely an incentive to improve as much as possible. Could you push / send the scripts and sequence of operations you've used? I'll reproduce this locally while getting rid of the extra copy.
>>> It would be useful to capture that into a script that can be conveniently run from the teuthology integration tests to check against performance regressions.
>>>
>>> Regarding the 3P implementation, in my opinion it would be very valuable for some people who prefer low CPU consumption. And I'm eager to see more than one plugin in the erasure code plugin directory ;-)
>>
>> One way to approach this might be to make a bufferlist 'multi-iterator'
>> that you give bufferlist::iterators and that gives you back a pair of
>> pointers and a length for each contiguous segment. This would capture the
>> annoying iterator details and let the user focus on processing chunks that
>> are as large as possible.
>>
>> sage
>>
>>
>>
>
>>> Cheers
>>>
>>> On 20/09/2013 13:35, Andreas Joachim Peters wrote:
>>>> Hi Loic,
>>>>
>>>> I now have some benchmarks on a Xeon 2.27 GHz 4-core with gcc 4.4 (-O2) for ENCODING based on the CEPH Jerasure port.
>>>> I measured objects from 128k to 512 MB with random contents (if you encode 1 GB objects you see slowdowns due to caching inefficiencies ...), otherwise results are stable for the given object sizes.
>>>>
>>>> I quote only the benchmarks for ErasureCodeJerasureReedSolomonRAID6 (3,2), since the others are significantly slower (2-3x), and for my 3P(3,2,1) implementation, which provides the same redundancy level as RS-RAID6 [3,2] (double disk failure) but uses more space (66% vs 100% overhead).
>>>>
>>>> The effect of out.c_str() is significant (it contributes a factor 2 slowdown for the best jerasure algorithm for [3,2]).
>>>>
>>>> Averaged results for object size 4 MB:
>>>>
>>>> 1) Erasure CRS [3,2] - 2.6 ms buffer preparation (out.c_str()) - 2.4 ms encoding => ~780 MB/s
>>>> 2) 3P [3,2,1] - 0.005 ms buffer preparation (3P adjusts the padding in the algorithm) - 0.87 ms encoding => ~4.4 GB/s
>>>>
>>>> I think it pays off to avoid the copy in the encoding if it does not matter for the buffer handling upstream and pad only the last chunk.
>>>>
>>>> The last thing I tested is how performance scales with the number of cores, running 4 tests in parallel:
>>>>
>>>> Jerasure (3,2) limits at ~2.0 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>> 3P(3,2,1) limits at ~8 GB/s for a 4-core CPU (Xeon 2.27 GHz).
>>>>
>>>> I also implemented the decoding for 3P, but haven't yet tested all reconstruction cases. There is probably room for improvement using AVX support for XOR operations in both implementations.
>>>>
>>>> Before I invest more time, do you think it is useful to have this fast 3P algorithm for double disk failures with 100% space overhead? Because I believe that people will always optimize for space and would rather use something like (10,2) even if the performance degrades and CPU consumption goes up?!? Let me know, no problem in any case!
>>>>
>>>> Finally I tested some combinations for ErasureCodeJerasureReedSolomonRAID6:
>>>>
>>>> (3,2) (4,2) (6,2) (8,2) (10,2) they all run around 780-800 MB/s
>>>>
>>>> Cheers Andreas.
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
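
The 'multi-iterator' idea from Sage's reply in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not Ceph code: it does not use the real bufferlist API, a vector of byte vectors stands in for a segmented buffer, and the names Segments, for_each_contiguous_pair, xor_into and flatten are hypothetical. The walker hides the segment bookkeeping and hands the callback two pointers plus a length for each run that is contiguous in both buffers, so the callback (here a plain byte-wise XOR) never copies data into one flat buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a segmented buffer (NOT Ceph's bufferlist):
// a sequence of independently allocated contiguous segments.
using Segments = std::vector<std::vector<std::uint8_t>>;

// Walk two segmented buffers in lockstep. For every run that is contiguous
// in BOTH buffers, call fn(dst_ptr, src_ptr, len) with len as large as
// possible. Stops when either buffer is exhausted; returns bytes visited.
template <typename Fn>
std::size_t for_each_contiguous_pair(Segments& dst, const Segments& src, Fn fn) {
  std::size_t di = 0, doff = 0;  // current segment index / offset in dst
  std::size_t si = 0, soff = 0;  // current segment index / offset in src
  std::size_t total = 0;
  while (di < dst.size() && si < src.size()) {
    // Skip segments that are fully consumed (this also skips empty ones).
    if (doff == dst[di].size()) { ++di; doff = 0; continue; }
    if (soff == src[si].size()) { ++si; soff = 0; continue; }
    const std::size_t len =
        std::min(dst[di].size() - doff, src[si].size() - soff);
    fn(dst[di].data() + doff, src[si].data() + soff, len);
    doff += len;
    soff += len;
    total += len;
  }
  return total;
}

// XOR len bytes of src into dst, in place. A plain byte loop is used here;
// a vectorized XOR could be swapped in without touching the walker.
inline void xor_into(std::uint8_t* dst, const std::uint8_t* src, std::size_t len) {
  assert(dst != nullptr && src != nullptr);
  for (std::size_t i = 0; i < len; ++i) dst[i] ^= src[i];
}

// Helper for inspection only: concatenate all segments into one vector.
inline std::vector<std::uint8_t> flatten(const Segments& s) {
  std::vector<std::uint8_t> out;
  for (const auto& seg : s) out.insert(out.end(), seg.begin(), seg.end());
  return out;
}
```

Calling for_each_contiguous_pair(parity, data, xor_into) accumulates data into parity segment by segment. Because the callback only ever sees contiguous runs, the per-run work can be replaced by a wider XOR or an encoder call without changing the iteration logic.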