Re: CEPH Erasure Encoding + OSD Scalability

Hi Andreas,

On 14/09/2013 16:59, Andreas Joachim Peters wrote:
> Hi Loic,
>
> I finally ran/read the code of the erasure encoding.

Great !

> What I noticed is that in your implementation you always copy the data to encode once: you add a padding block to the bufferlist and then call "out.c_str()", which calls bufferlist::rebuild, allocates the full size of all chunks with new, and then copies the input data. Please correct me if I am wrong ... couldn't you just allocate the additional redundancy chunks and return bufferptrs pointing into the 'in' bufferlist?

I assume you're referring to

https://github.com/ceph/ceph/blob/e9e53912503259326a7877bda31c4360302c2c34/src/osd/ErasureCodePluginJerasure/ErasureCodeJerasure.cc#L78

and indeed it implies an extra copy because of the padding. The optimization you're suggesting, if I get it right, would only require an extra copy of the last data chunk. The code would extract the char * from in with c_str() before padding (hence no rebuild), feed data[] with pointers into that buffer for every chunk but the last one, and allocate + copy + zero-pad the last chunk if it needs padding. The allocated area can be made big enough to accommodate the coding chunks. That would reduce the copying to the minimum. It also means that the returned bufferlist has to properly reference the input buffer, and that the caller must not modify the content of *in* after calling encode, otherwise it may have a side effect on the *encoded* result because they really share the same memory.
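
To make the idea concrete, here is a rough sketch (not the plugin code: the function signature and names are made up, it is untested, and it only relies on the existing bufferlist / bufferptr primitives):

#include "include/buffer.h"
#include <vector>

// Sketch of the zero-copy encode idea: data chunks reference the caller's
// buffer, only the last chunk is copied + zero-padded, and fresh buffers are
// allocated only for the coding chunks.
void encode_sketch(unsigned k, unsigned m, unsigned chunk_size,
                   bufferlist &in,              // must not be modified by the caller afterwards
                   std::vector<bufferlist> *chunks)
{
  const char *raw = in.c_str();      // coalesces once if needed, no padding copy
  unsigned in_len = in.length();

  for (unsigned i = 0; i < k; i++) {
    unsigned off = i * chunk_size;
    bufferlist chunk;
    if (off + chunk_size <= in_len) {
      chunk.substr_of(in, off, chunk_size);          // shares in's memory, no copy
    } else {
      bufferptr padded = buffer::create(chunk_size); // only the last chunk is copied
      padded.zero();
      if (off < in_len)
        padded.copy_in(0, in_len - off, raw + off);
      chunk.append(padded);
    }
    chunks->push_back(chunk);
  }
  for (unsigned i = 0; i < m; i++) {                 // coding chunks, to be filled by the codec
    bufferlist coding;
    coding.append(buffer::create(chunk_size));
    chunks->push_back(coding);
  }
  // jerasure would then be fed data[i] = (*chunks)[i].c_str() and
  // coding[j] = (*chunks)[k + j].c_str() without further copies.
}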

>
> Another question: why is 'in' in the encode function a list of buffers? Maybe this is the natural interface object in CEPH IO, I don't know ... the implementation would concatenate them and produce chunks for the merged block ...

You guessed right. Initially I had the encode function accept a bufferptr instead of a bufferlist, but bufferptr is not the preferred API data structure to convey a buffer.

> I will try to run a benchmark to see if the additional copy has a visible impact on performance; in any case it looks unnecessary.

Indeed there should be a way to avoid this extra copy.

> I am also more or less finished with the 3 + 3 (XOR) implementation ... I will also run a benchmark with it and let you know the result.

Cool !

> Last question, a little bit out of context: I did some benchmarking of librados latency. I see a latency of 1 ms to read/stat objects of very small size (5 bytes in this case). If we (re-)write such an object with a 3-fold replica configuration on a 10 GBit setup with 1000 disks I see a latency of 80 ms per object; if I append, it is 75 ms. If we run a massive test with the benchmark tool, the total object creation rate saturates at 20 kHz, which is OK, but the individual latency is higher than I would expect.
>
> Is there something in the OSD delaying communication? I don't believe it takes 80 ms to sync 5 bytes to a hard disk on an idle pool when the network round-trip time is far less than a millisecond.

I suggest you start a separate thread for this; chances are your question will not be noticed otherwise.

Cheers

> Cheers, Andreas.
>
> ________________________________________
> From: Loic Dachary [loic@xxxxxxxxxxx]
> Sent: 25 August 2013 13:49
> To: Andreas Joachim Peters
> Cc: Ceph Development
> Subject: Re: CEPH Erasure Encoding + OSD Scalability
>
> On 24/08/2013 21:41, Loic Dachary wrote:
>>
>>
>> On 24/08/2013 15:30, Andreas-Joachim Peters wrote:
>>> Hi Loic,
>>> I will start to review
>>
>> Cool :-)
>>
>>> ... maybe you can briefly explain a few things beforehand:
>>>
>>> 1) the buffer management ... who allocates the output buffers for the encoding? Are they always malloc'ed or is there some generic CEPH buffer recycling functionality?
>>
>> The output bufferlist is allocated by the plugin and it is the responsibility of the caller to deallocate it. I will write doxygen documentation
>> https://github.com/ceph/ceph/pull/518/files#r5966727
>
> Hi Andreas,
>
> The documentation added today in
> https://github.com/dachary/ceph/blob/wip-5878/src/osd/ErasureCodeInterface.h
> will hopefully clarify things. It requires an understanding of https://github.com/ceph/ceph/blob/master/src/include/buffer.h
>
> Let me know if you have more questions.
>
>>
>>> 2) do you support retrieving partial blocks or only the full 4M block? Are decoded blocks cached for some time?
>>
>> This is outside of the scope of https://github.com/ceph/ceph/pull/518/files : the plugin can handle encode/decode of 128 bytes or 4M in the same way.
>>
>>> 3) do you want to tune the 2+1 basic code for performance or is it just proof of concept? If yes, then you should move over the encoding buffer with *ptr++ and use the largest available vector size for the used platform to perform XOR operations. I will send you an improved version of the loop if you want ...
>>
>> The 2+1 is just a proof of concept. I completed a first implementation of the jerasure plugin https://github.com/ceph/ceph/pull/538/files which is meant to be used as a default.
>>
>>> 4) if you are interested I can also write code for a (3+3) plugin which tolerates 2-3 lost stripes (one has to add P3=A^B^C to my [3,2] proposal). At least it reduces the overhead of 3-fold replication from 300% => 200% ...
>>
>> It would be great to have such a plugin :-)
>>
>>> 5) will you add CRC32C checksums to the blocks (4M block or 4k pages?) or will this be a CEPH generic functionality for any kind of block?
>>
>> The idea is to have a CRC32C checksum per object / shard (as described in http://ceph.com/docs/master/dev/osd_internals/erasure_coding/#glossary): it is the only way for scrubbing to figure out whether a given shard is corrupted, and it is not too expensive since erasure coded pools only support full writes + append, not partial writes that would require re-calculating the CRC32C for the whole shard each time one byte is changed.
>>
>>> 6) do you put a kind of header or magic into the encoded blocks to verify that the input blocks actually correspond to each other?
>>
>> This has not been decided yet but I think it would be sensible to use the object attributes ( either xattr or leveldb ) to store meta information instead of creating a file format specifically designed for erasure code.
>>
>> Cheers
>>
>>> Cheers Andreas.
>>>
>>>
>>>
>>>
>>> On Fri, Aug 23, 2013 at 1:03 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>
>>>
>>>     On 22/08/2013 23:42, Andreas-Joachim Peters wrote:
>>>     > Hi Loic,
>>>     > sorry for the late reply, I was on vacation ... you are right, I made a simple logical mistake since I assumed you only lose data stripes but never parity stripes, which is a very wrong assumption.
>>>     >
>>>     > So for testing you probably could just implement (2+1) and then move to jerasure or dual parity (4+2) where you build horizontal and diagonal parities.
>>>     >
>>>
>>>     Hi Andreas,
>>>
>>>     That's what I did :-) It would be great if you could review the proposed implementation at https://github.com/ceph/ceph/pull/518/files . I'll keep working on https://github.com/dachary/ceph/commit/83845a66ae1cba63c122c0ef7658b97b474c2bd2 tomorrow to create the jerasure plugin but it's not ready for review yet.
>>>
>>>     Cheers
>>>
>>>     > Cheers Andreas.
>>>     >
>>>     >
>>>     >
>>>     >
>>>     >
>>>     > On Mon, Aug 19, 2013 at 12:35 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>     >
>>>     >     Hi Andreas,
>>>     >
>>>     >     Trying to write minimal code as you suggested, for an example plugin. My first attempt at writing an erasure coding function. I don't get how you can rebuild P1 + A from P2 + B + C. I must be missing something obvious :-)
>>>     >
>>>     >     Cheers
>>>     >
>>>     >     On 07/07/2013 23:04, Andreas Joachim Peters wrote:
>>>     >     >
>>>     >     > Hi Loic,
>>>     >     > I don't think there is a better generic implementation. I just made a benchmark ... the Jerasure library with algorithm 'cauchy_good' gives 1.1 GB/s (Xeon 2.27 GHz) on a single core for a 4+2 encoding with w=32. Just to give a feeling, if you do 10+4 it is 300 MB/s ... there is a specialized implementation in QFS (Hadoop in C++) for (M+3) ... out of curiosity I will benchmark it against Jerasure ...
>>>     >     >
>>>     >     > In any case I would do an optimized implementation for 3+2, which would probably be the most performant implementation with the same reliability as standard 3-fold replication in CEPH while using only 53% of the space.
>>>     >     >
>>>     >     > 3+2 is trivial since you encode (A,B,C) with only two parity operations
>>>     >     > P1 = A^B
>>>     >     > P2 = B^C
>>>     >     > and reconstruct with one or two parity operations:
>>>     >     > A = P1^B
>>>     >     > B = P1^A
>>>     >     > B = P2^C
>>>     >     > C = P2^B
>>>     >     > and so on.
>>>     >     >
>>>     >     > You can write this as a simple loop using advanced vector extensions on Intel (AVX). I can paste a benchmark tomorrow.
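
For reference, the loop could be as simple as the sketch below (equal-length byte buffers assumed; with -O2/-O3 an optimizing compiler will usually auto-vectorize it using the widest registers available):

#include <cstddef>
#include <cstdint>

// P = A ^ B over len bytes; compilers typically vectorize this loop
// (SSE2/AVX), which is what makes the simple XOR scheme so cheap.
static void xor_parity(const uint8_t *a, const uint8_t *b, uint8_t *p, size_t len)
{
  for (size_t i = 0; i < len; i++)
    p[i] = a[i] ^ b[i];
}

// 3+2 encoding as in the formulas above: P1 = A^B, P2 = B^C
static void encode_3_2(const uint8_t *A, const uint8_t *B, const uint8_t *C,
                       uint8_t *P1, uint8_t *P2, size_t len)
{
  xor_parity(A, B, P1, len);
  xor_parity(B, C, P2, len);
}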
>>>     >     >
>>>     >     > Considering the crc32c-intel code you added ... I would provide a function which computes a crc32c checksum and detects whether it can use SSE4.2, otherwise falling back to the standard algorithm, e.g. if you run in a virtual machine you need this fallback ...
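
Such a wrapper could look roughly like the sketch below (an illustration under assumptions: a recent GCC on x86, the file compiled with -msse4.2, and the runtime check keeping the hardware path off CPUs that lack the instruction):

#include <cstddef>
#include <cstdint>
#include <nmmintrin.h>   // _mm_crc32_u8

// Bitwise software fallback, reflected CRC32C (Castagnoli) polynomial.
static uint32_t crc32c_sw(uint32_t crc, const uint8_t *data, size_t len)
{
  crc = ~crc;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int k = 0; k < 8; k++)
      crc = (crc >> 1) ^ (0x82f63b78 & (0 - (crc & 1)));
  }
  return ~crc;
}

// Hardware CRC32C, one byte at a time for clarity (real code would use
// _mm_crc32_u64 on 8-byte words).
static uint32_t crc32c_sse42(uint32_t crc, const uint8_t *data, size_t len)
{
  crc = ~crc;
  for (size_t i = 0; i < len; i++)
    crc = _mm_crc32_u8(crc, data[i]);
  return ~crc;
}

uint32_t crc32c(uint32_t crc, const uint8_t *data, size_t len)
{
  // GCC builtin; a real implementation would cache the probe result.
  if (__builtin_cpu_supports("sse4.2"))
    return crc32c_sse42(crc, data, len);
  return crc32c_sw(crc, data, len);
}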
>>>     >     >
>>>     >     > Cheers Andreas.
>>>     >     > ________________________________________
>>>     >     > From: Loic Dachary [loic@xxxxxxxxxxx]
>>>     >     > Sent: 06 July 2013 22:47
>>>     >     > To: Andreas Joachim Peters
>>>     >     > Cc: ceph-devel@xxxxxxxxxxxxxxx
>>>     >     > Subject: Re: CEPH Erasure Encoding + OSD Scalability
>>>     >     >
>>>     >     > Hi Andreas,
>>>     >     >
>>>     >     > Since it looks like we're going to use jerasure-1.2, we will be able to try (C)RS using
>>>     >     >
>>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.c
>>>     >     > https://github.com/tsuraan/Jerasure/blob/master/src/cauchy.h
>>>     >     >
>>>     >     > Do you know of a better / faster implementation ? Is there a tradeoff between (C)RS and RS ?
>>>     >     >
>>>     >     > Cheers
>>>     >     >
>>>     >     > On 06/07/2013 15:43, Andreas-Joachim Peters wrote:
>>>     >     >> HI Loic,
>>>     >     >> (C)RS stands for the Cauchy Reed-Solomon codes which are based on pure parity operations, while the standard Reed-Solomon codes need more multiplications and are slower.
>>>     >     >>
>>>     >     >> Considering the checksumming ... for comparison, the CRC32 code from libz runs on an 8-core Xeon at ~730 MB/s for small block sizes while the SSE4.2 CRC32C checksum runs at ~2 GB/s.
>>>     >     >>
>>>     >     >> Cheers Andreas.
>>>     >     >>
>>>     >     >>
>>>     >     >>
>>>     >     >>
>>>     >     >> On Fri, Jul 5, 2013 at 11:23 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>     >     >>
>>>     >     >>     Hi Andreas,
>>>     >     >>
>>>     >     >>     On 04/07/2013 23:01, Andreas Joachim Peters wrote:
>>>     >     >>     > Hi Loic,
>>>     >     >>     > thanks for the responses!
>>>     >     >>     >
>>>     >     >>     > Maybe this is useful for your erasure code discussion:
>>>     >     >>     >
>>>     >     >>     > as an example, in our RS implementation we chunk a data block of e.g. 4M into 4 data chunks of 1M. Then we create 2 parity chunks.
>>>     >     >>     >
>>>     >     >>     > Data & parity chunks are split into 4k blocks and these 4k blocks get a CRC32C block checksum each (SSE4.2 CPU extension => MIT library or BTRFS). This creates 0.1% volume overhead (4 bytes per 4096 bytes) - nothing compared to the parity overhead ...
>>>     >     >>     >
>>>     >     >>     > You can now easily detect data corruption using the local checksums and avoid reading any parity information and doing (C)RS decoding if no corruption is detected. Moreover CRC32C computation is distributed over several (in this case 4) machines, while (C)RS decoding would run on a single machine where you assemble a block ... and CRC32C is faster than (C)RS decoding (with SSE4.2) ...
>>>     >     >>
>>>     >     >>     What does (C)RS mean ? (C)Reed-Solomon ?
>>>     >     >>
>>>     >     >>     > In our case we write this checksum information separately from the original data ... while in a block-based storage like CEPH it would probably be inlined in the data chunk.
>>>     >     >>     > If an OSD detects that it runs on BTRFS or ZFS, one could automatically disable the CRC32C code.
>>>     >     >>
>>>     >     >>     Nice. I did not know that was built-in :-)
>>>     >     >>     https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst#scrubbing
>>>     >     >>
>>>     >     >>     > (wouldn't CRC32C also be useful for normal CEPH block replication?)
>>>     >     >>
>>>     >     >>     I don't know the details of scrubbing but it seems CRC is already used by deep scrubbing
>>>     >     >>
>>>     >     >>     https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L2731
>>>     >     >>
>>>     >     >>     Cheers
>>>     >     >>
>>>     >     >>     > As far as I know, with the RS codec we use you can have missing stripes (data = 0) in the decoding process, but you cannot inject corrupted stripes into it, so the block checksumming is important.
>>>     >     >>     >
>>>     >     >>     > Cheers Andreas.
>>>     >     >>
>>>     >     >>     --
>>>     >     >>     Loïc Dachary, Artisan Logiciel Libre
>>>     >     >>     All that is necessary for the triumph of evil is that good people do nothing.
>>>     >     >>
>>>     >     >>
>>>     >     >
>>>     >     > --
>>>     >     > Loïc Dachary, Artisan Logiciel Libre
>>>     >     > All that is necessary for the triumph of evil is that good people do nothing.
>>>     >     >
>>>     >     > --
>>>     >     > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>     >     > the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>     >     > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>     >     >
>>>     >
>>>     >     --
>>>     >     Loïc Dachary, Artisan Logiciel Libre
>>>     >     All that is necessary for the triumph of evil is that good people do nothing.
>>>     >
>>>     >
>>>
>>>     --
>>>     Loïc Dachary, Artisan Logiciel Libre
>>>     All that is necessary for the triumph of evil is that good people do nothing.
>>>
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
