Re: Upcoming Erasure coding

On 24/12/2013 10:22, Wido den Hollander wrote:
> On 12/24/2013 09:34 AM, Christian Balzer wrote:
>>
>> Hello Loic,
>>
>> On Tue, 24 Dec 2013 08:29:38 +0100 Loic Dachary wrote:
>>
>>>
>>>
>>> On 24/12/2013 05:42, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> from what has been written on the roadmap page and here, I assume that
>>>> the erasure coding option with Firefly will be (unsurprisingly) a pool
>>>> option.
>>>
>>> Hi Christian,
>>>
>>> You are correct. It is set when a pool of type "erasure" is created for
>>> instance :
>>>
>>>     ceph osd pool create poolname 12 12 erasure
>>>
>>> creates an erasure coded pool "poolname" with 12 PGs, using the default
>>> erasure code plugin (jerasure) with parameters K=6, M=2, meaning each
>>> object is spread over 6+2 OSDs and you can sustain the loss of two OSDs.
>> Thanks for that info.
>> I'm sure it will not use OSDs on the same server(s) if possible.
>> Will it attempt to distribute those 8 OSDs amongst failure domains, as
>> in, put them on 8 servers if those are available, use different racks if
>> they are available, etc.?
>>
>>> It can be changed with
>>>
>>>     ceph osd pool create poolname 12 12 erasure erasure-code-k=2 erasure-code-m=1
>>>
>>> which is the equivalent of having 2 replicas using 1.5 times the space
>>> instead of 2 times.
>>>
>> Neat. ^.^
>>
> 
> Don't get your hopes up, let me explain that below.
> 
>>>>
>>>> Given the nature of this beast I doubt that it can just be switched on
>>>> with a live pool, right?
>>>
>>> Yes.
>>>>
>>>> If so, what are the thoughts/plans to allow for a seamless and
>>>> transparent migration, other than a "deploy more hardware, create a
>>>> new pool, migrate everything by hand (with potential service
>>>> interruptions)" approach?
>>>>
>>>
>>> One possibility is to use tiering. An erasure code pool is created and
>>> set to receive objects demoted from the replica pool when they have not
>>> been used in a long time. If the object is accessed from the replica
>>> pool, it is first promoted back to it and this is transparent to the
>>> user ( modulo the delay of promoting it when accessed again ).
>>>
>> Ah, but that sounds a lot like my proposal, without the benefit of being
>> able to recycle (reconfigure) your old pool/hardware in the end.
>>
>> Let's assume a Ceph cluster with OSDs that are already quite full and maybe
>> hundreds of VMs using RBD images on it.
>> Migrating your way wouldn't really improve things (storage density) much,
>> while my way lets you adjust the pool name for each VM as you migrate
>> them, which won't be a live migration either.
>>
> 
> IIRC erasure coding doesn't work well with RBD, if it even works at all, because you can't update an object in place: you have to completely rewrite the whole object.
> 
> So erasure coding works great with the RADOS Gateway, but it doesn't with RBD or CephFS.
> 
> When using erasure coding you should also be aware that recovery traffic can be 10x the traffic you would see with a replicated pool.
> 
> Wido
> 
> P.S.: Loic, please correct me if I'm wrong :)
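The recovery overhead Wido mentions comes from read amplification: with replication, rebuilding X bytes moves roughly X bytes over the network, while with erasure coding each lost chunk is reconstructed from k surviving chunks, so rebuilding X bytes of chunks moves roughly k times X bytes. A back-of-envelope sketch (k=6 from the default profile above; the 10x figure is Wido's rough estimate, not derived here):

```shell
# Network traffic to rebuild the contents of one failed OSD, relative to
# the amount of data being rebuilt.
# replication: ~1x (copy one surviving replica)
# erasure k=6: ~6x (read k surviving chunks to reconstruct each lost chunk)
awk 'BEGIN { k=6; printf "replication: 1x\nerasure (k=%d): %dx\n", k, k }'
```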

You are correct: erasure coded pools will not support all operations at first. They will be suitable for the tiering scenario I described, and most probably for the majority of operations done by radosgw. But the lack of support for partial writes makes it impossible to use them as an RBD pool.

That raises an interesting question: what would be the benefit of an erasure coded RBD pool over a replica RBD pool with an erasure coded second tier? In other words, is there a compelling reason to want:

RBD => erasure coded pool

instead of

RBD => replica pool => erasure code pool

where objects are automatically moved to the erasure coded pool when they have not been used for more than X days.
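For reference, the second arrangement can be wired up with Ceph's cache tiering commands. A sketch, assuming Firefly-era syntax and hypothetical pool names "hot-pool" and "cold-pool" (exact flags, PG counts, and flush/evict tuning will differ per deployment):

```shell
# Hypothetical pool names; syntax as of the Firefly development series.
ceph osd pool create cold-pool 12 12 erasure    # erasure coded backing tier
ceph osd pool create hot-pool 128 128           # replicated cache tier
ceph osd tier add cold-pool hot-pool            # attach hot-pool as a tier of cold-pool
ceph osd tier cache-mode hot-pool writeback     # writes land in the cache first
ceph osd tier set-overlay cold-pool hot-pool    # route client I/O through the cache
```

Clients then talk to cold-pool as usual; cold objects age out of hot-pool and live erasure coded, while hot objects are promoted back transparently.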

Cheers

>> I guess people will use this feature only with new pools/deployments in
>> many cases then.
>>
>> Regards,
>>
>> Christian
>>
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
