Re: Upcoming Erasure coding

Loic Dachary <loic@xxxxxxxxxxx> · Tue, 24 Dec 2013 16:20:24 +0100

On 24/12/2013 09:34, Christian Balzer wrote:
> 
> Hello Loic,
> 
> On Tue, 24 Dec 2013 08:29:38 +0100 Loic Dachary wrote:
> 
>>
>>
>> On 24/12/2013 05:42, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> from what has been written on the roadmap page and here, I assume that
>>> the erasure coding option with Firefly will be (unsurprisingly) a pool
>>> option.
>>
>> Hi Christian,
>>
>> You are correct. It is set when a pool of type "erasure" is created,
>> for instance:
>>
>>    ceph osd pool create poolname 12 12 erasure
>>
>> creates an erasure pool "poolname" with 12 PGs and uses the default
>> erasure code plugin (jerasure) with parameters K=6, M=2, meaning each
>> object is spread over 6+2 OSDs and you can sustain the loss of two OSDs.
> Thanks for that info. 
> I'm sure it will not use OSDs on the same server(s) if possible. 
> Will it attempt to distribute those 8 OSDs amongst failure domains, as
> in, put them on 8 servers if those are available, use different racks if
> they are available, etc.?

The rules are defined by the CRUSH map, using almost the same method as for replicas. The only difference is that replicated pools use the "firstn" step and erasure-coded pools use the "indep" step. With firstn, the order of the OSDs may be shuffled when one has to be replaced, which causes no trouble for replicas. For instance:

firstn => [4,7,2]
2 dies
firstn => [5,4,7]

OSDs 4 and 7 shifted position. With "indep" it would be more like:

indep => [4,7,2]
2 dies
indep => [4,7,5]

For an erasure-coded pool, the position of an OSD in the list matters, because each position holds a specific chunk of the object, and changing it is costly. Using indep reduces the probability of such a change and therefore the amount of data that needs to be moved when an OSD dies.
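
To make the difference concrete, here is a small Python sketch that compares the two example mappings above position by position. It is not CRUSH, only a toy comparison of the lists shown in this mail, and the helper name changed_positions is made up for the illustration.

```python
def changed_positions(before, after):
    """Return the list indices whose OSD differs between two mappings."""
    return [i for i, (old, new) in enumerate(zip(before, after)) if old != new]


# Example mappings copied from the text above (OSD 2 dies in both cases).
before = [4, 7, 2]
firstn_after = [5, 4, 7]
indep_after = [4, 7, 5]

firstn_moved = changed_positions(before, firstn_after)
indep_moved = changed_positions(before, indep_after)

print("firstn: positions that changed:", firstn_moved)
print("indep:  positions that changed:", indep_moved)
```

In this example every position changes with firstn, while with indep only the slot of the dead OSD does, so only that one chunk would have to be rebuilt elsewhere.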

Cheers

> 
>> It can be changed with
>>
>>    ceph osd pool create poolname 12 12 erasure erasure-code-k=2 erasure-code-m=1
>>
>> which is the equivalent of having 2 replicas using 1.5 times the space
>> instead of 2 times.
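
The space figures quoted here follow from the ratio (K+M)/K. A minimal Python sketch of that arithmetic, assuming all chunks have the same size and ignoring any padding or metadata overhead (the function name space_overhead is made up for the illustration):

```python
def space_overhead(k, m):
    """Raw space used per unit of user data with k data and m coding chunks."""
    return (k + m) / k


# K=2, M=1: survives the loss of one OSD, like 2 replicas, but uses less space.
k2m1 = space_overhead(2, 1)
# K=6, M=2: the default parameters mentioned earlier in this thread.
k6m2 = space_overhead(6, 2)
# Plain replication with 2 copies, for comparison.
two_replicas = 2.0

print(f"K=2 M=1: {k2m1:.2f}x raw space")
print(f"K=6 M=2: {k6m2:.2f}x raw space")
print(f"2 replicas: {two_replicas:.2f}x raw space")
```
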
>>
> Neat. ^.^
> 
>>>
>>> Given the nature of this beast I doubt that it can just be switched on
>>> with a live pool, right?
>>
>> Yes, you are right: it cannot be switched on for an existing pool.
>>>
>>> If so, what are the thoughts/plans to allow for a seamless and
>>> transparent migration, other than a "deploy more hardware, create a
>>> new pool, migrate everything by hand (with potential service
>>> interruptions)" approach?
>>>
>>
>> One possibility is to use tiering. An erasure code pool is created and
>> set to receive objects demoted from the replica pool when they have not
>> been used in a long time. If the object is accessed from the replica
>> pool, it is first promoted back to it and this is transparent to the
>> user ( modulo the delay of promoting it when accessed again ).
>>
> Ah, but that sounds a lot like my proposal, w/o the benefit of being able
> to recycle (reconfigure) your old pool/hardware in the end.
> 
> Let's assume a Ceph cluster with OSDs being already quite full and maybe
> hundreds of VMs using RBD images on it. 
> Migration your way wouldn't really improve things (storage density) much.
> While in my way you get to fondle the pool name for each VM as you migrate
> them, which won't be a live migration either.
> 
> I guess people will use this feature only with new pools/deployments in
> many cases then.
> 
> Regards,
> 
> Christian
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com