Re: Upcoming Erasure coding

Wido den Hollander <wido@xxxxxxxx> · Tue, 24 Dec 2013 10:22:22 +0100

On 12/24/2013 09:34 AM, Christian Balzer wrote:

Hello Loic,

On Tue, 24 Dec 2013 08:29:38 +0100 Loic Dachary wrote:

On 24/12/2013 05:42, Christian Balzer wrote:

Hello,

from what has been written on the roadmap page and here, I assume that
the erasure coding option with Firefly will be (unsurprisingly) a pool
option.

Hi Christian,

You are correct. It is set when a pool of type "erasure" is created, for
instance:

    ceph osd pool create poolname 12 12 erasure

creates an erasure pool "poolname" with 12 PGs and uses the default
erasure code plugin (jerasure) with parameters K=6, M=2, meaning each
object is spread over 6+2 OSDs and you can sustain the loss of two OSDs.

Thanks for that info.
I'm sure it will not use OSDs on the same server(s) if possible.
Will it attempt to distribute those 8 OSDs amongst failure domains, as
in, put them on 8 servers if those are available, use different racks if
they are available, etc.?

It can be changed with

    ceph osd pool create poolname 12 12 erasure erasure-code-k=2 erasure-code-m=1

which is the equivalent of having 2 replicas using 1.5 times the space
instead of 2 times.
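
The space figures quoted above follow from the ratio (K+M)/K. A minimal sketch of that arithmetic in Python; the function name `raw_space_factor` is made up for illustration and is not part of Ceph:

```python
from fractions import Fraction


def raw_space_factor(k: int, m: int) -> Fraction:
    """Raw bytes stored per byte of user data for a K+M erasure code."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m must be non-negative")
    return Fraction(k + m, k)


# K=2, M=1: three chunks are stored for every two chunks of data.
print(float(raw_space_factor(2, 1)))
# K=6, M=2 (the defaults mentioned in this thread): eight chunks for every six.
print(float(raw_space_factor(6, 2)))
# Two full replicas behave like K=1, M=1 for the purpose of this ratio.
print(float(raw_space_factor(1, 1)))
```

Both K=2, M=1 and two replicas survive the loss of one OSD, which is why the two are compared directly; only the raw space used differs.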

Neat. ^.^

Don't get your hopes up, let me explain that below.

Given the nature of this beast I doubt that it can just be switched on
with a live pool, right?

Yes.

If so, what are the thoughts/plans to allow for a seamless and
transparent migration, other than a "deploy more hardware, create a
new pool, migrate everything by hand (with potential service
interruptions)" approach?

One possibility is to use tiering. An erasure code pool is created and
set to receive objects demoted from the replica pool when they have not
been used in a long time. If the object is accessed from the replica
pool, it is first promoted back to it and this is transparent to the
user ( modulo the delay of promoting it when accessed again ).

Ah, but that sounds a lot like my proposal, w/o the benefit of being able
to recycle (reconfigure) your old pool/hardware in the end.

Let's assume a Ceph cluster with OSDs that are already quite full and
maybe hundreds of VMs using RBD images on it.
Migration your way wouldn't really improve things (storage density) much.
While in my way you get to fondle the pool name for each VM as you migrate
them, which won't be a live migration either.

IIRC erasure coding doesn't work well with RBD, if it works at all,
because you can't update an object in place; you have to rewrite the
whole object.
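
A toy sketch may help show why partial updates are awkward. It uses a plain XOR parity code with K=2, M=1, which is an assumption made for illustration only: it is not the jerasure plugin and not how Ceph stores objects. The helper names `encode` and `recover` are hypothetical.

```python
from typing import List, Optional


def encode(data: bytes) -> List[bytes]:
    """Split data into 2 equal data chunks and append 1 XOR parity chunk."""
    if len(data) % 2:
        data += b"\0"  # pad to an even length so both halves match
    half = len(data) // 2
    d0, d1 = data[:half], data[half:]
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return [d0, d1, parity]


def recover(chunks: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing chunk (None) by XOR-ing the other two."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can only rebuild one lost chunk")
    result = list(chunks)
    if missing:
        a, b = (c for c in chunks if c is not None)
        result[missing[0]] = bytes(x ^ y for x, y in zip(a, b))
    return result


old = encode(b"hello world!")
new = encode(b"jello world!")  # only the first byte of the object differs

# Indices of the chunks that differ after the one-byte change.
changed = [i for i in range(3) if old[i] != new[i]]
print(changed)

# Losing the second data chunk is survivable: it is rebuilt from the rest.
print(recover([old[0], None, old[2]])[1])
```

Even a one-byte change touches the parity chunk as well as the data chunk, so a write cannot be confined to a single OSD the way it can with plain replicas.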

So Erasure encoding works great with the RADOS Gateway, but it doesn't 
with RBD or CephFS.

When using erasure coding you should also be aware that recovery traffic
can be 10x the traffic you would see with a replicated pool.
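
One possible reading of that figure, sketched in Python under a stated assumption: rebuilding a lost chunk needs K surviving chunks of the same size to be read, whereas a lost replica is simply copied from one surviving copy. The function names are hypothetical, and real recovery traffic depends on the plugin and the cluster.

```python
def ec_recovery_read_bytes(lost_bytes: int, k: int) -> int:
    """Bytes read to rebuild lost_bytes of chunk data.

    Assumes each rebuilt chunk requires reading K surviving chunks
    of the same size.
    """
    return lost_bytes * k


def replicated_recovery_read_bytes(lost_bytes: int) -> int:
    """Bytes read to re-create lost_bytes from one surviving replica."""
    return lost_bytes


lost = 1_000_000_000  # bytes held by the failed OSD (example value)
for k in (2, 6, 10):
    ratio = ec_recovery_read_bytes(lost, k) / replicated_recovery_read_bytes(lost)
    print(f"K={k}: {ratio:.0f}x the reads of a replicated pool")
```

Under this assumption the amplification equals K, so a 10x figure would correspond to K=10, and the K=6 default mentioned earlier would give about 6x.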

Wido

P.S.: Loic, please correct me if I'm wrong :)

I guess people will use this feature only with new pools/deployments in
many cases then.

Regards,

Christian

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on