Re: Upcoming Erasure coding

Hello,

On Tue, 24 Dec 2013 16:33:49 +0100 Loic Dachary wrote:
> 
> 
> On 24/12/2013 10:22, Wido den Hollander wrote:
> > On 12/24/2013 09:34 AM, Christian Balzer wrote:
> >>
> >> Hello Loic,
> >>
> >> On Tue, 24 Dec 2013 08:29:38 +0100 Loic Dachary wrote:
> >>
> >>>
> >>>
> >>> On 24/12/2013 05:42, Christian Balzer wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> from what has been written on the roadmap page and here, I assume
> >>>> that the erasure coding option with Firefly will be
> >>>> (unsurprisingly) a pool option.
> >>>
> >>> Hi Christian,
> >>>
> >>> You are correct. It is set when a pool of type "erasure" is created,
> >>> for instance:
> >>>
> >>>     ceph osd pool create poolname 12 12 erasure
> >>>
> >>> creates an erasure pool "poolname" with 12 PGs and uses the default
> >>> erasure code plugin (jerasure) with parameters K=6, M=2, meaning
> >>> each object is spread over 6+2 OSDs and you can sustain the loss of
> >>> two OSDs.
> >> Thanks for that info.
> >> I presume it will avoid using OSDs on the same server(s) where possible.
> >> Will it attempt to distribute those 8 OSDs amongst failure
> >> domains, as in, put them on 8 servers if those are available, use
> >> different racks if they are available, etc.?
> >>
> >>> It can be changed with
> >>>
> >>>     ceph osd pool create poolname 12 12 erasure erasure-code-k=2
> >>> erasure-code-m=1
> >>>
> >>> which is the equivalent of having 2 replicas using 1.5 times the
> >>> space instead of 2 times.
> >>>
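
Doing the space math for myself (back-of-the-envelope arithmetic only, not
ceph commands, and assuming raw usage is simply (k+m)/k):

    # raw space used per logical byte = (k + m) / k
    echo "k=6 m=2: $(echo 'scale=2; (6+2)/6' | bc)x raw space, survives 2 lost OSDs"
    echo "k=2 m=1: $(echo 'scale=2; (2+1)/2' | bc)x raw space, survives 1 lost OSD"
    # versus 2x / 3x raw space for a replicated pool with size 2 / 3
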
> >> Neat. ^.^
> >>
> > 
> > Don't get your hopes up, let me explain that below.
> > 
> >>>>
> >>>> Given the nature of this beast I doubt that it can just be switched
> >>>> on with a live pool, right?
> >>>
> >>> Yes.
> >>>>
> >>>> If so, what are the thoughts/plans to allow for a seamless and
> >>>> transparent migration, other than a "deploy more hardware, create a
> >>>> new pool, migrate everything by hand (with potential service
> >>>> interruptions)" approach?
> >>>>
> >>>
> >>> One possibility is to use tiering. An erasure code pool is created
> >>> and set to receive objects demoted from the replica pool when they
> >>> have not been used in a long time. If the object is accessed from
> >>> the replica pool, it is first promoted back to it and this is
> >>> transparent to the user (modulo the delay of promoting it when
> >>> accessed again).
> >>>
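
If I understand the tiering work right, the setup would look roughly like
this (pool names are mine, and the exact commands are my guess at what the
tiering branch will ship, so take it as a sketch only):

    ceph osd pool create cold-ec 12 12 erasure     # erasure coded base tier
    ceph osd pool create hot-replica 128           # replicated hot tier
    ceph osd tier add cold-ec hot-replica          # replica pool in front of EC pool
    ceph osd tier cache-mode hot-replica writeback
    ceph osd tier set-overlay cold-ec hot-replica  # clients address cold-ec, I/O
                                                   # is served from hot-replica
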
> >> Ah, but that sounds a lot like my proposal, w/o the benefit of being
> >> able to recycle (reconfigure) your old pool/hardware in the end.
> >>
> >> Let's assume a Ceph cluster with OSDs that are already quite full and
> >> maybe hundreds of VMs using RBD images on it.
> >> Migrating your way wouldn't really improve things (storage density)
> >> much, while with my way you get to fondle the pool name for each VM
> >> as you migrate them, which won't be a live migration either.
> >>
> > 
> > IIRC erasure coding doesn't work well with RBD, if it even works at
> > all, due to the fact that you can't update an object; you have to
> > completely rewrite the whole object.
> > 
Ah yes, of course...
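
To put a number on why that hurts RBD (assuming the default 4 MB RBD object
size; a rough illustration only):

    # a small guest write landing inside one RADOS object
    write_kb=4                  # 4 KB write from the VM
    object_kb=$((4 * 1024))     # 4 MB RBD object (default size, my assumption)
    # replica pool: roughly the 4 KB is written once per replica
    # erasure pool without partial writes: the whole object gets re-encoded
    echo "worst case ~$((object_kb / write_kb))x write amplification"   # 1024x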

> > So erasure coding works great with the RADOS Gateway, but it doesn't
> > with RBD or CephFS.
> > 
> > When using erasure coding you should also be aware that recovery
> > traffic can be 10x the traffic you would see with a replicated pool.
> > 
> > Wido
> > 
> > P.S.: Loic, please correct me if I'm wrong :)
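
And to make Wido's recovery point plausible to myself: with k=6, every chunk
that has to be rebuilt needs reads from 6 surviving OSDs (my understanding,
so correct me if this is off):

    # data read to rebuild 1 TB lost on a failed OSD (illustration only)
    lost_tb=1
    k=6
    echo "replica pool:  read ~${lost_tb} TB (one surviving copy)"
    echo "erasure k=${k}: read ~$((lost_tb * k)) TB (chunks from ${k} surviving OSDs)"
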
> 
> You are correct: erasure code pools will not support all operations at
> first. They will be suitable for use with the tiering scenario I
> described. And most probably with the majority of operations done by
> radosgw. But the lack of support for partial writes makes it impossible
> to use it as an RBD pool.
> 
Nods, w/o partial writes that would be very ugly indeed.

> That raises an interesting question: what would be the benefit of
> having an erasure coded RBD pool instead of a replica RBD pool with an
> erasure coded second tier ? In other words, is there a compelling reason
> to want:
> 
> RBD => erasure coded pool
> 
> instead of
> 
> RBD => replica pool => erasure code pool
> 
> where the objects are automatically moved to the erasure code pool if
> they are not used for more than X days. 
> 
Now that I know about this limitation, your suggestion of a tiered
erasure code pool of course makes all the sense in the world.
I would assume that enough demoting and promoting would be going on to
have a measurable performance impact, but of course that depends on the
block allocation strategies of the VM (filesystem) in question. One
guesses BTRFS would be the worst offender here with its CoW behaviour.
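
Continuing the tiering sketch from above, I imagine the "not used for more
than X days" policy would map onto the cache tier knobs roughly like this
(setting names are my best reading of the tiering work, so illustrative only):

    # don't flush dirty objects younger than a day, don't evict anything
    # younger than a week; actual demotion is driven by the fullness targets
    ceph osd pool set hot-replica cache_min_flush_age 86400
    ceph osd pool set hot-replica cache_min_evict_age 604800
    ceph osd pool set hot-replica target_max_bytes 1099511627776   # ~1 TB hot tier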

Thanks a lot for that info, however deflating of my hopes it was. ^o^

Christian

> Cheers
> 
> >> I guess people will use this feature only with new pools/deployments
> >> in many cases then.
> >>
> >> Regards,
> >>
> >> Christian
> >>
> > 
> > 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



