Supposedly, on 2013-Apr-18, at 14.08 PDT(-0700), someone claiming to be Josh Durgin scribed:

> On 04/18/2013 01:47 PM, Sage Weil wrote:
>> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>>> sorry to bring this up again, googling revealed some people don't like the subject [anymore].
>>>
>>> but I'm working on a new +- 3PB cluster for storage of immutable files,
>>> and it would be either all cold data, or mostly cold. 150MB avg filesize, max size 5GB (for now).
>>> For this use case, my impression is erasure coding would make a lot of sense
>>> (though I'm not sure about the computational overhead of storing and loading objects..?
>>> outbound traffic would peak at 6 Gbps, but I can make it way less and still keep a large cluster
>>> by taking away the small set of hot files; inbound traffic would be minimal).
>>>
>>> I know that the answer a while ago was "no plans to implement erasure coding"; has this changed?
>>> if not, is anyone aware of a similar system that does support it? I found QFS, but that's meant for batch processing, has a single 'namenode', etc.
>>
>> We would love to do it, but it is not a priority at the moment (things
>> like multi-site replication are in much higher demand). That of course
>> doesn't prevent someone outside of Inktank from working on it :)
>>
>> The main caveat is that it will be complicated. For an initial
>> implementation, the full breadth of the rados API probably wouldn't be
>> supported for erasure/parity encoded pools (things like rados classes and
>> the omap key/value API get tricky when you start talking about parity).
>> But for many (or even most) use cases, objects are just bytes, and those
>> restrictions are just fine.
>
> I talked to some folks interested in doing a more limited form of this
> yesterday. They started a blueprint [1]. One of their ideas was to have
> erasure coding done by a separate process (or perhaps a thread). It would
> erasure-code an object, use librados to store the encoded pieces in a
> separate pool, and finally leave a marker in place of the original object
> in the first pool.
>
> When the OSD detected this marker, it would proxy the request to the
> erasure coding thread/process, which would service the request from the
> second pool for reads, and potentially have writes move the data back to
> the first pool in a tiering sort of scenario.
>
> I might have misremembered some details, but I think it's an
> interesting way to get many of the benefits of erasure coding with a
> relatively small amount of work compared to a fully native OSD solution.

Greetings, I'm one of those individuals :)

Our thinking is evolving on this, and I think we can keep most of the work out of the main machinery of Ceph and simply require a modified client that runs the "proxy" function against the "hot" pool OSDs. I'm even wondering whether it could be prototyped in FUSE. I will be writing this up in the next day or two in the blueprint below. Josh has the idea basically correct.

>
> Josh

Christopher

>
> [1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend

--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl
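
P.S. To make the flow above a little more concrete, here is a very rough sketch of the demote/read path as a standalone python-rados client. The pool names ("hot"/"cold"), the marker format, and the single XOR parity chunk (a stand-in for a real (k+m) erasure code such as Reed-Solomon) are just assumptions for illustration, not the design that will go into the blueprint.

#!/usr/bin/env python
# Sketch of the client-side "proxy" idea: erasure-code an object into a
# cold pool and leave a marker in the hot pool; reads check for the
# marker and reassemble the object from the cold pool when they find it.
# Pool names, marker format, and the XOR parity are illustrative only.

import rados

K = 4                      # data chunks per object (assumed)
MARKER = b"EC-MARKER\n"    # placeholder object contents (assumed format)


def split(data, k):
    """Split data into k equal-sized chunks, zero-padding the last one."""
    chunk_len = (len(data) + k - 1) // k
    data = data.ljust(chunk_len * k, b"\0")
    return [data[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]


def xor_parity(chunks):
    """One parity chunk: byte-wise XOR of the data chunks (stand-in for
    a real erasure code; it can rebuild at most one lost chunk)."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(bytearray(chunk)):
            parity[i] ^= byte
    return bytes(parity)


def demote(hot, cold, name):
    """Encode an object from the hot pool into chunks in the cold pool,
    then leave a small marker in its place."""
    size, _ = hot.stat(name)
    data = hot.read(name, size, 0)
    chunks = split(data, K)
    for i, chunk in enumerate(chunks):
        cold.write_full("%s.%d" % (name, i), chunk)
    cold.write_full("%s.parity" % name, xor_parity(chunks))
    # The marker records the original size so reads can strip the padding.
    hot.write_full(name, MARKER + str(size).encode())


def proxied_read(hot, cold, name):
    """Read an object, transparently reassembling it from the cold pool
    when the hot pool only holds a marker."""
    size, _ = hot.stat(name)
    head = hot.read(name, size, 0)
    if not head.startswith(MARKER):
        return head                      # still a plain replicated object
    orig_size = int(head[len(MARKER):].decode())
    data = b"".join(cold.read("%s.%d" % (name, i), orig_size, 0)
                    for i in range(K))
    return data[:orig_size]


if __name__ == "__main__":
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    hot = cluster.open_ioctx("hot")      # assumed replicated pool
    cold = cluster.open_ioctx("cold")    # assumed pool holding the chunks
    demote(hot, cold, "some-object")     # hypothetical object name
    print(len(proxied_read(hot, cold, "some-object")))
    hot.close()
    cold.close()
    cluster.shutdown()

A real version would of course use a proper erasure code and actually rebuild missing chunks from parity, and the read path would live in the modified client (or the OSD, in Josh's variant) rather than a throwaway script, but the hot-pool marker plus cold-pool chunks is the whole trick.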