Re: erasure coding (sorry)

Josh Durgin <josh.durgin@xxxxxxxxxxx> · Thu, 18 Apr 2013 14:08:02 -0700

On 04/18/2013 01:47 PM, Sage Weil wrote:
On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
Sorry to bring this up again; googling revealed that some people don't like the subject [anymore].

But I'm working on a new ~3 PB cluster for storage of immutable files.
It would hold either all cold data or mostly cold data: 150 MB average file size, 5 GB maximum (for now).
For this use case, my impression is that erasure coding would make a lot of sense
(though I'm not sure about the computational overhead of storing and loading objects). Outbound traffic would peak at 6 Gbps, but I can make it much lower and still keep a large cluster by taking away the small set of hot files.
Inbound traffic would be minimal.
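To put rough numbers on why erasure coding appeals for a ~3 PB cold-data cluster, here is the storage-overhead arithmetic. This is a sketch under stated assumptions: the k=8, m=3 layout is an arbitrary example (not something proposed in this thread), and the 3 PB is assumed to be user data rather than raw disk.

```python
def replicated_raw_pb(usable_pb: float, replicas: int) -> float:
    """Raw capacity needed when every object is stored `replicas` times."""
    return usable_pb * replicas


def erasure_coded_raw_pb(usable_pb: float, k: int, m: int) -> float:
    """Raw capacity needed when every object becomes k data + m coding chunks."""
    return usable_pb * (k + m) / k


usable_pb = 3.0  # ~3 PB, assumed to be user data rather than raw disk
print(f"3x replication:  {replicated_raw_pb(usable_pb, 3):.3f} PB raw")
print(f"k=8, m=3 coding: {erasure_coded_raw_pb(usable_pb, 8, 3):.3f} PB raw")

# A 150 MB average object would be split into 8 data chunks of this size.
print(f"chunk size: {150 / 8} MB")
```

This prints 9.000 PB for replication and 4.125 PB for the erasure code. With a maximum-distance-separable code such as Reed-Solomon, k=8, m=3 survives the loss of any three chunks of an object, versus two lost copies for 3x replication; the price is the encode/decode CPU cost asked about above.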

I know that the answer a while ago was "no plans to implement erasure coding"; has this changed?
If not, is anyone aware of a similar system that does support it? I found QFS, but that's meant for batch processing, has a single 'namenode', etc.

We would love to do it, but it is not a priority at the moment (things
like multi-site replication are in much higher demand).  That of course
doesn't prevent someone outside of Inktank from working on it :)

The main caveat is that it will be complicated.  For an initial
implementation, the full breadth of the rados API probably wouldn't be
supported for erasure/parity-encoded pools (things like rados classes and
the omap key/value API get tricky when you start talking about parity).
But for many (or even most) use cases, objects are just bytes, and those
restrictions are just fine.
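The point that parity fits objects treated as opaque bytes can be illustrated with the simplest possible scheme: split an object into k chunks and add one XOR parity chunk (RAID-5 style). This is a minimal sketch for illustration only; a real erasure-coded pool would more likely use a code such as Reed-Solomon with several coding chunks, and the function names here are hypothetical, not part of librados.

```python
from functools import reduce
from typing import List, Optional


def split_into_chunks(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-sized chunks, zero-padding the last one."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\x00")
    return [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Return k data chunks followed by a single XOR parity chunk."""
    chunks = split_into_chunks(data, k)
    return chunks + [reduce(xor_bytes, chunks)]


def decode(pieces: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the object when at most one piece (data or parity) is missing."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can recover only one lost piece")
    pieces = list(pieces)
    if missing:
        # The XOR of all surviving pieces equals the missing one.
        pieces[missing[0]] = reduce(xor_bytes, [p for p in pieces if p is not None])
    # Drop the parity piece, then strip the zero padding.
    return b"".join(pieces[:-1])[:original_len]


obj = b"immutable, mostly cold object"
pieces = encode(obj, k=4)
pieces[2] = None  # simulate losing the OSD that held data chunk 2
assert decode(pieces, len(obj)) == obj
```

Whole-object writes map cleanly onto this: encode once and store k + 1 pieces. A partial overwrite, by contrast, has to read the old chunk and old parity back to recompute the parity, which is one reason byte-range style operations get more expensive on a parity-encoded pool.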

I talked to some folks interested in doing a more limited form of this
yesterday. They started a blueprint [1]. One of their ideas was to have
erasure coding done by a separate process (or perhaps a thread). It would
erasure-code an object, use librados to store the erasure-encoded pieces
in a separate pool, and finally leave a marker in place of the original
object in the first pool.

When the osd detected this marker, it would proxy the request to the
erasure coding thread/process, which would service reads from the second
pool and could potentially move the data back to the first pool on
writes, in a tiering sort of scenario.

I might have misremembered some details, but I think it's an
interesting way to get many of the benefits of erasure coding with a 
relatively small amount of work compared to a fully native osd solution.
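The marker-and-proxy idea described above can be modeled in a few lines. This is a toy sketch under loud assumptions: two Python dicts stand in for the two rados pools (a real implementation would go through librados), the `Marker` record, the `<name>.<i>` piece naming, and the `ErasureTier` class are all hypothetical, and a single XOR parity piece stands in for a real erasure code. It only shows the control flow: `demote` stores pieces and leaves a marker, reads that hit a marker are proxied to the decoder, and writes move the object back to the first pool.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, List, Optional, Union


def _xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


@dataclass(frozen=True)
class Marker:
    """Hypothetical stub left in the first pool in place of the object."""
    length: int  # original object size, used to strip padding on read
    k: int       # number of data pieces stored in the second pool


class ErasureTier:
    """Toy model: dicts stand in for the first pool and the EC pool."""

    def __init__(self, k: int = 4) -> None:
        self.k = k
        self.primary: Dict[str, Union[bytes, Marker]] = {}  # first pool
        self.ec_pool: Dict[str, bytes] = {}  # second pool, holds the pieces

    def write(self, name: str, data: bytes) -> None:
        """Writes always land in the first pool; stale pieces are dropped."""
        entry = self.primary.get(name)
        if isinstance(entry, Marker):
            for i in range(entry.k + 1):
                self.ec_pool.pop(f"{name}.{i}", None)
        self.primary[name] = data

    def demote(self, name: str) -> None:
        """Background step: erasure-code an object and leave a marker."""
        data = self.primary[name]
        if isinstance(data, Marker):
            return  # already demoted
        chunk_len = -(-len(data) // self.k)  # ceiling division
        padded = data.ljust(chunk_len * self.k, b"\x00")
        pieces = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(self.k)]
        pieces.append(reduce(_xor, pieces))  # one XOR parity piece
        for i, piece in enumerate(pieces):
            self.ec_pool[f"{name}.{i}"] = piece
        self.primary[name] = Marker(length=len(data), k=self.k)

    def read(self, name: str) -> bytes:
        """Serve plain objects directly; proxy marked ones to the decoder."""
        entry = self.primary[name]
        if not isinstance(entry, Marker):
            return entry
        pieces: List[Optional[bytes]] = [
            self.ec_pool.get(f"{name}.{i}") for i in range(entry.k + 1)
        ]
        missing = [i for i, p in enumerate(pieces) if p is None]
        if len(missing) > 1:
            raise IOError(f"{name}: more than one piece lost")
        if missing:
            pieces[missing[0]] = reduce(_xor, [p for p in pieces if p is not None])
        return b"".join(pieces[:entry.k])[:entry.length]


tier = ErasureTier(k=4)
tier.write("photo-0001", b"cold immutable payload")
tier.demote("photo-0001")
assert isinstance(tier.primary["photo-0001"], Marker)
del tier.ec_pool["photo-0001.1"]  # lose one piece
assert tier.read("photo-0001") == b"cold immutable payload"
tier.write("photo-0001", b"rewritten")  # the write moves the object back
assert tier.read("photo-0001") == b"rewritten"
assert not tier.ec_pool
```

Because the first pool always holds either the data or the marker, clients never have to know about the second pool; the cost is an extra hop through the erasure coding process for every read of a demoted object.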

Josh

[1] http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend