Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:

> On Thu, 18 Apr 2013 16:09:52 -0500
> Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
>
>> On 04/18/2013 04:08 PM, Josh Durgin wrote:
>>> On 04/18/2013 01:47 PM, Sage Weil wrote:
>>>> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>>>>> sorry to bring this up again, googling revealed some people don't
>>>>> like the subject [anymore].
>>>>>
>>>>> but I'm working on a new +- 3PB cluster for storage of immutable files,
>>>>> and it would be either all cold data, or mostly cold. 150MB avg
>>>>> filesize, max size 5GB (for now).
>>>>> For this use case, my impression is erasure coding would make a lot
>>>>> of sense
>>>>> (though I'm not sure about the computational overhead on storing and
>>>>> loading objects..? outbound traffic would peak at 6 Gbps, but I can
>>>>> make it way less and still keep a large cluster, by taking away the
>>>>> small set of hot files.
>>>>> inbound traffic would be minimal)
>>>>>
>>>>> I know that the answer a while ago was "no plans to implement erasure
>>>>> coding"; has this changed?
>>>>> if not, is anyone aware of a similar system that does support it? I
>>>>> found QFS, but that's meant for batch processing, has a single
>>>>> 'namenode', etc.
>>>>
>>>> We would love to do it, but it is not a priority at the moment (things
>>>> like multi-site replication are in much higher demand). That of course
>>>> doesn't prevent someone outside of Inktank from working on it :)
>>>>
>>>> The main caveat is that it will be complicated. For an initial
>>>> implementation, the full breadth of the rados API probably wouldn't be
>>>> supported for erasure/parity encoded pools (things like rados classes and
>>>> the omap key/value API get tricky when you start talking about parity).
>>>> But for many (or even most) use cases, objects are just bytes, and those
>>>> restrictions are just fine.
>>>
>>> I talked to some folks interested in doing a more limited form of this
>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>> erasure coding done by a separate process (or thread perhaps). It would
>>> use erasure coding on an object and then use librados to store the
>>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>>> place of the original object in the first pool.
>>>
>>> When the osd detected this marker, it would proxy the request to the
>>> erasure coding thread/process, which would service the request on the
>>> second pool for reads, and potentially make writes move the data back to
>>> the first pool in a tiering sort of scenario.
>>>
>>> I might have misremembered some details, but I think it's an
>>> interesting way to get many of the benefits of erasure coding with a
>>> relatively small amount of work compared to a fully native osd solution.
>>>
>>> Josh
>>
>> Neat. :)
>>
>
> @Bryan: I did come across cleversafe. all the articles around it seemed promising,
> but unfortunately it seems everything related to the cleversafe open source project
> somehow vanished from the internet. (e.g. http://www.cleversafe.org/) quite weird...

Yeah - in a previous incarnation I looked at cleversafe to do something similar a few
years ago. It is odd that the cleversafe.org stuff did disappear. However, tahoe-lafs
also does encoding, and their package (zfec) [1] may be leverageable.
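[Editor's note: for anyone weighing zfec for this use case, below is a rough sketch of
the split/rebuild step. It assumes zfec's easyfec convenience wrapper behaves roughly as
its README describes (Encoder(k, m).encode(data) returning m equal-sized blocks,
Decoder(k, m).decode(blocks, blocknums, padlen) returning the original bytes); the K/M
choice and the padlen bookkeeping are illustrative only, not a definitive interface.]

    # Rough sketch only: split a byte string into M blocks, any K of which
    # can rebuild it. Signatures follow my reading of zfec's easyfec wrapper
    # and may need adjusting against the real package.
    from zfec import easyfec

    K, M = 10, 14                    # survive the loss of any M - K = 4 blocks

    def ec_split(data):
        blocks = easyfec.Encoder(K, M).encode(data)
        padlen = -len(data) % K      # easyfec pads the input up to a multiple of K
        return blocks, padlen

    def ec_join(blocks, blocknums, padlen):
        # blocks: any K surviving blocks; blocknums: their original indices (0..M-1)
        return easyfec.Decoder(K, M).decode(blocks, blocknums, padlen)

    if __name__ == "__main__":
        data = open("/etc/hosts", "rb").read()
        blocks, padlen = ec_split(data)
        survivors, nums = blocks[4:], list(range(4, M))   # pretend blocks 0-3 were lost
        assert ec_join(survivors, nums, padlen) == data

With K=10/M=14 the raw-space overhead is 1.4x rather than the 2-3x of straight
replication, which is the whole draw for a mostly-cold multi-PB cluster like the one
described above.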
> @Sage: interesting. I thought it would be relatively simple if one assumes
> the restriction of immutable files. I'm not familiar with those ceph specifics
> you're mentioning.
> When building an erasure codes-based system, maybe there are ways to reuse existing
> ceph code and/or allow some integration with replication-based objects, without
> aiming for full integration or full support of the rados api, based on some tradeoffs.

I think this might sit UNDER the rados api. I would certainly want to leverage CRUSH
to place the shards, however (great tool, no reason to re-invent the wheel).

> @Josh, that sounds like an interesting approach. Too bad that page doesn't contain
> any information yet :)

Give me time :) - openstack has kept me a bit busy… May also be a factor of
"design at keyboard" :)

> Dieter

Christopher

[1] https://tahoe-lafs.org/trac/zfec

--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl
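[Editor's note: purely to make the marker-plus-second-pool idea Josh describes above a
bit more concrete, here is a minimal sketch of the store path using the Python rados
bindings: each erasure-coded piece lands as its own object in a second pool, and a small
marker (here an empty object carrying xattrs) replaces the original. The pool names,
shard naming scheme, and xattr keys are invented for illustration; the read/proxy path
and any write-back tiering are left out.]

    # Illustrative only: pool names, the "<name>.<i>" shard naming, and the
    # "ec.*" xattr keys are made up; error handling is omitted.
    import rados

    PRIMARY_POOL = "data"       # where clients normally read/write
    SHARD_POOL = "data-ec"      # where the erasure-coded pieces would live

    def cold_store(cluster, name, blocks, padlen):
        """Write each block as its own object in SHARD_POOL, then replace the
        original object in PRIMARY_POOL with an empty marker plus metadata."""
        shards = cluster.open_ioctx(SHARD_POOL)
        primary = cluster.open_ioctx(PRIMARY_POOL)
        try:
            for i, block in enumerate(blocks):
                shards.write_full("%s.%d" % (name, i), block)
            primary.write_full(name, b"")                      # the marker
            primary.set_xattr(name, "ec.blocks", str(len(blocks)).encode())
            primary.set_xattr(name, "ec.padlen", str(padlen).encode())
        finally:
            shards.close()
            primary.close()

    if __name__ == "__main__":
        cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
        cluster.connect()
        try:
            # blocks/padlen as produced by the zfec sketch earlier in the thread;
            # dummy data here just to keep the example self-contained
            blocks, padlen = [b"\0" * 64] * 14, 0
            cold_store(cluster, "some-cold-object", blocks, padlen)
        finally:
            cluster.shutdown()

The read side would do the inverse: notice the marker, fetch any K of the shard objects,
and hand them to the decoder - the part the proposed proxy thread/process would own.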