Re: erasure coding (sorry)

Supposedly, on 2013-Apr-18, at 14.26 PDT(-0700), someone claiming to be Sage Weil scribed:

> On Thu, 18 Apr 2013, Noah Watkins wrote:
>> On Apr 18, 2013, at 2:08 PM, Josh Durgin <josh.durgin@xxxxxxxxxxx> wrote:
>>
>>> I talked to some folks interested in doing a more limited form of this
>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>> erasure coding done by a separate process (or thread perhaps). It would
>>> use erasure coding on an object and then use librados to store the
>>> rasure-encoded pieces in a separate pool, and finally leave a marker in
>>> place of the original object in the first pool.
>>
>> This sounds at a high-level similar to work out of Microsoft:
>>
>> https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
>>
>> The basic idea is to replicate first, then erasure code in the background.
>
> FWIW, I think a useful (and generic) concept to add to rados would be a
> redirect symlink sort of thing that says "oh, this object is over there in
> that other pool", such that client requests will be transparently
> redirected or proxied.  This will enable generic tiering type operations,
> and probably simplify/enable migration without a lot of additional
> complexity on the client side.

More to come, but I'm starting to think of a union mount with a FUSE "re-directing" overlay.  Here's the quick idea.

On the "hot" pool, the OSDs would write to the host FS as usual.  However, that FS is actually a lightweight FUSE filesystem (at least for the prototype) that passes almost everything straight through to the underlying file system.  As the OSD hits a capacity high-water mark (HWM), a watcher (an asynchronous process) starts "evicting" objects from the OSD.  It does that by using a modified Ceph client that calls zfec and uses CRUSH to place the resulting shards in the "cool" pool.  Once those shards are committed, it replaces the object in the "hot" OSD with a special token.  This repeats until a low-water mark (LWM) is reached.

When the OSD gets a read request for such an object and the FUSE shim sees the token, it knows to do a modified client fetch from the "cool" pool instead.  It returns the resulting object to the original requester and (potentially) writes the object back into the "hot" OSD, replacing the token, if you want cache-like behavior.  If that pushes the OSD back over the HWM, some other object may in turn be evicted.
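To make that flow concrete, here is a rough Python sketch of what the watcher's evict path and the FUSE shim's read path might do, using the python-rados bindings and zfec.  The pool names, token format, k/m choice, and the zfec "easyfec" calls are my assumptions for illustration (easyfec signatures quoted from memory), not part of the proposal itself.

    # Hypothetical sketch: evict a "hot" object into erasure-coded shards in a
    # "cool" pool, and rebuild it when the FUSE shim sees the token.
    import json

    import rados                      # python-rados bindings
    from zfec import easyfec          # zfec's padding-aware interface (assumed API)

    K, M = 6, 9                       # any K of the M shards rebuild the object
    HOT_POOL, COOL_POOL = "hot", "cool"
    TOKEN_MAGIC = b"__EVICTED__"      # marker left in place of the original object


    def _shard_name(obj_name, i):
        # One RADOS object per shard; CRUSH scatters them across the cool pool.
        return "%s.shard.%d" % (obj_name, i)


    def evict_object(cluster, obj_name):
        """Erasure-code obj_name into the cool pool and leave a token behind."""
        hot = cluster.open_ioctx(HOT_POOL)
        cool = cluster.open_ioctx(COOL_POOL)
        try:
            size, _mtime = hot.stat(obj_name)
            data = hot.read(obj_name, length=size)

            # Assumed zfec behaviour: easyfec pads the data so it splits into
            # K primary blocks and returns all M shards.
            shards = easyfec.Encoder(K, M).encode(data)
            for i, shard in enumerate(shards):
                cool.write_full(_shard_name(obj_name, i), shard)

            # Replace the hot copy with a small token describing where it went.
            token = TOKEN_MAGIC + json.dumps(
                {"size": size, "k": K, "m": M, "pool": COOL_POOL}).encode()
            hot.write_full(obj_name, token)
        finally:
            hot.close()
            cool.close()


    def fetch_object(cluster, obj_name, promote=True):
        """Read path: if the hot object is a token, rebuild it from any K shards."""
        hot = cluster.open_ioctx(HOT_POOL)
        cool = cluster.open_ioctx(COOL_POOL)
        try:
            size, _mtime = hot.stat(obj_name)
            blob = hot.read(obj_name, length=size)
            if not blob.startswith(TOKEN_MAGIC):
                return blob                       # still resident in the hot pool

            meta = json.loads(blob[len(TOKEN_MAGIC):].decode())
            shards, sharenums = [], []
            for i in range(meta["m"]):
                try:
                    s_size, _ = cool.stat(_shard_name(obj_name, i))
                    shards.append(cool.read(_shard_name(obj_name, i), length=s_size))
                    sharenums.append(i)
                except rados.ObjectNotFound:
                    continue                      # tolerate lost shards
                if len(shards) == meta["k"]:
                    break
            if len(shards) < meta["k"]:
                raise IOError("too few shards to rebuild %s" % obj_name)

            # padlen = bytes easyfec appended so the data split evenly into K.
            padlen = len(shards[0]) * meta["k"] - meta["size"]
            data = easyfec.Decoder(meta["k"], meta["m"]).decode(
                shards, sharenums, padlen)

            if promote:
                hot.write_full(obj_name, data)    # cache-like promotion back to hot
            return data
        finally:
            hot.close()
            cool.close()

The watcher would call evict_object() on candidates until the LWM is reached, and the FUSE shim would route reads through fetch_object(), so the token stays invisible to the original requester.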

We would also need to modify the repair mechanism for deep scrub in the "cool" pool to account for repair being a reconstruction of an invalid shard from the surviving shards, rather than a copy (since there is only one copy of any given shard).  A rough sketch of that repair, under the same assumptions as above, follows.
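The sketch below reuses the shard naming and the assumed easyfec calls from the eviction example: fetch any k healthy shards, decode back to the object, re-encode, and rewrite only the shard that scrub flagged.

    # Hypothetical sketch of the modified scrub repair in the "cool" pool.
    from zfec import easyfec


    def repair_shard(cool_ioctx, obj_name, bad_index, k, m, orig_size):
        """Recompute shard `bad_index` of obj_name from k surviving shards."""
        shards, sharenums = [], []
        for i in range(m):
            if i == bad_index:
                continue                          # skip the shard scrub flagged
            name = "%s.shard.%d" % (obj_name, i)
            size, _ = cool_ioctx.stat(name)
            shards.append(cool_ioctx.read(name, length=size))
            sharenums.append(i)
            if len(shards) == k:
                break

        # Decode to the original object, re-encode the full shard set, and
        # rewrite only the damaged shard.
        padlen = len(shards[0]) * k - orig_size
        data = easyfec.Decoder(k, m).decode(shards, sharenums, padlen)
        rebuilt = easyfec.Encoder(k, m).encode(data)[bad_index]
        cool_ioctx.write_full("%s.shard.%d" % (obj_name, bad_index), rebuilt)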

I'll get a bit more of a write-up today, hopefully, in the wiki.

	Christopher

>
> sage


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl


