Re: Erasure encoding as a storage backend

On Sat, May 4, 2013 at 11:47 AM, Noah Watkins <jayhawk@xxxxxxxxxxx> wrote:
>
> On May 4, 2013, at 11:36 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>
>>
>>
>> On 05/04/2013 08:27 PM, Noah Watkins wrote:
>>>
>>> On May 4, 2013, at 10:16 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>> It would be great to get feedback before the Ceph Summit to address the most prominent issues.
>>>
>>> One thing that has been in the back of my mind is how this proposal is influenced (if at all) by a future that includes declustered per-file RAID in CephFS. I realize that may be a distant future, but it seems as though there could be a lot of overlap with the (non-client-driven) rebuild/recovery component of such an architecture.
>>
>> Hi Noah,
>>
>> I'm not sure what declustered per-file RAID is, which means it had no influence on this proposal ;-) Would you be so kind as to educate me?
>
> I'm definitely far from an expert on the topic. But briefly, the way I think about it is:
>
> Currently CephFS stripes a file's byte stream across a set of objects (e.g. the first MB in object 0, the 2nd MB in object 1, etc.), and each of these objects is in turn replicated. Following a failure, PGs re-replicate objects.
>
> In client-driven RAID the striping algorithm is changed, and clients calculate and distribute parity. In this case parity, rather than replication, provides the redundancy. So one might consider storing objects in a pool with replication size 1. However, the standard PG that does replication wouldn't be able to handle faults correctly (it would need to do a parity rebuild rather than re-replication), and a smarter PG like the ErasureCodedPG would be needed.
>
> So it seems like the problems are related, but I'm not sure exactly how much overlap there is :)
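
To make the striping-plus-parity description above concrete, here is a minimal Python sketch. This is not Ceph code: the 1 MB stripe unit matches the example above, but the chunk count and the simple XOR (RAID-5-style) parity are illustrative assumptions only.

# Sketch of striping a file's byte stream across objects, with the client
# computing a parity chunk per stripe. Names and parameters are assumptions
# for illustration, not actual Ceph/CephFS interfaces.

STRIPE_UNIT = 1 << 20   # 1 MB per object chunk, as in the example above
K = 4                   # data chunks per stripe (assumed)

def object_for_offset(offset):
    """Map a byte offset in the file to (object index, offset within object)."""
    return offset // STRIPE_UNIT, offset % STRIPE_UNIT

def parity_chunk(data_chunks):
    """Client-side parity over one stripe: XOR of the K data chunks."""
    assert len(data_chunks) == K
    parity = bytearray(STRIPE_UNIT)
    for chunk in data_chunks:
        padded = chunk.ljust(STRIPE_UNIT, b"\0")
        for i in range(STRIPE_UNIT):
            parity[i] ^= padded[i]
    return bytes(parity)

# Any single lost data chunk can then be rebuilt by XOR-ing the parity chunk
# with the surviving data chunks -- which is exactly the kind of repair a
# plain replicating PG doesn't know how to do.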

I'm pretty sure we'd just want to use erasure-coded RADOS pools,
rather than trying to do any CephFS-level magic erasure encoding.
Doing it above the RADOS layer would introduce some very odd
behaviors in terms of losing objects, as you've mentioned, and would
require the clients to generate a lot more network traffic for reads
and writes.
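
To put a rough number on that extra client traffic, here is a back-of-the-envelope sketch; the 4+2 layout and the write size are illustrative assumptions, not figures from this thread.

# Rough arithmetic behind the "more client network traffic" point above.
# The k/m layout and sizes are assumptions for illustration only.

MB = 1 << 20
write_size = 4 * MB     # client writes 4 MB of file data
k, m = 4, 2             # assumed layout: 4 data chunks + 2 coding chunks

# Replicated pool: the client sends the data once; the primary OSD fans
# replication out server-side, so replica traffic never touches the client.
client_bytes_replicated = write_size

# Client-driven erasure coding above RADOS: the client must compute and
# ship the coding chunks itself on every write.
client_bytes_client_ec = write_size + (m * write_size // k)

print(client_bytes_replicated // MB, "MB vs",
      client_bytes_client_ec // MB, "MB on the client write path")

# On a degraded read the gap widens: a replicated pool still serves the
# object from one surviving copy, while the client-side scheme has to fetch
# k chunks and reconstruct the missing one itself.
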
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com