Re: Comments on Ceph distributed parity implementation

Harvey Skinner <hpmpec2a@xxxxxxxxx> · Sun, 23 Jun 2013 19:26:38 -0700

hi all,

I agree that the OSD level (the storage nodes) is where the
encoding/reconstruction should be. From an overall solution
perspective, client nodes are sized for the apps they are deploying.
It would be a real push to get end users to also over-provision
app/client servers to take on the additional processing needed for
encoding, etc. Better to make this part of the storage node sizing
criteria: whether the end user is going to use erasure coding,
replication, or both (in different pools) on the same storage node
platform.

Harvey

On Fri, Jun 21, 2013 at 5:08 PM, Paul Von-Stamwitz
<PVonStamwitz@xxxxxxxxxxxxxx> wrote:
>
> On 06/21/2013 1:29 AM, Loic Dachary wrote:
>>
>> On 06/21/2013 03:23 AM, Paul Von-Stamwitz wrote:
>> >
>> > On 06/20/2013 11:26 PM, Loic Dachary wrote:
>> >>
>> >> I wrote down a short description of the read/write path I plan to
>> >> implement in ceph:
>> >> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst.
>> >> A quick look at the drawings will hopefully give you an idea. Each OSD
>> >> is a disk connected to the others over the network. Although I chose
>> >> K+M = 5, I suspect the most common use case will be around
>> >> K+M = 7+3 = 10.
>> >
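
To put rough numbers on the K+M choice quoted above, here is a minimal sketch of the raw storage overhead. It assumes a code in which any K of the K+M chunks are enough to rebuild the object (as with Reed-Solomon style codes; the thread does not name the code), and it uses 3-way replication only as an assumed point of comparison. The function names are made up for illustration.

```python
from fractions import Fraction


def erasure_overhead(k: int, m: int) -> Fraction:
    """Raw bytes stored per byte of user data for a K+M code."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    return Fraction(k + m, k)


def replication_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data with full copies."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return Fraction(copies, 1)


# K+M = 7+3 is the layout suggested in the thread; 3 replicas is an
# assumption used here only for comparison.
k, m = 7, 3
ec = erasure_overhead(k, m)
rep = replication_overhead(3)

# Assumes any K chunks suffice to decode, so up to M chunks may be lost.
print(f"K+M = {k}+{m}: {float(ec):.2f}x raw storage, tolerates {m} lost chunks")
print(f"3 replicas: {float(rep):.2f}x raw storage, tolerates 2 lost copies")
```

Under those assumptions, 7+3 stores 10/7 of the user data while tolerating three lost chunks.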
>> > Hi Loic,
>> >
>> > A couple questions regarding your proposal:
>> >
>> > Where would encode/decode occur? Client or OSD? Previous discussions
>> > seemed to want it at the OSD level. Your proposal seems to allow
>> > either.
>>
>> I think it should be in the OSD. Although it would be possible to
>> implement it in the client, it would have to be implemented in the OSD
>> anyway for scrubbing. Therefore I think it is simpler to implement it in
>> the OSD as a first step.
>
> Agreed. Encode/decode/reconstruction belongs in the OSD layer. There is a
> possibility that the client and OSDs can work in tandem. For example, the
> client already does sharding. But, the OSD is currently responsible for
> replication, so it should be responsible for parity. For now, we should
> strive for encoding to be transparent to the client.
>
>>
>> >
>> > What about partial reads? For example, if only 'H' was requested, would
>> > you decode the entire object first?
>>
>> Unless you see a good reason to implement this optimization from the
>> start, I think the entire object would be decoded.
>>
>> > Or, read from OSD1 and reconstruct on error?
>>
>> I think that's what we want, ultimately. And also to encode / decode
>> large objects (1GB for instance) as a sequence of independent regions
>> (4MB for instance or smaller).
>>
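
The two ideas quoted above (read the chunk directly and reconstruct only on error, and encode a large object as a sequence of independent regions) can be sketched as follows. This is a toy illustration, not the design in the linked document: it swaps in a single XOR parity chunk (M = 1) for whatever code is eventually chosen, and the region size, K, and all function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional

REGION_SIZE = 8  # toy value; the thread suggests 4MB or smaller
K = 4            # data chunks per region (hypothetical)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_region(region: bytes) -> List[bytes]:
    """Split one region into K data chunks plus one XOR parity chunk."""
    size = -(-len(region) // K)              # ceiling division
    padded = region.ljust(size * K, b"\0")   # zero-pad the last chunk
    chunks = [padded[i * size:(i + 1) * size] for i in range(K)]
    return chunks + [reduce(xor_bytes, chunks)]


def encode_object(obj: bytes) -> List[List[bytes]]:
    """Encode each region independently of the others."""
    return [encode_region(obj[i:i + REGION_SIZE])
            for i in range(0, len(obj), REGION_SIZE)]


def read_chunk(stored: List[Optional[bytes]], index: int) -> bytes:
    """Return one chunk: read it directly, rebuild it only if it is missing."""
    chunk = stored[index]
    if chunk is not None:
        return chunk                          # fast path, no decode
    survivors = [c for i, c in enumerate(stored) if i != index]
    if any(c is None for c in survivors):
        raise ValueError("XOR parity can rebuild only one missing chunk")
    return reduce(xor_bytes, survivors)       # XOR of all other chunks


regions: List[List[Optional[bytes]]] = [list(r) for r in encode_object(b"ABCDEFGHIJKLMNOP")]
direct = read_chunk(regions[0], 3)            # chunk holding 'H', read directly
regions[0][3] = None                          # simulate a failed read
rebuilt = read_chunk(regions[0], 3)           # reconstructed from the other chunks
```

A request for 'H' touches only one chunk of region 0 in the normal case, and a failure in that region never requires reading or decoding region 1.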
>> I updated
>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>> to reflect our discussion.
>>
>> The first, simplest implementation is likely to be fit for use with RGW
>> and probably too slow to use with RBD. Do you think we should try to
>> optimize for RBD right now?
>
> Yes, RGW is the obvious best candidate for the first implementation. We
> don't need to implement for RBD and CephFS now, but we should consider how
> the design would handle other applications in the future. The alternative is
> to optimize purely for RGW and provide an API/plug-in capability suggested
> by Harvey Skinner to make way for optimized solutions for other
> applications.
>
> All the best
> pvs
>