RE: Comments on Ceph distributed parity implementation

On 06/21/2013 1:29 AM, Loic Dachary wrote:
> 
> On 06/21/2013 03:23 AM, Paul Von-Stamwitz wrote:
> >
> > On 06/20/2013 11:26 PM, Loic Dachary wrote:
> >>
> >> I wrote down a short description of the read/write path I plan to
> >> implement in ceph :
> >> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
> >> A quick look at the drawings will hopefully give you an idea. Each OSD
> >> is a disk connected to the others over the network. Although I chose K+M
> >> = 5 I suspect the most common use case will be around K+M = 7+3 = 10
> >
> > Hi Loic,
> >
> > A couple questions regarding your proposal:
> >
> > Where would encode/decode occur? Client or OSD? Previous discussions
> > seemed to want it at the OSD level. Your proposal seems like it could be
> > either.
> 
> I think it should be in the OSD. Although it would be possible to
> implement it in the client, it would have to be implemented in the OSD
> anyway for scrubbing. Therefore I think it is simpler to implement it in
> the OSD as a first step.

Agreed. Encode/decode/reconstruction belongs in the OSD layer. There is a possibility that the client and OSDs could work in tandem; for example, the client already does sharding. But the OSD is currently responsible for replication, so it should also be responsible for parity. For now, we should strive to keep encoding transparent to the client.
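
To make that concrete, here is a toy sketch of what encode/reconstruct at the
OSD level amounts to. It uses a single XOR parity chunk (K+1) rather than the
K+M Reed-Solomon style coding we actually want, and none of the names below
are Ceph code; it only illustrates splitting an object, computing parity, and
rebuilding one lost chunk:

// Toy sketch only: one XOR parity chunk (K+1), not the K+M coding a real
// OSD would use, and none of this is Ceph code.
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split an object into k equally sized data chunks (zero padded) and
// append one XOR parity chunk.
std::vector<std::string> encode(const std::string& object, std::size_t k) {
  std::size_t chunk_len = (object.size() + k - 1) / k;
  std::vector<std::string> chunks(k, std::string(chunk_len, '\0'));
  for (std::size_t i = 0; i < object.size(); ++i)
    chunks[i / chunk_len][i % chunk_len] = object[i];
  std::string parity(chunk_len, '\0');
  for (const auto& c : chunks)
    for (std::size_t j = 0; j < chunk_len; ++j)
      parity[j] ^= c[j];
  chunks.push_back(parity);                   // chunk k is the parity chunk
  return chunks;
}

// Rebuild the one missing chunk by XOR-ing the surviving ones (including
// the parity chunk).
std::string reconstruct(const std::vector<std::string>& surviving) {
  std::string out(surviving[0].size(), '\0');
  for (const auto& c : surviving)
    for (std::size_t j = 0; j < out.size(); ++j)
      out[j] ^= c[j];
  return out;
}

int main() {
  auto chunks = encode("ABCDEFGHIJ", 4);      // 4 data chunks + 1 parity
  std::string lost = chunks[2];               // pretend the OSD holding chunk 2 is down
  std::vector<std::string> surviving = {chunks[0], chunks[1], chunks[3], chunks[4]};
  assert(reconstruct(surviving) == lost);
  return 0;
}

The point is just that both operations live entirely on the OSD side; the
client hands over an object and never sees chunks.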

> 
> >
> > What about partial reads? For example, if only 'H' was requested, would
> > you decode the entire object first?
> 
> Unless you see a good reason to implement this optimization from the start,
> I think the entire object would be decoded.
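
Purely as an illustration of that unoptimized path (the in-memory chunk
vector below just stands in for the K OSDs, no Ceph API implied): even a
one-byte read like 'H' pulls every chunk and rebuilds the whole object
before slicing out the requested range.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

std::string read_range(const std::vector<std::string>& chunks,
                       std::size_t offset, std::size_t len) {
  std::string object;
  for (const auto& c : chunks)      // fetch every chunk, needed or not
    object += c;                    // a real decode would also strip padding
  return object.substr(offset, len);
}

int main() {
  std::vector<std::string> chunks = {"ABC", "DEF", "GHI", "J"};  // K = 4
  std::cout << read_range(chunks, 7, 1) << "\n";                 // prints "H"
  return 0;
}
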
> 
> > Or, read from OSD1 and reconstruct on error?
> 
> I think that's what we want, ultimately. And also to encode / decode large
> objects (1GB for instance) as a sequence of independent regions (4MB for
> instance or smaller).
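
The bookkeeping for independent regions is just arithmetic. A rough sketch
with the numbers from this thread (1GB object, 4MB regions, K+M = 7+3), only
to show that a small read or repair touches a single region rather than the
whole object:

#include <cstdint>
#include <iostream>

int main() {
  const std::uint64_t object_size = 1ULL << 30;  // 1GB object
  const std::uint64_t region_size = 4ULL << 20;  // encoded as independent 4MB regions
  const std::uint64_t k = 7, m = 3;              // K+M = 7+3 = 10

  const std::uint64_t regions = (object_size + region_size - 1) / region_size;
  const std::uint64_t chunk_per_region = region_size / k;  // rounded; real code would pad

  const std::uint64_t read_offset = 123456789;   // an arbitrary offset inside the object
  const std::uint64_t region = read_offset / region_size;

  std::cout << regions << " regions, each split into " << k
            << " data chunks of ~" << chunk_per_region << " bytes plus "
            << m << " coding chunks\n"
            << "a read at offset " << read_offset
            << " only touches region " << region << "\n";
  return 0;
}
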
> 
> I updated
> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
> to reflect our discussion.
> 
> The first, simplest implementation is likely to be fit for use with RGW and
> probably too slow to use with RBD. Do you think we should try to optimize
> for RBD right now?

Yes, RGW is the obvious best candidate for the first implementation. We don't need to implement for RBD and CephFS now, but we should consider how the design would handle other applications in the future. The alternative is to optimize purely for RGW and provide an API/plug-in capability, as suggested by Harvey Skinner, to make way for optimized solutions for other applications.
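
To illustrate the plug-in idea (purely hypothetical, not an existing Ceph
interface): the OSD would code against a small abstract codec interface plus
a registry keyed by technique name, so an RGW-friendly codec now and an
RBD-optimized one later could slot in without touching the read/write path.

#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

using Chunk = std::string;

// The OSD would only know this interface; jerasure, a plain XOR code, or
// something tuned for RBD later could sit behind it.
class ErasureCodec {
 public:
  virtual ~ErasureCodec() = default;
  // Split an object into k data chunks followed by m coding chunks.
  virtual std::vector<Chunk> encode(const std::string& object) = 0;
  // Rebuild the chunks listed in `want` from the surviving ones in `have`.
  virtual std::map<int, Chunk> decode(const std::set<int>& want,
                                      const std::map<int, Chunk>& have) = 0;
  virtual unsigned k() const = 0;
  virtual unsigned m() const = 0;
};

// A pool would just name its technique ("reed_solomon", "xor", ...) and the
// OSD would look the codec up here.
std::map<std::string, std::unique_ptr<ErasureCodec>>& codec_registry() {
  static std::map<std::string, std::unique_ptr<ErasureCodec>> registry;
  return registry;
}

int main() {
  // Nothing registered in this sketch; a real plug-in would add itself here.
  return codec_registry().empty() ? 0 : 1;
}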

All the best
pvs
