Hi all,

I agree that the OSD level (storage nodes) is where the
encoding/reconstruction should be. From an overall solution perspective,
client nodes are sized for the apps they are deploying. It would be a
real push to get users to also over-provision app/client servers to take
on the additional processing needed for encoding, etc. Better to have
this as part of the storage node sizing criteria: whether the end user
is going to use erasure encoding, replication, or both (in different
pools) on the same storage node platform.

Harvey

On Fri, Jun 21, 2013 at 5:08 PM, Paul Von-Stamwitz
<PVonStamwitz@xxxxxxxxxxxxxx> wrote:
>
> On 06/21/2013 1:29 AM, Loic Dachary wrote:
>>
>> On 06/21/2013 03:23 AM, Paul Von-Stamwitz wrote:
>> >
>> > On 06/20/2013 11:26 PM, Loic Dachary wrote:
>> >>
>> >> I wrote down a short description of the read/write path I plan to
>> >> implement in ceph:
>> >> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst.
>> >> A quick look at the drawings will hopefully give you an idea. Each
>> >> OSD is a disk connected to the others over the network. Although I
>> >> chose K+M = 5, I suspect the most common use case will be around
>> >> K+M = 7+3 = 10.
>> >
>> > Hi Loic,
>> >
>> > A couple of questions regarding your proposal:
>> >
>> > Where would encode/decode occur? Client or OSD? Previous discussions
>> > seemed to want it at the OSD level. Your proposal seems to allow
>> > either.
>>
>> I think it should be in the OSD. Although it would be possible to
>> implement it in the client, it would have to be implemented in the OSD
>> anyway for scrubbing. Therefore I think it is simpler to implement it
>> in the OSD as a first step.
>
> Agreed. Encode/decode/reconstruction belongs in the OSD layer. There is
> a possibility that the client and OSDs can work in tandem. For example,
> the client already does sharding. But the OSD is currently responsible
> for replication, so it should be responsible for parity. For now, we
> should strive for encoding to be transparent to the client.
>
>> >
>> > What about partial reads? For example, if only 'H' was requested,
>> > would you decode the entire object first?
>>
>> Unless you see a good reason to implement this optimization from the
>> start, I think the entire object would be decoded.
>>
>> > Or, read from OSD1 and reconstruct on error?
>>
>> I think that's what we want, ultimately. And also to encode / decode
>> large objects (1GB for instance) as a sequence of independent regions
>> (4MB for instance, or smaller).
>>
>> I updated
>> https://github.com/dachary/ceph/blob/wip-4929/doc/dev/osd_internals/erasure-code.rst
>> to reflect our discussion.
>>
>> The first, simplest implementation is likely to be fit for use with
>> RGW and probably too slow for use with RBD. Do you think we should try
>> to optimize for RBD right now?
>
> Yes, RGW is the obvious best candidate for the first implementation. We
> don't need to implement for RBD and CephFS now, but we should consider
> how the design would handle other applications in the future. The
> alternative is to optimize purely for RGW and provide the API/plug-in
> capability suggested by Harvey Skinner to make way for optimized
> solutions for other applications.
>
> All the best
> pvs
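
For anyone following along, here is a rough Python sketch of two ideas from the thread: encoding a large object as a sequence of independent regions, and "read the chunk directly, reconstruct only on error". It is an illustration, not the design in erasure-code.rst. It assumes a single XOR parity chunk (K data chunks, M = 1) as a stand-in for a real erasure code, and all function names (split_regions, encode_region, read_chunk, storage_overhead) are hypothetical. The last helper just works out raw-space overhead: K+M = 7+3 stores 10/7 (about 1.43) bytes per byte of data, compared with 3 bytes for a pool that keeps three replicas (the replica count is my assumption).

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_regions(data: bytes, region_size: int) -> List[bytes]:
    """Cut an object into independent regions (e.g. 4MB each)."""
    return [data[i:i + region_size] for i in range(0, len(data), region_size)]


def encode_region(region: bytes, k: int) -> List[bytes]:
    """Split one region into k data chunks and append one XOR parity chunk.

    Assumption: M = 1 (simple parity), not a general K+M erasure code.
    The region is zero-padded so that all chunks have the same length.
    """
    chunk_len = -(-len(region) // k)  # ceiling division
    padded = region.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def read_chunk(stored: List[Optional[bytes]], index: int) -> bytes:
    """Partial read: return one chunk, reconstructing only if it is missing.

    `stored` holds the k data chunks followed by the parity chunk; a lost
    chunk (failed OSD in this toy model) is represented by None.
    """
    chunk = stored[index]
    if chunk is not None:
        return chunk  # fast path: no decoding needed
    survivors = [c for i, c in enumerate(stored) if i != index]
    if any(c is None for c in survivors):
        raise IOError("more than one chunk lost; XOR parity cannot recover")
    return reduce(xor_bytes, survivors)


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a K+M layout."""
    return (k + m) / k
```

With K = 4 and 400-byte regions, for example, losing the third chunk of a region still lets read_chunk(stored, 2) return bytes 200..299 of that region by XOR-ing the four surviving chunks, while a read of any intact chunk touches only that chunk.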