On Mon, 22 Apr 2013, Loic Dachary wrote:
> Hi Sage,
>
> On 04/22/2013 05:09 PM, Sage Weil wrote:
> > On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
> >> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
> >>
> >>> Hi Christopher,
> >>>
> >>> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment. In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD." but there is no way for a client to enforce the fact that two objects are stored in separate PGs.
> >>
> >> Poorly worded. The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards. However, the OSDs would treat each shard as an n=1 replication and just store it locally.
> >>
> >> Actually, looking at this this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode into the (e)OSD). It was meant to cover the alternative approaches. I didn't like this one, and it now appears to be both more difficult and non-deterministic in its placement.
> >>
> >> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects, and using n=3 returns R={OSD18,OSD45,OSD97}, then if an object matching x but with n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store? If it would forward, this idea is DOA. Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?
> >
> > It would go to osd18, the first item in the sequence that CRUSH generates.
> >
> > As Loic observes, not having control of placement from above the librados level makes this more or less a non-starter. The only thing that might work at that layer is to set up ~5 or more pools, each with a distinct set of OSDs, and put each shard/fragment in a different pool. I don't think that is a particularly good approach.
> >
> > If we are going to do parity encoding (and I think we should!), I think we should fully integrate it into the OSD.
> >
> > The simplest approach:
> >
> > - we create a new PG type for 'parity' or 'erasure' or whatever (type fields already exist)
> > - those PGs use the parity ('INDEP') crush mode so that placement is intelligent
>
> I assume you do not mean CEPH_FEATURE_INDEP_PG_MAP as used in
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L5237
>
> but CRUSH_RULE_CHOOSE_INDEP as used in
>
> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L331
>
> when firstn == 0 because it was set in
>
> https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L523

Right.

> I see that it would be simpler to write
>
> step choose indep 0 type row
>
> and then rely on intelligent placement. Is there a reason why it would not be possible to use firstn instead of indep?

The indep placement avoids moving a shard around between ranks: a mapping of [0,1,2,3,4] will change to [0,6,2,3,4] (or something) if osd.1 fails, so the shards on 2, 3, and 4 won't need to be copied around.
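To make the firstn/indep difference concrete, here is a toy, self-contained C++ sketch. It is purely illustrative, not the real mapper.c logic, and the osd ids and the choice of osd.6 as replacement are made up: firstn-style selection shifts the surviving entries down a rank when osd.1 drops out, while indep-style selection fills the failed rank in place, so no shard has to move between ranks.

#include <algorithm>
#include <iostream>
#include <vector>

// firstn-style selection (conceptually): the failed OSD is dropped from the
// sequence and a replacement shows up later, so the survivors' ranks shift.
std::vector<int> remap_firstn(std::vector<int> acting, int failed, int replacement) {
  acting.erase(std::remove(acting.begin(), acting.end(), failed), acting.end());
  acting.push_back(replacement);
  return acting;
}

// indep-style selection (conceptually): a replacement is chosen for the failed
// rank itself, so every surviving shard keeps the rank it already had.
std::vector<int> remap_indep(std::vector<int> acting, int failed, int replacement) {
  for (int& osd : acting)
    if (osd == failed)
      osd = replacement;
  return acting;
}

static void print(const char* label, const std::vector<int>& v) {
  std::cout << label << " [";
  for (size_t i = 0; i < v.size(); ++i)
    std::cout << v[i] << (i + 1 < v.size() ? "," : "");
  std::cout << "]\n";
}

int main() {
  std::vector<int> acting = {0, 1, 2, 3, 4};  // shard rank -> osd id
  print("firstn after osd.1 fails:", remap_firstn(acting, 1, 6));  // [0,2,3,4,6]: shards at ranks 2,3,4 change rank
  print("indep  after osd.1 fails:", remap_indep(acting, 1, 6));   // [0,6,2,3,4]: shards at ranks 2,3,4 stay put
  return 0;
}

The real mapping is of course done by CRUSH itself; the sketch only illustrates why a rule using indep keeps shard ranks stable across failures, which is what makes it attractive for erasure-coded PGs.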
> > - all reads and writes go to the 'primary'
> > - the primary does the shard encoding and distributes the write pieces to the other replicas
>
> Although I understand how that would work when a PG receives a CEPH_OSD_OP_WRITEFULL
>
> https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L2504
>
> it may be inconvenient and expensive to recompute the parity-encoded version if an object is written with a series of CEPH_OSD_OP_WRITE. The simplest way would be to decode the existing object, modify it according to what CEPH_OSD_OP_WRITE requires, and encode it again.

Yeah. Making small writes remotely efficient is a huge challenge. OTOH, if we have some sort of symlink/redirect, and we do writes to replicated objects and only write erasure/parity data in full objects, then we avoid that complexity.

> > - same for reads
> >
> > There will be a pile of patches to move code around between PG and ReplicatedPG, which will be annoying, but hopefully not too painful. The class structure and data types were set up with this in mind long ago.
> >
> > Several key challenges:
> >
> > - come up with a scheme for internal naming to keep shards distinct
> > - safely rewriting a stripe when there is a partial overwrite. Probably want to write new stripes to distinct new objects (cloning old data as needed) and clean up the old ones once enough copies are present.
>
> Do you mean RBD stripes?

I'm thinking stripes of a logical object across the object's shards, mostly in terms of something like RAID4; I'm not sure what terminology is typically used for erasure coding systems. I'm assuming that stripes of the object would be coded so that partial object reads/updates of large objects are vaguely efficient... Ideally, the implementation would have a field indicating what coding technique is used (parity, erasure, whatever). Different optimizations are best for different coding strategies, but the minimum useful feature in this case wouldn't optimize anything anyway :)
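To picture that RAID4-style layout, here is a minimal, self-contained C++ sketch. The names (StripeLayout, ChunkAddress, xor_parity) and parameters (k, stripe_unit) are hypothetical, and plain XOR parity stands in for a real erasure code; this is not a proposed Ceph interface, just the offset math and the cost of recomputing a stripe's parity.

#include <cstdint>
#include <iostream>
#include <vector>

struct ChunkAddress {
  uint64_t stripe;     // which stripe of the logical object
  uint32_t shard;      // which data shard holds this byte
  uint32_t chunk_off;  // offset within that shard's chunk
};

struct StripeLayout {
  uint32_t k;            // data shards per stripe
  uint32_t stripe_unit;  // bytes per data chunk

  // Map a logical object offset to the data shard and chunk offset storing it.
  ChunkAddress map(uint64_t off) const {
    const uint64_t stripe_width = uint64_t(k) * stripe_unit;
    const uint64_t in_stripe = off % stripe_width;
    return ChunkAddress{off / stripe_width,
                        uint32_t(in_stripe / stripe_unit),
                        uint32_t(in_stripe % stripe_unit)};
  }
};

// XOR parity over the equal-sized data chunks of one stripe (RAID4-style).
// Overwriting part of one chunk means either re-reading all the chunks to
// redo this, or applying an old-xor-new delta to the parity chunk.
std::vector<uint8_t> xor_parity(const std::vector<std::vector<uint8_t>>& chunks) {
  std::vector<uint8_t> parity(chunks.at(0).size(), 0);
  for (const auto& chunk : chunks)
    for (size_t i = 0; i < chunk.size(); ++i)
      parity[i] ^= chunk[i];
  return parity;
}

int main() {
  StripeLayout layout{4, 64 * 1024};        // hypothetical: 4 data shards, 64 KB chunks
  ChunkAddress a = layout.map(300 * 1024);  // a 4 KB write at offset 300 KB starts here
  std::cout << "stripe=" << a.stripe << " shard=" << a.shard
            << " chunk_off=" << a.chunk_off << "\n";  // stripe=1 shard=0 chunk_off=45056

  // Parity for one toy stripe of four 3-byte chunks.
  std::vector<std::vector<uint8_t>> chunks = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
  std::vector<uint8_t> p = xor_parity(chunks);  // {1^4^7^10, 2^5^8^11, 3^6^9^12} = {8, 4, 0}
  std::cout << "parity: " << int(p[0]) << " " << int(p[1]) << " " << int(p[2]) << "\n";
  return 0;
}

The point of the toy numbers: a 4 KB write at 300 KB lands entirely in one 64 KB chunk, yet the stripe's parity chunk still has to be recomputed or patched, which is exactly the read-modify-write cost discussed above and why writing new stripes to distinct objects, or staging small writes in replicated objects, looks attractive.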
> > - recovery logic
>
> If recovery is done from the scheduled scrubber in the ErasureCodedPG, I'm not sure whether OSD.cc must be modified or is truly independent of the PG type:
>
> https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L3818
>
> I'll keep looking, thanks a lot for the hints :-)
>
> Cheers
>
> > sage
> >
> >> Thx
> >> Christopher
> >>
> >>> Am I missing something ?
> >>>
> >>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
> >>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
> >>>>
> >>>>> On Thu, 18 Apr 2013 16:09:52 -0500
> >>>>> Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
> >>>>>
> >>>>> @Bryan: I did come across cleversafe. All the articles around it seemed promising, but unfortunately it seems everything related to the cleversafe open source project somehow vanished from the internet (e.g. http://www.cleversafe.org/). Quite weird...
> >>>>>
> >>>>> @Sage: interesting. I thought it would be relatively simple if one assumes the restriction of immutable files. I'm not familiar with those ceph specifics you're mentioning. When building an erasure-codes-based system, maybe there are ways to reuse existing ceph code and/or allow some integration with replication-based objects, without aiming for full integration or full support of the rados api, based on some tradeoffs.
> >>>>>
> >>>>> @Josh, that sounds like an interesting approach. Too bad that page doesn't contain any information yet :)
> >>>>
> >>>> Greetings - it does now - see what you all think?
> >>>>
> >>>> Christopher
> >>>>
> >>>>> Dieter
> >>>>
> >>>> --
> >>>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
> >>>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
> >>>> Check my calendar availability: https://tungle.me/cdl
>
> --
> Loïc Dachary, Artisan Logiciel Libre