Supposedly, on 2013-Apr-22, at 08.09 PDT(-0700), someone claiming to be Sage Weil scribed:

> On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic Dachary scribed:
>>
>>> Hi Christopher,
>>>
>>> You wrote "A modified client/library could be used to store objects that should be sharded, vs "standard" ceph treatment. In this model, each shard would be written to a separate PG, and each PG would be stored on exactly one OSD." but there is no way for a client to enforce the fact that two objects are stored in separate PGs.
>>
>> Poorly worded. The idea is that each shard becomes a separate object, and the encoder/sharder would use CRUSH to identify the OSDs to hold the shards. However, the OSDs would treat the shard as an n=1 replication and just store it locally.
>>
>> Actually, looking at this this morning, this is harder than the preferred alternative (i.e. grafting an encode/decode into the (e)OSD). It was meant to cover the alternative approaches. I didn't like this one, and it now also appears to be more difficult, with non-deterministic placement.
>>
>> One question on CRUSH (it's been too long since I read the paper): if x is the same for two objects and using n=3 returns R={OSD18,OSD45,OSD97}, and an object that matches x but has n=1 is handed to OSD45, would OSD45 store it, or would it forward it to OSD18 to store? If it would forward it, this idea is DOA. Also, if x is held invariant but n changes, does the same R set get returned (truncated to n members)?
>
> It would go to osd18, the first item in the sequence that CRUSH generates.

That's what I thought - then it is a non-starter.

>
> As Loic observes, not having control of placement from above the librados
> level makes this more or less a non-starter. The only thing that might
> work at that layer is to set up ~5 or more pools, each with a distinct set
> of OSDs, and put each shard/fragment in a different pool. I don't think
> that is a particularly good approach.

Pretty much a kludge - I would agree.

>
> If we are going to do parity encoding (and I think we should!), I think we
> should fully integrate it into the OSD.
>
> The simplest approach:
>
> - we create a new PG type for 'parity' or 'erasure' or whatever (type
>   fields already exist)
> - those PGs use the parity ('INDEP') crush mode so that placement is
>   intelligent
> - all reads and writes go to the 'primary'
> - the primary does the shard encoding and distributes the write pieces to
>   the other replicas
> - same for reads

Yup - that's basically what I was trying to outline for the single-tier model. I called them eOSDs.

>
> There will be a pile of patches to move code around between PG and
> ReplicatedPG, which will be annoying, but hopefully not too painful. The
> class structure and data types were set up with this in mind long ago.
>
> Several key challenges:
>
> - come up with a scheme for internal naming to keep shards distinct
> - safely rewriting a stripe when there is a partial overwrite. probably
>   want to write new stripes to distinct new objects (cloning old data as
>   needed) and clean up the old ones once enough copies are present.
> - recovery logic

Been giving this some thought - I'll try to get them into the blueprint. Is the blueprint, as it is, reasonable to include in the design summit, knowing that it will continue to evolve?

>
> sage
>

Christopher

>
>>
>> Thx
>> Christopher
>>
>>>
>>> Am I missing something?
>>>
>>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be Plaetinck, Dieter scribed:
>>>>
>>>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>>>> Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:
>>>>>
>>>>> @Bryan: I did come across cleversafe. All the articles around it seemed promising,
>>>>> but unfortunately it seems everything related to the cleversafe open source project
>>>>> somehow vanished from the internet. (e.g. http://www.cleversafe.org/) quite weird...
>>>>>
>>>>> @Sage: interesting. I thought it would be relatively simple if one assumes
>>>>> the restriction of immutable files. I'm not familiar with those ceph specifics you're mentioning.
>>>>> When building an erasure-code-based system, maybe there are ways to reuse existing ceph
>>>>> code and/or allow some integration with replication-based objects, without aiming for full integration or
>>>>> full support of the rados api, based on some tradeoffs.
>>>>>
>>>>> @Josh, that sounds like an interesting approach. Too bad that page doesn't contain any information yet :)
>>>>
>>>> Greetings - it does now - see what you all think?
>>>>
>>>> Christopher
>>>>
>>>>>
>>>>> Dieter
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre

--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl
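
For anyone following along, here is a rough sketch of the "primary encodes and distributes" flow Sage outlines above. It is only an illustration, not Ceph code: it assumes a single XOR parity shard (k data shards plus one parity shard, so it survives the loss of one shard), and the function names, the `name.shardN` naming scheme, and the plain dict standing in for the OSDs are all made up for the example. The OSD ids are just the ones from the CRUSH question earlier in the thread.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_stripe(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal shards (zero-padded) and append one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def recover_shard(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) by XORing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can rebuild only one lost shard")
    result = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        result[missing[0]] = reduce(xor_bytes, survivors)
    return result


def primary_write(name: str, data: bytes, osds: List[str], k: int) -> dict:
    """Hypothetical primary: encode the object, then assign shard i to osds[i].

    The 'name.shardN' internal naming is an assumption made for this sketch.
    """
    pieces = encode_stripe(data, k)
    if len(osds) != len(pieces):
        raise ValueError("need exactly k + 1 OSDs")
    return {osd: (f"{name}.shard{i}", piece)
            for i, (osd, piece) in enumerate(zip(osds, pieces))}


if __name__ == "__main__":
    osds = ["osd18", "osd45", "osd97"]
    data = b"hello erasure coding"
    placed = primary_write("obj", data, osds, k=2)
    pieces = [placed[o][1] for o in osds]
    pieces[1] = None  # pretend osd45 is down
    rebuilt = recover_shard(pieces)
    print(b"".join(rebuilt[:2])[:len(data)] == data)
```

A real code would want more than one parity shard, and this says nothing about the partial-overwrite or recovery problems listed above; it only shows why routing every read and write through the primary keeps the encoding in one place.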