Hi Chaitanya,

Have you ruled out variants using a fixed chunk size? (Arguments for/against fingerprinting elided.)

Matt

----- "Chaitanya Huilgol" <Chaitanya.Huilgol@xxxxxxxxxxx> wrote:

> Hi James et al.,
>
> Here is an example for clarity:
> 1. Client writes object object.abcd
> 2. Based on the CRUSH rules, say OSD.a is the primary OSD which receives the write
> 3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments; object.abcd is now represented by a manifest object with the list of segment hashes and lengths:
> [Header]
> [Seg1_sha, len]
> [Seg2_sha, len]
> ...
> [SegN_sha, len]
> 4. OSD.a writes each segment as a new object in the cluster with object name <reserved_dedupe_prefix><sha>
> 5. The dedupe object write is treated differently from regular object writes: if the object is already present, its reference count is incremented and the object is not overwritten - this forms the basis of the dedupe logic. Multiple objects with one or more identical constituent segments end up sharing the segment objects.
> 6. Once all the segments are successfully written, the object 'object.abcd' is just a stub object with the segment manifest described above, and it goes through a regular object write sequence.
>
> Partial writes on objects will be complicated:
> - Partially affected segments will have to be read, and the segmentation logic has to be re-run from the first to the last affected segment boundary
> - New segments will be written
> - Old, overwritten segments have to be deleted
> - The merged manifest of the object is written
>
> All of this will need the protection of the PG lock, and an additional journaling mechanism will be needed to recover from cases where the OSD goes down before writing all the segments.
>
> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be the data tiering model with object redirects. The manifest object fits quite well into the object-redirects scheme of things: the idea is that, when an object is moved out of the base tier, you have the option to create a dedupe stub object and write the individual segments into the cold backend tier with a rados plugin.
>
> Remaining responses inline.
>
> Regards,
> Chaitanya
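For concreteness, here is a minimal sketch of steps 3 and 4 above, assuming static (fixed-size) chunking and SHA-256 as the secure hash; the DEDUPE_PREFIX string, the 4 MiB segment size, and the JSON encoding of the stub are illustrative assumptions, not anything specified in this thread.

# Sketch of steps 3-4: static chunking, per-segment SHA-256, and a manifest
# stub listing (hash, length) pairs.  Prefix, segment size and JSON framing
# are illustrative choices only.
import hashlib
import json

DEDUPE_PREFIX = "__dedupe__"      # stands in for <reserved_dedupe_prefix>
SEGMENT_SIZE = 4 * 1024 * 1024    # 4 MiB static segments (assumption)

def build_manifest(data):
    """Split object data into fixed-size segments and return
    (manifest, {segment_object_name: segment_bytes})."""
    segments = {}
    manifest = {"header": {"segment_size": SEGMENT_SIZE}, "segments": []}
    for off in range(0, len(data), SEGMENT_SIZE):
        chunk = data[off:off + SEGMENT_SIZE]
        sha = hashlib.sha256(chunk).hexdigest()
        manifest["segments"].append({"sha": sha, "len": len(chunk)})
        # Step 4: each segment becomes its own object named by its hash, so
        # identical chunks from different objects collide on the same name.
        segments[DEDUPE_PREFIX + sha] = chunk
    return manifest, segments

# The stub written in place of object.abcd is then just the serialized manifest:
manifest, segments = build_manifest(b"example payload" * 100000)
stub_object = json.dumps(manifest).encode()

Because duplicate chunks map to the same segment object name, the refcount behaviour in step 5 is what turns those name collisions into actual space savings.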
> -----Original Message-----
> From: James (Fei) Liu-SSI [mailto:james.liu@xxxxxxxxxxxxxxx]
> Sent: Wednesday, July 01, 2015 4:00 AM
> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi Chaitanya,
> Very interesting thoughts. I am not sure whether I get all of them or not. Here are several questions about the solution you provided; they might be a bit detailed.
>
> Regards,
> James
>
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/PG
> [James] Does the OSD/PG mean the PG backend over here?
> [Chaitanya] I mean the primary OSD and the PG which get selected by CRUSH - not a specific OSD component
>
> - Data is segmented (rabin/static) and a secure hash is computed
> [James] Which component in the OSD is going to do the data segmentation and hash computation?
> [Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock. Probably in the do_request() path?
>
> - A manifest is created with the offset/len/hash for all the segments
> [James] Is the manifest going to be part of the object's xattrs? Where are you going to save the manifest?
> [Chaitanya] The manifest is a stub object with the constituent segments list
>
> - OSD/PG sends a rados write with a special name <__known__prefix><secure hash> for all segments
> [James] What is your meaning of a rados write? Where do all the segments with the secure hash signature get written to?
> [Chaitanya] All segments are unique objects with the above-mentioned naming scheme; they get written back into the cluster as regular client rados object writes
>
> - The PG receiving the dedup write will:
> 1. check for object presence and create the object if not present
> 2. if the object is already present, then a reference count is incremented (check and increment needs to be atomic)
> [James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object? If object-based, it totally makes sense. However, why do we need segments with a manifest?
>
> - Response is received by the original primary PG for all segments
> [James] What response?
> [Chaitanya] The write response indicating the status of the segment object write
>
> - Primary PG writes the manifest to local and replicas or EC members
> [James] How about the dedupe data if the data is not present in the replicas?
> [Chaitanya] I am sorry, I did not get your question. The manifest object gets written on the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.
>
> - Response sent to client
>
> Read:
> - Read received at primary PG
> [James] Can the read only fetch data from the primary PG?
> - Reads manifest object
> - Sends reads for each segment object <__known__prefix><secure hash>
> - Coalesces all the responses to build the required data
> - Responds to client
>
> Pros:
> No need for a centralized hash index, so in line with Ceph's no-bottleneck philosophy
>
> Cons:
> Some PGs may get overloaded due to frequently occurring segment patterns
> Latency and increased traffic on the network
>
> -----Original Message-----
> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@xxxxxxxxxxx]
> Sent: Tuesday, June 30, 2015 8:50 AM
> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> - The reference count has to be maintained as an attribute of the object
> - As mentioned in the write workflow, duplicate segment writes increment the reference count
> - An object delete results in deletes on the constituent segments listed in the object's segment manifest
> - A segment object delete decrements the reference count and removes the segment when there are no more references present
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: Allen Samuels
> Sent: Tuesday, June 30, 2015 9:02 PM
> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> This covers the read and write; what about the delete? One of the major issues with dedupe, whether global or local, is addressing the inherent ref-counting associated with the sharing of pieces of storage.
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
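Picking up Allen's delete question and the reference-counting bullets above, here is a toy in-memory sketch of that scheme: the count kept as an attribute of the segment object, duplicate writes incrementing it, and a stub delete decrementing each constituent segment. The class and names are invented for illustration and it reuses the manifest shape from the earlier sketch; in a real OSD, both check-and-increment and decrement-and-remove would have to be atomic, e.g. under the PG lock or as an OSD-side operation.

# Toy model of segment reference counting; not an OSD implementation.
class SegmentStore:
    def __init__(self):
        self.data = {}      # segment object name -> bytes
        self.refcount = {}  # segment object name -> int (the per-object attribute)

    def write_segment(self, name, chunk):
        if name in self.data:
            self.refcount[name] += 1      # duplicate write: share, don't rewrite
        else:
            self.data[name] = chunk
            self.refcount[name] = 1

    def delete_stub(self, manifest, prefix="__dedupe__"):
        # Deleting the stub walks its manifest and drops one reference per
        # constituent segment, removing a segment only when the count hits zero.
        for seg in manifest["segments"]:
            name = prefix + seg["sha"]
            self.refcount[name] -= 1
            if self.refcount[name] == 0:
                del self.data[name]
                del self.refcount[name]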
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Chaitanya Huilgol
> Sent: Monday, June 29, 2015 11:20 PM
> To: James (Fei) Liu-SSI; Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Below is an alternative idea, at a very high level, for dedup with Ceph without the need for a centralized hash index:
>
> - Dedupe is set as a pool property
> Write:
> - Write arrives at the primary OSD/PG
> - Data is segmented (rabin/static) and a secure hash is computed
> - A manifest is created with the offset/len/hash for all the segments
> - OSD/PG sends a rados write with a special name <__known__prefix><secure hash> for all segments
> - The PG receiving the dedup write will:
> 1. check for object presence and create the object if not present
> 2. if the object is already present, then a reference count is incremented (check and increment needs to be atomic)
> - Response is received by the original primary PG for all segments
> - Primary PG writes the manifest to local and replicas or EC members
> - Response sent to client
>
> Read:
> - Read received at primary PG
> - Reads manifest object
> - Sends reads for each segment object <__known__prefix><secure hash>
> - Coalesces all the responses to build the required data
> - Responds to client
>
> Pros:
> No need for a centralized hash index, so in line with Ceph's no-bottleneck philosophy
>
> Cons:
> Some PGs may get overloaded due to frequently occurring segment patterns
> Latency and increased traffic on the network
>
> Regards,
> Chaitanya
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, June 30, 2015 2:25 AM
> To: Haomai Wang
> Cc: ceph-devel
> Subject: RE: Inline dedup/compression
>
> Hi Haomai,
> Thanks for moving the idea forward. Regarding the compression, however: if we do compression at the client level, it is not global, and the compression is only applied by the local client, am I right?
> I think there are pros and cons to both solutions, and we can get into more detail on each of them.
> I really like your idea for dedupe on the OSD side, by the way. Let me think more about it.
>
> Regards,
> James
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> Sent: Friday, June 26, 2015 8:55 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
> > Hi Haomai,
> > Thanks for your response as always. I agree compression is the comparably easier task, but it is still very challenging in terms of implementation, no matter where we implement it. The client side (RBD, radosgw or CephFS) or the PG would be a somewhat better place to implement it in terms of efficiency and cost reduction, before the data is replicated to the other OSDs, for two reasons:
> > 1. Keeping the data consistent among the OSDs in one PG
> > 2. Saving computing resources
> >
> > IMHO, the compression should be accomplished before replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore.
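As a rough illustration of the "compress before replication comes into play" point, here is a sketch that compresses a payload in fixed-size units so every replica stores the same compressed bytes and the work is done once; the 64 KiB unit and the length-prefixed framing are arbitrary assumptions for the example, not anything Ceph defines.

# Illustrative unit-based compression; the unit size is a tunable trade-off.
import struct
import zlib

UNIT = 64 * 1024  # compression unit; the right size is workload-dependent

def compress_units(data):
    out = []
    for off in range(0, len(data), UNIT):
        comp = zlib.compress(data[off:off + UNIT], 1)
        out.append(struct.pack("<I", len(comp)) + comp)  # length-prefixed unit
    return b"".join(out)

def decompress_units(blob):
    out, off = [], 0
    while off < len(blob):
        (clen,) = struct.unpack_from("<I", blob, off)
        off += 4
        out.append(zlib.decompress(blob[off:off + clen]))
        off += clen
    return b"".join(out)

Keeping a unit boundary rather than compressing the whole object preserves the option of partial reads, which is one reason the unit-size question discussed next matters.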
> > In terms of the unit size of compression, it really depends on the workload and on which layer we implement it in.
> >
> > About inline deduplication: it will dramatically increase the complexity if we bring replication and erasure coding into consideration.
> >
> > However, before we talk about implementation, it would be great if we could understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits of dedupe/compression; the side effects are the performance hit and the need for more computing resources. It would be great if we could look at the problem from 30,000 feet and see the whole picture for Ceph. Please correct me if I am wrong.
>
> Actually we may have some tricks to reduce the performance hurt of compression. As Joe mentioned, we can compress the slave (replica) PG data to avoid the performance hit, but it may increase the complexity of recovery and PG remapping. Another, more detailed implementation option: if we begin to compress data in the messenger, the OSD thread and PG thread won't access the data for a normal client op, so maybe we can make the compression parallel with PG processing; the journal thread would get the compressed data at the end.
>
> The effectiveness of compression is also a concern: doing compression in RADOS may not give the best compression result. If we do compression in libcephfs, librbd and radosgw and keep RADOS unaware of compression, it may be simpler and we get file/block/object-level compression. Wouldn't that be better?
>
> About dedup, my current idea is that we could set up a memory pool on the OSD side for checksum storage. Then, on the client side, we calculate a hash of the object data and map that to a PG instead of mapping the object name, so an object always lands on an OSD which is also responsible for its dedup storage. It could also be distributed at the pool level.
>
> > By the way, both of the software-defined storage startups Hedvig and Springpath provide inline dedupe/compression. It is not an apples-to-apples comparison, but it is a good reference: datacenters need cost-effective solutions.
> >
> > Regards,
> > James
> >
> > -----Original Message-----
> > From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> > Sent: Thursday, June 25, 2015 8:08 PM
> > To: James (Fei) Liu-SSI
> > Cc: ceph-devel
> > Subject: Re: Inline dedup/compression
> >
> > On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
> >> Hi Cephers,
> >> It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we also lose storage efficiency and pay the cost associated with it; the two goals somewhat contradict each other. Still, I am curious how other Cephers think about this question.
> >> Is there any plan for Cephers to do anything regarding inline dedupe/compression, beyond the features brought by the local node itself, like btrfs?
> >
> > Compression is easier to implement in RADOS than dedup. The most important question about compression is where we begin to compress: the client, the PG, or the objectstore. Then we need to decide how large the compression unit is. Of course, compression and dedup would both like a keyvalue-like storage API to use, but I think it is not difficult to use the existing objectstore API.
> >
> > Dedup is more feasible to implement in the local OSD than across the whole pool or cluster, and if we want to do dedup at the pool level, we need to do dedup from the client.
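A client-side sketch of the dedup placement idea above: derive the RADOS object name from the content hash, so CRUSH deterministically places identical data on the same PG/OSD and the dedup check can stay local to that OSD. This uses the python-rados bindings; the "dedup." naming scheme and the pool name are assumptions, and reference counting and race handling are omitted.

# Content-addressed naming so duplicates land on the same PG; sketch only.
import hashlib

def write_dedup(ioctx, data):
    """Write 'data' under a content-derived name and return that name."""
    name = "dedup." + hashlib.sha256(data).hexdigest()
    # If an object with this name already exists, the content is (with
    # overwhelming probability) identical, so rewriting is harmless; a real
    # implementation would bump a reference count on the OSD instead.
    ioctx.write_full(name, data)
    return name

# Usage sketch (assumes a reachable cluster and an existing pool "dedup-pool"):
# import rados
# cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
# cluster.connect()
# ioctx = cluster.open_ioctx("dedup-pool")
# key = write_dedup(ioctx, b"some payload")
# ioctx.close()
# cluster.shutdown()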
> >>
> >> Regards,
> >> James
>
> > --
> > Best Regards,
> >
> > Wheat
>
> --
> Best Regards,
>
> Wheat

--
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html