RE: Inline dedup/compression

Hi Chaitanya,
   Very interesting thoughts. I am not sure whether I have understood all of them yet, so here are several questions about the solution you provided; they might be a bit detailed.

    Regards,
    James

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
[James] Does OSD/PG here mean the PGBackend?

- Data is segmented (rabin/static) and secure hash computed
[James] Which component in the OSD is going to do the data segmentation and hash computation?

- A manifest is created with the offset/len/hash for all the segments
[James] Is the manifest going to be part of the object's xattrs? Where are you going to store the manifest? (I sketch how I read these write steps right after the list below.)

- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
[James] What do you mean by a RADOS write here? Where do all the segments named by their secure hash get written to?

- PG receiving dedup write will:
        1. Check for object presence and create the object if not present
        2. If the object is already present, then a reference count is incremented (the check and increment need to be atomic)
[James] That makes sense, but I was wondering whether the unit of dedup is the segment or the object. If it is object based, it totally makes sense; but then why do we need segments with a manifest?

- Response is received by original primary PG for all segments
[James] What response?

- Primary PG writes the manifest to local and replicas or EC members
[James] What about the deduped segment data if it is not present on the replicas?
 
- Response sent to client
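To make my questions concrete, here is how I currently read the write path, as a minimal sketch (illustrative Python with static segmentation only; rados_write and all names below are placeholders, not real Ceph or librados interfaces):

import hashlib
import json

SEGMENT_SIZE = 64 * 1024    # illustrative static segment size, not a Ceph default
KNOWN_PREFIX = "__dedup__"  # placeholder for the <__known__prefix> in the proposal

def build_manifest(data):
    # Static (fixed-size) segmentation; a rabin-fingerprint cutter would
    # produce variable-length segments instead.
    manifest = []
    for off in range(0, len(data), SEGMENT_SIZE):
        seg = data[off:off + SEGMENT_SIZE]
        manifest.append({
            "offset": off,
            "len": len(seg),
            "hash": hashlib.sha256(seg).hexdigest(),
        })
    return manifest

def dedup_write(object_name, data, rados_write):
    # rados_write(name, payload) stands in for the write the primary OSD/PG
    # would send to the PG owning that name; it is not a real librados call.
    manifest = build_manifest(data)
    for entry in manifest:
        seg = data[entry["offset"]:entry["offset"] + entry["len"]]
        # The receiving PG does the create-or-increment-refcount step.
        rados_write(KNOWN_PREFIX + entry["hash"], seg)
    # Only the manifest is stored under the logical object name and
    # replicated / EC-encoded by the primary PG.
    rados_write(object_name, json.dumps(manifest).encode())

If the manifest stays small, it could just as well live in the object's xattrs or omap rather than in the object data, which is what my question above is getting at.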

Read:
- Read received at primary PG
[James] Can the read only fetch data from the primary PG?
- Reads manifest object

- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the responses to build the required data
- Responds to client
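And the read path as I read it, reusing the same placeholder prefix (again illustrative; rados_read stands in for the per-segment read, it is not a real librados call):

KNOWN_PREFIX = "__dedup__"  # same placeholder prefix as in the write sketch

def dedup_read(manifest, rados_read):
    # Fetch every segment named in the manifest and stitch the object back
    # together at the recorded offsets.
    size = max((e["offset"] + e["len"] for e in manifest), default=0)
    buf = bytearray(size)
    for entry in manifest:
        seg = rados_read(KNOWN_PREFIX + entry["hash"])
        buf[entry["offset"]:entry["offset"] + entry["len"]] = seg
    return bytes(buf)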


Pros:
No need for a centralized hash index, so this is in line with Ceph's no-bottleneck philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns
Latency and increased traffic on the network
   


-----Original Message-----
From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@xxxxxxxxxxx] 
Sent: Tuesday, June 30, 2015 8:50 AM
To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression


- Reference count has to be maintained as an attribute of the object
- As mentioned in the write workflow, duplicate segment writes increment the reference count
- An object delete results in deletes on the constituent segments listed in the object's segment manifest
- A segment-object delete decrements the reference count and removes the segment when no references remain
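A minimal sketch of that bookkeeping, with an in-memory dict standing in for the segment objects (illustrative Python; inside an OSD the check-and-modify would have to be one atomic transaction on the object's attributes):

# Stand-in for the per-segment objects and their refcount attribute.
segments = {}  # segment object name -> {"data": bytes, "refs": int}

def segment_write(name, data):
    # Create the segment object, or bump its reference count if it already exists.
    entry = segments.get(name)
    if entry is None:
        segments[name] = {"data": data, "refs": 1}
    else:
        entry["refs"] += 1

def segment_delete(name):
    # Drop one reference; remove the segment once nothing points at it.
    entry = segments[name]
    entry["refs"] -= 1
    if entry["refs"] == 0:
        del segments[name]

def object_delete(manifest, segment_name):
    # An object delete walks the manifest and drops one reference per segment;
    # segment_name(hash) maps a segment hash to its object name.
    for seg in manifest:
        segment_delete(segment_name(seg["hash"]))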

Regards,
Chaitanya

-----Original Message-----
From: Allen Samuels
Sent: Tuesday, June 30, 2015 9:02 PM
To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

This covers the read and write; what about the delete? One of the major issues with dedup, whether global or local, is addressing the inherent ref-counting associated with sharing pieces of storage.

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Chaitanya Huilgol
Sent: Monday, June 29, 2015 11:20 PM
To: James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Below is an alternative idea, at a very high level, for dedup in Ceph without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
- PG receiving dedup write will:
        1. Check for object presence and create the object if not present
        2. If the object is already present, then a reference count is incremented (the check and increment need to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- sends reads for each segment object <__known__prefix><secure hash>
- coalesces all the responses to build the required data
- Responds to client


Pros:
No need for a centralized hash index, so this is in line with Ceph's no-bottleneck philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns
Latency and increased traffic on the network

Regards,
Chaitanya

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, June 30, 2015 2:25 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Haomai,
  Thanks for moving the idea forward. Regarding compression: if we do compression at the client level, it is not global, and the compression is only applied by the local client, am I right? I think there are pros and cons to both solutions, and we can get into more detail on each of them.
  I really like your idea of doing dedup on the OSD side, by the way. Let me think more about it.

 Regards,
 James

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
Sent: Friday, June 26, 2015 8:55 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel
Subject: Re: Inline dedup/compression

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
> Hi Haomai,
>   Thanks for your response, as always. I agree compression is a comparatively easier task, but it is still very challenging to implement no matter where we implement it. The client side (RBD, radosgw, or CephFS) or the PG would be a somewhat better place to implement it, in terms of efficiency and cost reduction, before the data is replicated to the other OSDs, for two reasons:
> 1. It keeps the data consistent among the OSDs in one PG
> 2. It saves computing resources
>
> IMHO, the compression should be done before replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore. As for the unit size of compression, it really depends on the workload and on which layer we implement it in.
>
> About inline deduplication, the complexity will increase dramatically once we bring replication and erasure coding into the picture.
>
> However, before we talk about implementation, it would be great if we could understand the pros and cons of inline dedup/compression. We all understand the benefits; the downside is a performance hit and the need for more computing resources. It would be great to look at the problem from 30,000 feet to see the whole picture for Ceph. Please correct me if I am wrong.

Actually, we may have some tricks to reduce the performance hit of compression. As Joe mentioned, we could compress only the replica (slave) PG data to avoid the performance hit, but that may increase the complexity of recovery and PG remapping. Another implementation detail: if we begin to compress data in the messenger, the OSD thread and PG thread won't touch the data for a normal client op, so maybe we can run compression in parallel with PG processing; the journal thread then receives the already-compressed data.

The effectiveness of compression is also a concern; doing compression in RADOS may not give the best compression result. If we do compression in libcephfs, librbd, and radosgw and keep RADOS unaware of it, it may be simpler, and we get file/block/object-level compression. Wouldn't that be better?
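A rough sketch of that client-side idea, assuming the client itself marks compressed payloads so RADOS stays unaware (illustrative Python; the MAGIC marker and function names are assumptions for the example, not any real librbd/libcephfs/radosgw API, and a real client would more likely keep a compressed flag in its own metadata):

import zlib

MAGIC = b"CMP1"  # hypothetical in-band marker for compressed payloads

def compress_for_write(data):
    # Compress on the client side; keep the raw data if compression doesn't pay off.
    packed = zlib.compress(data)
    if len(MAGIC) + len(packed) < len(data):
        return MAGIC + packed
    return data

def decompress_after_read(payload):
    # Undo compress_for_write(); RADOS never looks inside the payload.
    if payload.startswith(MAGIC):
        return zlib.decompress(payload[len(MAGIC):])
    return payload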

About dedup, my current idea is that we could set up a memory pool on the OSD side to store checksums. Then, at the client side, we map an object to a PG based on a hash of the object data instead of the object name, so an object always lands on an OSD that is also responsible for its dedup storage. This could also be distributed at the pool level.
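A toy illustration of that mapping (illustrative Python; real placement in Ceph goes through the object-to-PG hash and CRUSH, this only shows the idea of hashing the content rather than the name):

import hashlib

def pg_for_name(name, pg_num):
    # Conventional placement: the PG is derived from the object name.
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:4], "little") % pg_num

def pg_for_content(data, pg_num):
    # Content-derived placement: identical data always maps to the same PG,
    # so the OSD holding it can also own the checksum store used for dedup.
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "little") % pg_num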


>
> By the way, software-defined storage startups like Hedvig and Springpath both provide inline dedup/compression. It is not an apples-to-apples comparison, but it is a good reference point. Data centers need cost-effective solutions.
>
> Regards,
> James
>
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
> Sent: Thursday, June 25, 2015 8:08 PM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel
> Subject: Re: Inline dedup/compression
>
> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
>> Hi Cephers,
>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we also lose storage efficiency and take on the associated cost; the two goals somewhat contradict each other. I am curious how other Cephers think about this question.
>>    Are there any plans for Cephers to do anything about inline dedup/compression, apart from the features provided by the local node itself, like BTRFS?
>
> Compression is easier to implement in RADOS than dedup. The most important question about compression is where we begin to compress: the client, the PG, or the objectstore. Then we need to decide how large the compression unit is. Of course, compression and dedup would both like a key-value-like storage API to use, but I think it's not difficult to work with the existing objectstore API.
>
> Dedup is more feasible to implement in the local OSD than across the whole pool or cluster, and if we want to do dedup at the pool level, we need to do dedup from the client.
>
>>
>>   Regards,
>>   James
>
>
>
> --
> Best Regards,
>
> Wheat



--
Best Regards,

Wheat





