RE: Inline dedup/compression

XFS shouldn't have any trouble with the "holes" scheme. I don't know BTRFS as well, but I doubt it's significantly different.

If we assume that the logical address space of a file is broken up into fixed-size chunks on fixed-size boundaries (presumably a power of 2), then the implementation is quite straightforward.

Picking the chunk size will be a key issue for performance. Unfortunately, there are competing desires.

For best space utilization, you'll want the chunk size to be large, because on average you'll lose 1/2 of a file system sector/block for each chunk of compressed data.

For best R/W performance, you'll want the chunk size to be small, because logically the file I/O size is equal to a chunk, i.e., on a write you might have to read the corresponding chunk, decompress it, insert the new data, and recompress it. This gets super duper ugly on FileStore because you can't afford to crash during the re-write update and risk a partially updated chunk (this will give you garbage when you decompress it). This means that you'll have to log the entire chunk even if you're only re-writing a small portion of it, hence the desire to make the chunk size small. I'm not as familiar with NewStore, but I don't think it's fundamentally much better. Basically any form of sub-chunk write operation stinks in performance. Sub-chunk read operations aren't too bad unless the chunk size is ridiculously large.
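
To make the read-modify-write cost concrete, here is a minimal sketch of a sub-chunk overwrite, assuming fixed-size chunks, zlib, and a write confined to a single chunk; read_chunk/write_chunk are hypothetical hooks into whatever the backend provides, not existing interfaces:

    import zlib

    CHUNK_SIZE = 64 * 1024  # hypothetical chunk size for the sketch

    def rewrite_range(read_chunk, write_chunk, offset, data):
        # read_chunk(idx) returns the compressed chunk (or None for a hole);
        # write_chunk(idx, blob) must be journaled/atomic -- crashing in the
        # middle would leave a chunk that no longer decompresses.
        idx = offset // CHUNK_SIZE
        pos = offset % CHUNK_SIZE
        old = read_chunk(idx)
        plain = bytearray(zlib.decompress(old)) if old else bytearray(CHUNK_SIZE)
        plain[pos:pos + len(data)] = data                # splice in the new bytes
        write_chunk(idx, zlib.compress(bytes(plain)))    # the whole chunk is rewritten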

For best compression ratios, you'll want the chunk size to be at least equal to the compressor's history size, if not 2 or 3 times larger (32K history window for zlib; snappy is 32K, or 64K for the latest version).

The partial-block write problem doesn't exist for RGW objects, and its objects are probably already compressed. That means you'll want to be able to convey the compression parameters to RADOS so that the backend knows what to do.

I would add a per-file attribute that encodes the compression parameters: compression algorithm (zlib, snappy, ...) and chunk size. That would also provide backward compatibility and allow per-object compression diversity.
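
As an illustration only (the attribute layout below is an assumption, not an existing Ceph interface), the parameters could be packed into something as simple as:

    import json

    def make_compression_xattr(algorithm="zlib", chunk_size=64 * 1024):
        # Hypothetical per-object attribute; an object without it is simply
        # uncompressed, which keeps old data readable and lets every object
        # pick its own algorithm and chunk size.
        return json.dumps({"alg": algorithm, "chunk_size": chunk_size}).encode()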

Then you'd want to add verbiage to the individual access schemes to allow/disallow compression. For file systems, you'd want that on a per-directory basis or, perhaps even better, as a set of regular expressions.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx] 
Sent: Thursday, August 20, 2015 8:01 PM
To: Allen Samuels
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: Re: Inline dedup/compression

Sorry, it should be this blog:
http://mysqlserverteam.com/innodb-transparent-page-compression/

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
> I found a blog
> (http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
> about MySQL InnoDB transparent compression. It's surprising that
> InnoDB does it at a low level (just like filestore in ceph) and relies
> on the filesystem's file-hole feature. I'm very skeptical about the
> performance after storing lots of *small* hole files on the fs. If it
> proves reliable, it would be easy for filestore/newstore to implement a similar feature.
>
> On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels <Allen.Samuels@xxxxxxxxxxx> wrote:
>> For non-overwriting, relatively large objects, this scheme works fine. Unfortunately, the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.), and in order for this to provide good deduplication, you'll need a block size that's equal to or smaller than the cluster size of the file system mounted on the block device. That means your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC, since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O, as the basic data/metadata ratio is completely skewed (both for static storage and dynamic I/O count).
>>
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol
>> Sent: Thursday, July 02, 2015 3:50 AM
>> To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi James et al.,
>>
>> Here is an example for clarity,
>> 1. Client Writes object  object.abcd
>> 2. Based on the crush rules, say OSD.a is the primary OSD which receives the write
>> 3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments; object.abcd is now represented by a manifest object with the list of segment hashes and lengths:
>>  [Header]
>>  [Seg1_sha, len]
>>  [Seg2_sha, len]
>>  [Seg3_sha, len]
>>  ...
>> 4. OSD.a writes each segment as a new object in the cluster with object name <reserved_dedupe_prefix><sha>
>> 5. The dedupe object write is treated differently from regular object writes: if the object is already present then an object reference count is incremented and the object is not overwritten - this forms the basis of the dedupe logic. Multiple objects with one or more identical constituent segments start sharing the segment objects.
>> 6. Once all the segments are successfully written, the object 'object.abcd' is just a stub object with the segment manifest as described above and goes through a regular object write sequence
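>>
>> A rough sketch of steps 3 and 4 above, assuming static segmentation and SHA-256 (the reserved prefix, segment size, and cluster_write hook are placeholders, not actual Ceph interfaces):
>>
>>     import hashlib
>>
>>     DEDUPE_PREFIX = "__dedupe__"   # placeholder for the reserved prefix
>>     SEG_SIZE = 8 * 1024            # static segmentation size for the sketch
>>
>>     def build_manifest(data):
>>         # One (sha, len) entry per segment; this list is the stub/manifest object.
>>         return [(hashlib.sha256(data[off:off + SEG_SIZE]).hexdigest(),
>>                  len(data[off:off + SEG_SIZE]))
>>                 for off in range(0, len(data), SEG_SIZE)]
>>
>>     def write_segments(cluster_write, data):
>>         # cluster_write(name, blob) stands in for a rados object write; the real
>>         # dedupe write path must check-and-increment the refcount atomically.
>>         for off in range(0, len(data), SEG_SIZE):
>>             seg = data[off:off + SEG_SIZE]
>>             cluster_write(DEDUPE_PREFIX + hashlib.sha256(seg).hexdigest(), seg)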
>>
>> Partial writes on objects will be complicated:
>> - Partially affected segments will have to be read, and the segmentation logic has to be run from the first to the last affected segment boundary
>> - New segments will be written
>> - Old overwritten segments have to be deleted
>> - The merged manifest of the object has to be written
>>
>> All this will need the protection of the PG lock. Also, an additional journaling mechanism will be needed to recover from cases where the osd goes down before writing all the segments.
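>>
>> For the partial-write case, something along these lines (a sketch only, with the manifest as a list of (sha, len) pairs) would be needed to find which entries are affected before re-segmenting:
>>
>>     def affected_segment_range(manifest, offset, length):
>>         # Returns the indices of the first and last segments touched by a
>>         # write of 'length' bytes at 'offset'.  Those segments are read back,
>>         # re-segmented together with the new data, and the manifest is
>>         # rewritten, all under the PG lock.
>>         pos, first, last = 0, None, None
>>         for i, (_sha, seg_len) in enumerate(manifest):
>>             if first is None and pos + seg_len > offset:
>>                 first = i
>>             if pos < offset + length:
>>                 last = i
>>             pos += seg_len
>>         return first, last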
>>
>> Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects.
>> The manifest object fits quite well into the object-redirects scheme of things; the idea is that, when an object is moved out of the base tier, you have the option to create a dedupe stub object and write the individual segments into the cold backend tier with a rados plugin.
>>
>> Remaining responses inline.
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: James (Fei) Liu-SSI [mailto:james.liu@xxxxxxxxxxxxxxx]
>> Sent: Wednesday, July 01, 2015 4:00 AM
>> To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Chaitanya,
>>    Very interesting thoughts. I am not sure whether I got all of them or not. Here are several questions about the solution you provided; they might be a little bit detailed.
>>
>>     Regards,
>>     James
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> [James] Does the OSD/PG mean PG Backend over here?
>> [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component
>>
>> - Data is segmented (rabin/static) and secure hash computed [James] In which component of the OSD are you going to do the data segmentation and hash computation?
>> [Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock.  Probably in the do_request() path?
>>
>> - A manifest is created with the offset/len/hash for all the segments [James] The manifest is going to be part of xattr of object? Where are you going to save manifest?
>> [Chaitanya] The manifest is a stub object with the constituent segments list
>>
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments [James] What do you mean by rados write?  Where do all the segments with the secure hash signature get written to?
>> [Chaitanya] All segments are unique objects with the above mentioned naming scheme, they get written back into the cluster as a regular client rados object write
>>
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is incremented (the check and increment need to be atomic) [James] It makes sense. But I was wondering whether the unit for dedupe is the segment or the object? If it's object-based, it totally makes sense. However, why do we need to have segments with a manifest?
>>
>> - Response is received by original primary PG for all segments [James] What response?
>> [Chaitanya] Write response indicating the status of the segment object write
>>
>> - Primary PG writes the manifest to local and replicas or EC members [James] How about the dedupe data if the data is not present in replicas?
>> [Chaitanya] I am sorry, I did not get your question. The manifest object gets written to the primary and the replicas, or encoded and written to the EC members; it is afforded the protection policy set for the pool. The same is the case with the individual constituent segments.
>>
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> [James]  The read can only fetch data from Primary PG?
>> - Reads manifest object
>>
>> - sends reads for each segment object <__know_prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> No need for a centralized hash index, so this is in line with Ceph's no-bottleneck philosophy
>>
>> Cons:
>> Some PGs may get overloaded due to frequently occurring segment patterns
>> Latency and increased traffic on the network
>>
>>
>>
>> -----Original Message-----
>> From: Chaitanya Huilgol [mailto:Chaitanya.Huilgol@xxxxxxxxxxx]
>> Sent: Tuesday, June 30, 2015 8:50 AM
>> To: Allen Samuels; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>>
>> - Reference count has to be maintained as an attribute of the object
>> - As mentioned in the write workflow, duplicate segment writes increment the reference count
>> - Object Delete would result in delete on constituent segments listed in the object segment manifest
>> - Segment object delete will decrement reference count and remove the segment when there are no more references present
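>>
>> A sketch of that delete path (helper names are placeholders; the decrement-and-maybe-remove must be as atomic on the segment's PG as the increment is on the write path):
>>
>>     def delete_object(read_manifest, get_refcount, set_refcount, remove_obj, name):
>>         for sha, _length in read_manifest(name):
>>             seg = "__dedupe__" + sha      # same reserved prefix as the write path
>>             refs = get_refcount(seg) - 1
>>             if refs == 0:
>>                 remove_obj(seg)           # last reference gone, reclaim the segment
>>             else:
>>                 set_refcount(seg, refs)
>>         remove_obj(name)                  # finally drop the manifest stub itself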
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: Allen Samuels
>> Sent: Tuesday, June 30, 2015 9:02 PM
>> To: Chaitanya Huilgol; James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> This covers the read and write; what about the delete? One of the major issues with dedupe, whether global or local, is addressing the inherent ref-counting associated with sharing pieces of storage.
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@xxxxxxxxxxx
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Chaitanya Huilgol
>> Sent: Monday, June 29, 2015 11:20 PM
>> To: James (Fei) Liu-SSI; Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Below is an alternative idea, at a very high level, for dedup with ceph without the need for a centralized hash index:
>>
>> - Dedupe is set as a pool property
>> Write:
>> - Write arrives at the primary OSD/pg
>> - Data is segmented (rabin/static) and secure hash computed
>> - A manifest is created with the offset/len/hash for all the segments
>> - OSD/pg sends rados write with a special name <__known__prefix><secure hash> for all segments
>> - PG receiving dedup write will:
>>         1. check for object presence and create object if not present
>>         2. If object is already present, then a reference count is incremented (the check and increment need to be atomic)
>> - Response is received by original primary PG for all segments
>> - Primary PG writes the manifest to local and replicas or EC members
>> - Response sent to client
>>
>> Read:
>> - Read received at primary PG
>> - Reads manifest object
>> - sends reads for each segment object <__know_prefix><secure hash>
>> - coalesces all the response to build the required data
>> - Responds to client
>>
>>
>> Pros:
>> No need for a centralized hash index, so this is in line with Ceph's no-bottleneck philosophy
>>
>> Cons:
>> Some PGs may get overloaded due to frequently occurring segment patterns
>> Latency and increased traffic on the network
>>
>> Regards,
>> Chaitanya
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of James (Fei) Liu-SSI
>> Sent: Tuesday, June 30, 2015 2:25 AM
>> To: Haomai Wang
>> Cc: ceph-devel
>> Subject: RE: Inline dedup/compression
>>
>> Hi Haomai,
>>   Thanks for moving the idea forward. Regarding compression: if we do compression at the client level, it is not global, and the compression is only applied by the local client, am I right? I think there are pros and cons to both solutions, and we can get into more detail on each.
>>   I really like your idea for dedupe on the OSD side, by the way. Let me think more about it.
>>
>>  Regards,
>>  James
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
>> Sent: Friday, June 26, 2015 8:55 PM
>> To: James (Fei) Liu-SSI
>> Cc: ceph-devel
>> Subject: Re: Inline dedup/compression
>>
>> On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
>>> Hi Haomai,
>>>   Thanks for your response as always. I agree compression is the comparatively easier task, but it is still very challenging to implement no matter where we do it. The client side (RBD, RGW or CephFS) or the PG would be a somewhat better place to implement it in terms of efficiency and cost reduction, before the data is replicated to the other OSDs, for two reasons:
>>> 1. Keeping the data consistent among the OSDs in one PG
>>> 2. Saving computing resources
>>>
>>> IMHO, the compression should be done before replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore. As for the compression unit size, it really depends on the workload and on which layer we implement it in.
>>>
>>> As for inline deduplication, the complexity will increase dramatically once we bring replication and erasure coding into consideration.
>>>
>>> However, before we talk about implementation, it would be great if we could understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits; the side effects are a performance hit and the need for more computing resources. It would be great to understand the problem from 30,000 feet, looking at the whole Ceph picture. Please correct me if I am wrong.
>>
>> Actually, we may have some tricks to reduce the performance hit of compression. As Joe mentioned, we can compress the replica (slave) pg data to avoid the performance hit, but that may increase the complexity of recovery and pg remapping. Another implementation detail: if we begin to compress data in the messenger, the osd thread and pg thread won't touch the data for a normal client op, so maybe we can make compression run in parallel with pg processing, and the journal thread will get the compressed data at the end.
>>
>> The effectiveness of compression is also a concern; doing compression in rados may not give the best compression result. If we can do compression in libcephfs, librbd and radosgw and keep rados unaware of compression, it may be simpler and we would get file/block/object level compression. Wouldn't that be better?
>>
>> About dedup, my current idea is that we could set up a memory pool on the osd side for checksum storage. Then we hash the object data and map that to a PG, instead of mapping the object name, at the client side, so an object would always land on an osd that is also responsible for its dedup storage. It could also be distributed at the pool level.
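>>
>> Roughly, the placement idea looks like this sketch (the modulo is just a stand-in for the real CRUSH/PG mapping, and the fingerprint choice is an assumption):
>>
>>     import hashlib
>>
>>     def dedupe_placement(data, pg_num):
>>         # Place by a fingerprint of the data instead of the object name, so
>>         # identical contents always land on the same osd, which can then
>>         # keep a local checksum table for dedup.
>>         fingerprint = hashlib.sha256(data).hexdigest()
>>         return fingerprint, int(fingerprint, 16) % pg_num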
>>
>>
>>>
>>> By the way, software-defined storage startups like Hedvig and Springpath both provide inline dedupe/compression. It is not an apples-to-apples comparison, but it is a good reference. Datacenters need cost-effective solutions.
>>>
>>> Regards,
>>> James
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx]
>>> Sent: Thursday, June 25, 2015 8:08 PM
>>> To: James (Fei) Liu-SSI
>>> Cc: ceph-devel
>>> Subject: Re: Inline dedup/compression
>>>
>>> On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
>>>> Hi Cephers,
>>>>     It is not easy to ask when Ceph is going to support inline dedup/compression across OSDs in RADOS, because it is neither an easy task nor an easy question to answer. Ceph provides replication and EC for performance and failure recovery, but we also lose storage efficiency and pay the cost associated with that. The two goals are somewhat at odds with each other, but I am curious how other Cephers think about this question.
>>>>    Are there any plans among Cephers to do anything about inline dedupe/compression beyond the features brought by the local node itself, like BTRFS?
>>>
>>> Compression is easier to implement in rados than dedup. The most important question about compression is where we begin to compress: client, pg or objectstore. Then we need to decide how large the compression unit is. Of course, compression and dedup would both like a keyvalue-like storage api to use, but I think it's not difficult to use the existing objectstore api.
>>>
>>> Dedup is more feasible to implement in the local osd than across the whole pool or cluster, and if we want to do dedup at the pool level, we need to do dedup from the client.
>>>
>>>>
>>>>   Regards,
>>>>   James
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat