> -----Original Message----- > From: Sage Weil [mailto:sage@xxxxxxxxxxxx] > Sent: Tuesday, March 15, 2016 12:12 PM > To: Igor Fedotov <ifedotov@xxxxxxxxxxxx> > Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph- > devel@xxxxxxxxxxxxxxx> > Subject: Re: Adding compression support for bluestore. > > On Fri, 26 Feb 2016, Igor Fedotov wrote: > > Allen, > > sounds good! Thank you. > > > > Please find the updated proposal below. It extends proposed Block Map > > to contain "full tuple". > > Some improvements and better algorithm overview were added as well. > > > > Preface: > > Bluestore keeps object content using a set of dispersed extents > > aligned by 64K (configurable param). It also permits gaps in object > > content i.e. it prevents storage space allocation for object data regions > unaffected by user writes. > > A sort of following mapping is used for tracking stored object content > > disposition (actual current implementation may differ but > > representation below seems to be sufficient for our purposes): > > Extent Map > > { > > < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ... > > < logical offset N -> extent N 'physical' offset, extent N size > } > > > > > > Compression support approach: > > The aim is to provide generic compression support allowing random > > object read/write. > > To do that compression engine to be placed (logically - actual > > implementation may be discussed later) on top of bluestore to > > "intercept" read-write requests and modify them as needed. > > I think it is going to make the most sense to do the compression and > decompression in _do_write and _do_read (or helpers), within bluestore-- > not in some layer that sits above it but communicates metadata down to it. I agree. I don't see a real gain from separation, plus there is incremental complexity that is caused by the separation. > > > The major idea is to split object content into variable sized logical > > blocks that are compressed independently. Resulting block offsets and > > sizes depends mostly on client writes spreading and block merging > > algorithm that compression engine can provide. Maximum size of each > > block to be limited ( MAX_BLOCK_SIZE, e.g. 4 Mb ) to prevent from huge > > block processing when handling read/overwrites. > > bluestore_max_compressed_block or similar -- it should be a configurable > value like everything else. > > > Due to compression each block can potentially occupy smaller store > > space comparing to its original size. Each block is addressed using > > original data offset ( AKA 'logical offset' above ). After compression > > is applied each compressed block is written using the existing > > bluestore infra. Updated write request to the bluestore specifies the > > block's logical offset similar to the one from the original request but data > length can be reduced. > > As a result stored object content: > > a) Has gaps > > b) Uses less space if compression was beneficial enough. > > > > To track compressed content additional block mapping to be introduced: > > Block Map > > { > > < logical block offset, logical block size -> compression method, > > target offset, compressed size > ... > > < logical block offset, logical block size -> compression method, > > target offset, compressed size > } Note 1: Actually for the current > > proposal target offset is always equal to the logical one. 
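For illustration only -- the type and field names below are made up, not actual bluestore types -- the proposed Block Map record could look roughly like this in C++:

  #include <cstdint>
  #include <map>
  #include <utility>

  struct compressed_block_t {
    uint8_t  method;           // compression algorithm: none/zlib/snappy/...
    uint64_t target_offset;    // per Note 1, currently always == logical offset
    uint32_t compressed_size;  // bytes actually occupied after compression
  };
  // key: <logical block offset, logical block size>
  using block_map_t =
    std::map<std::pair<uint64_t, uint32_t>, compressed_block_t>;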
It's > > crucial that compression doesn't perform complete > > address(offset) translation but simply brings "space saving holes" > > into existing object content layout. This eliminates the need for > > significant bluestore interface modifications. > > I'm not sure I understand what the target offset field is needed for, then. In > practice, this will mean expanding bluestore_extent_t to have > compression_type and decompressed_length fields. The logical offset is the > key in the block_map map<>... > > > To effectively use store space one needs an additional ability from > > the bluestore interface - release logical extent within object content > > as well as underlying physical extents allocated for it. In fact > > current interface (Write > > request) allows to "allocate" ( by writing data) logical extent while > > leaving some of them "unallocated" (by omitting corresponding range). > > But there is no release procedure - move extent to "unallocated" > > space. Please note - this is mainly about logical extent - a region > > within object content. Means for allocate/release physical extents (regions > at block device) are present. > > In case of compression such logical extent release is most probably > > paired with writing to the same ( but reduced ) extent. And it looks > > like there is no need for standalone "release" request. So the > > suggestion is to introduce extended write request (WRITE+RELEASE) that > > releases specified logical extent and writes new data block. The key > parameters for the request are: > > DATA2WRITE_LOFFSET, DATA2WRITE_SIZE, RELEASED_EXTENT_LOFFSET, > > RELEASED_EXTENT_SIZE > > where: > > assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET) > > assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >= > > DATA2WRITE_LOFFSET + > > DATA2WRITE_SIZE) > > I'm not following this. You can logically release a range with zero() (it will > mostly deallocate extents, but write zeros into a partial extent). > But it sounds like this is only needed if you want to manage the compressed > extents above BlueStore... which I think is going to be a more difficult > approach. Yes, it seems this is the extra complexity that you get from trying to separate compression into a layer "above" BlueStore. > > > Due to the fact that bluestore infrastructure tracks extents with some > > granularity (bluestore_min_alloc_size, 64Kb by default) > > RELEASED_EXTENT_LOFFSET & RELEASED_EXTENT_SIZE should by aligned > at > > bluestore_min_alloc_size boundary: > > assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0); > > assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0); > > > > As a result compression engine gains a responsibility to properly > > handle cases when some blocks use the same bluestore allocation unit > > but aren't fully adjacent (see below for details). > > The rest of this is mostly about overwrite policy, which should be pretty > general regardless of where we implement. I think the first and more > important thing to sort out though is where we do it. > > My current thinking is that we do something like: > > - add a bluestore_extent_t flag for FLAG_COMPRESSED > - add uncompressed_length and compression_alg fields > (- add a checksum field we are at it, I guess) > > - in _do_write, when we are writing a new extent, we need to compress it in > memory (up to the max compression block), and feed that size into > _do_allocate so we know how much disk space to allocate. 
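A very loose sketch of that size decision (plain std::string standing in for bufferlist; this does not reflect the real _do_write/_do_allocate signatures):

  #include <cstdint>
  #include <string>

  // Keep the compressed copy only if it is actually smaller, and return
  // the number of bytes to request from the allocator.
  uint64_t disk_space_needed(const std::string& raw,
                             const std::string& compressed) {
    bool keep_compressed = compressed.size() < raw.size();
    return keep_compressed ? compressed.size() : raw.size();
    // when keep_compressed is true, the extent would also record
    // FLAG_COMPRESSED, compression_alg, and uncompressed_length = raw.size()
  }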
this is probably > reasonably tricky to do, and handles just the simplest case (writing a new > extent to a new object, or appending to an existing one, and writing the new > data compressed). The current _do_allocate interface and responsibilities > will probably need to change quite a bit here. > > - define the general (partial) overwrite strategy. I would like for this to be > part of the WAL strategy. That is, we do the read/modify/write as deferred > work for the partial regions that overlap existing extents. > Then _do_wal_op would read the compressed extent, merge it with the > new piece, and write out the new (compressed) extents. The problem is > that right now the WAL path *just* does IO--it doesn't do any kv metadata > updates, which would be required here to do the final allocation (we won't > know how big the resulting extent will be until we decompress the old thing, > merge it with the new thing, and recompress). > > But, we need to address this anyway to support CRCs (where we will similarly > do a read/modify/write, calculate a new checksum, and need to update the > onode). I think the answer here is just that the _do_wal_op updates some > in-memory-state attached to the wal operation that gets applied when the > wal entry is cleaned up in _kv_sync_thread (wal_cleaning list). > > Calling into the allocator in the WAL path will be more complicated than just > updating the checksum in the onode, but I think it's doable. > > The alternative is that we either > > a) do the read side of the overwrite in the first phase of the op, before we > commit it. That will mean a higher commit latency and will slow down the > pipeline, but would avoid the double-write of the overlap/wal regions. Or, > > b) we could just leave the overwritten extents alone and structure the > block_map so that they are occluded. This will 'leak' space for some write > patterns, but that might be okay given that we can come back later and clean > it up, or refine our strategy to be smarter. I'm a strong believer in (b). > > What do you think? > > It would be nice to choose a simpler strategy for the first pass that handles a > subset of write patterns (i.e., sequential writes, possibly > unaligned) that is still a step in the direction of the more robust strategy we > expect to implement after that. > > sage > > > > > Overwrite request handling can be done following way: > > 0) Write request <logical offset(OFFS), Data Len(LEN)> is received by > > the compression engine. > > 1) Engine inspects the Block Map and checks if new block <OFFS, LEN> > > intersects with existing ones. > > Following cases for existing blocks are possible - > > a) Detached > > b) Adjacent > > c) Partially overwritten > > d) Completely overwritten > > 2) Engine retrieves(and decompresses if needed) content for existing > > blocks from case c) and, optionally, b). Blocks for case b) are > > handled if compression engine should provide block merge algorithm, > > i.e. it has to merge adjacent blocks to decrease block fragmentation. > > There are two options here with regard to what to be considered as > > adjacent. The first is a regular notion (henceforth - fully adjacent) > > - blocks are next to each other and there are no holes between them. > > The second is that blocks are adjacent when they are either fully > > adjacent or reside in the same (or probably neighboring) bluestore > > allocation unit(s). 
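Roughly, the two notions could be checked like this (hypothetical helpers; min_alloc_size is the bluestore allocation unit, 64K by default; the second notion is then fully_adjacent(...) || share_alloc_unit(...)):

  #include <cstdint>

  bool fully_adjacent(uint64_t off1, uint64_t len1, uint64_t off2) {
    return off1 + len1 == off2;  // blocks touch, no hole in between
  }

  bool share_alloc_unit(uint64_t off1, uint64_t len1, uint64_t off2,
                        uint64_t min_alloc_size) {
    // end of the first block and start of the second fall into the same AU
    return (off1 + len1 - 1) / min_alloc_size == off2 / min_alloc_size;
  }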
It looks like the second notion provides better > > space reuse and simplifies handling the case when blocks reside at the > > same allocation unit but are not fully adjacent. We can treat that the > > same manner as fully adjacent blocks. The cost is potential increase in > amount of data overwrite request handling has to process (i.e. > > read/decompress/compress/write). But that's the general caveat if > > block merge is used. > > 3) Retrieved block contents and the new one are merged. Resulting > > block might have different logical offset/len pair: <OFFS_MERGED, > > LEN_MERGED>. If resulting block is longer than BLOCK_MAX_LEN it's > > broken up into smaller blocks that are processed independently in the > same manner. > > 4) Generated block is compressed. Corresponding tuple <OFFS_MERGED, > > LEN_MERGED, LEN_COMPRESSED, ALG>. > > 5) Block map is updated: merged/overwritten blocks are removed - the > > generated ones are appended. > > 6) If generated block ( OFFS_MERGED, LEN_MERGED ) still shares > > bluestore allocation units with some existing blocks, e.g. if block > > merge algorithm isn't used(implemented), then overlapping regions to > > be excluded from the release procedure performed at step 8: > > if( shares_head ) > > RELEASE_EXTENT_OFFSET = ROUND_UP_TO( OFFS_MERGED, > > min_alloc_unit_size ) else > > RELEASE_EXTENT_OFFSET = ROUND_DOWN_TO( OFFS_MERGED, > > min_alloc_unit_size ) if( shares_tail ) > > RELEASE_EXTENT_OFFSET_END = ROUND_DOWN_TO( OFFS_MERGED + > LEN_MERGED, > > min_alloc_unit_size ) else > > RELEASE_EXTENT_OFFSET_END = ROUND_UP_TO( OFFS_MERGED + > LEN_MERGED, > > min_alloc_unit_size ) > > > > Thus we might squeeze the extent to release if other blocks use that > space. > > > > 7) If compressed block ( OFFS_MERGED, LEN_COMPRESSED ) still shares > > bluestore allocation units with some existing blocks, e.g. if block > > merge algorithm isn't used(implemented), then overlapping regions > > (head and tail) to be written to bluestore using regular bluestore > > writes. HEAD_LEN & HEAD_TAIL bytes are written correspondingly. > > 8) The rest part of the new block should be written using above > > mentioned > > WRITE+RELEASE request. Following parameters to be used for the request: > > DATA2WRITE_LOFFSET = OFFS_MERGED + HEAD_LEN DATA2WRITE_SIZE = > > LEN_COMPRESSED - HEAD_LEN - TAIL_LEN RELEASED_EXTENT_LOFFSET = > > RELEASED_EXTENT_OFFSET RELEASED_EXTENT_SIZE = > > RELEASE_EXTENT_OFFSET_END - RELEASE_EXTENT_OFFSET > > where: > > #define ROUND_DOWN_TO( a, size ) (a - a % size) > > > > This way we release the extent corresponding to the newly generated > > block ( except partially overlapping tail and head parts if any ) and > > write compressed block to the store that allocates a new extent. > > > > Below is a sample of mappings transform. All values are in Kb. > > 1) Block merge is used. > > > > Original Block Map > > { > > 0, 50 -> 0, 50 , No compress > > 140, 50 -> 140, 50, No Compress ( will merge this block, partially > > overwritten ) > > 255, 100 -> 255, 100, No Compress ( will merge this block, implicitly > > adjacent ) > > 512, 64 -> 512, 64, No Compress > > } > > > > => Write ( 150, 100 ) > > > > New Block Map > > { > > 0, 50 -> 0, 50 Kb, No compress > > 140, 215 -> 140, 100, zlib ( 215 Kb compressed into 100 Kb ) > > 512, 64 -> 512, 64, No Compress > > } > > > > Operations on the bluestore: > > READ( 140, 50) > > READ( 255, 100) > > WRITE-RELEASE( <140, 100>, <128, 256> ) > > > > 2) No block merge. 
> > > > Original Block Map > > { > > 0, 50 -> 0, 50 , No compress > > 140, 50 -> 140, 50, No Compress > > 255, 100 -> 255, 100, No Compress > > 512, 64 -> 512, 64, No Compress > > } > > > > => Write ( 150, 100 ) > > > > New Block Map > > { > > 0, 50 -> 0, 50 Kb, No compress > > 140, 110 -> 140, 110, No Compress > > 255, 100 -> 255, 100, No Compress > > 512, 64 -> 512, 64, No Compress > > } > > > > Operations on the bluestore: > > READ(140, 50) > > WRITE-RELEASE( <140, 52>, <128, 64> ) > > WRITE( <192, 58> ) > > > > > > Any comments/suggestions are highly appreciated. > > > > Kind regards, > > Igor. > > > > > > On 24.02.2016 21:43, Allen Samuels wrote: > > > w.r.t. (1) Except for "permanent" -- essentially yes. My central > > > point is that by having the full tuple you decouple the actual > > > algorithm from its persistent expression. In the example that you > > > give, you have one representation of the final result. There are > > > other possible final results (i.e., by RMWing some of the smaller chunks -- > as you originally proposed). > > > You even have the option of doing the RMWing/compaction in a > > > background low-priority process (part of the scrub?). > > > > > > You may be right about the effect of (2), but maybe not. > > > > > > I agree that more discussion about checksums is useful. It's > > > essential that BlueStore properly augment device-level integrity checks. > > > > > > Allen Samuels > > > Software Architect, Emerging Storage Solutions > > > > > > 2880 Junction Avenue, Milpitas, CA 95134 > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > > > > > > > > > -----Original Message----- > > > From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx] > > > Sent: Wednesday, February 24, 2016 10:19 AM > > > To: Sage Weil <sage@xxxxxxxxxxxx>; Allen Samuels > > > <Allen.Samuels@xxxxxxxxxxx> > > > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > > > Subject: Re: Adding compression support for bluestore. > > > > > > Allen, Sage > > > > > > thanks a lot for interesting input. > > > > > > May I have some clarification and highlight some caveats though? > > > > > > 1) Allen, are you suggesting to have permanent logical blocks layout > > > established after the initial writing? > > > Please find what I mean at the example below ( logical offset/size > > > are provided only for the sake of simplicity). > > > Imagine client has performed multiple writes that created following > > > map <logical offset, logical size>: > > > <0, 100> > > > <100, 50> > > > <150, 70> > > > <230, 70> > > > and an overwrite request <120,70> is coming. > > > The question is if resulting mapping to be the same or should be > > > updated as > > > below: > > > <0,100> > > > <100, 20> //updated extent > > > <120, 100> //new extent > > > <220, 10> //updated extent > > > <230, 70> > > > > > > 2) In fact "Application units" that write requests delivers to > > > BlueStore are pretty( or even completely) distorted by Ceph > > > internals (Caching infra, striping, EC). Thus there is a chance we > > > are dealing with a broken picture and suggested modification brings > no/minor benefit. > > > > > > 3) Sage - could you please elaborate the per-extent checksum use > > > case - how are we planing to use that? > > > > > > Thanks, > > > Igor. > > > > > > On 22.02.2016 15:25, Sage Weil wrote: > > > > On Fri, 19 Feb 2016, Allen Samuels wrote: > > > > > This is a good start to an architecture for performing compression. 
> > > > > > > > > > I am concerned that it's a bit too simple at the expense of > > > > > potentially significant performance. In particular, I believe > > > > > it's often inefficient to force compression to be performed in > > > > > block sizes and alignments that may not match the application's > usage. > > > > > > > > > > I think that extent mapping should be enhanced to include the full > > > > > tuple: <Logical offset, Logical Size, Physical offset, Physical size, > > > > > compression algo> > > > > I agree. > > > > > > > > > With the full tuple, you can compress data in the natural units > > > > > of the application (which is most likely the size of the write > > > > > operation that you received) and on its natural alignment (which > > > > > will eliminate a lot of expensive-and-hard-to-handle partial > > > > > overwrites) rather than the proposal of a fixed size compression block > on fixed boundaries. > > > > > > > > > > Using the application's natural block size for performing > > > > > compression may allow you a greater choice of compression > > > > > algorithms. For example, if you're doing 1MB object writes, then > > > > > you might want to be using bzip-ish algorithms that have large > > > > > compression windows rather than the 32-K limited zlib algorithm > > > > > or the 64-k limited snappy. You wouldn't want to do that if all > > > > > compression was limited to a fixed 64K window. > > > > > > > > > > With this extra information a number of interesting algorithm > > > > > choices become available. For example, in the partial-overwrite > > > > > case you can just delay recovering the partially overwritten > > > > > data by having an extent that overlaps a previous extent. > > > > Yep. > > > > > > > > > One objection to the increased extent tuple is that amount of > > > > > space/memory it would consume. This need not be the case, the > > > > > existing BlueStore architecture stores the extent map in a > > > > > serialized format different from the in-memory format. It would > > > > > be relatively simple to create multiple serialization formats > > > > > that optimize for the typical cases of when the logical space is > > > > > contiguous (i.e., logical offset is previous logical offset + > > > > > logical size) and when there's no compression (logical size == > > > > > physical size). Only the deserialized in-memory format of the extent > table has the fully populated tuples. > > > > > In fact this is a desirable optimization for the current > > > > > bluestore regardless of whether this compression proposal is adopted > or not. > > > > Yeah. > > > > > > > > The other bit we should probably think about here is how to store > > > > checksums. In the compressed extent case, a simple approach would > > > > be to just add the checksum (either compressed, uncompressed, or > > > > both) to the extent tuple, since the extent will generally need to > > > > be read in its entirety anyway. For uncompressed extents, that's > > > > not the case, and having an independent map of checksums over > > > > smaller block sizes makes sense, but that doesn't play well with > > > > the variable alignment/extent size approach. I kind of sucks to > > > > have multiple formats here, but if we can hide it behind the > > > > in-memory representation and/or interface (so that, e.g., each > > > > extent has a checksum block size and a vector of checksums) we can > > > > optimize the encoding however we like without affecting other code. 
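For instance (names invented, just to make that in-memory shape concrete):

  #include <cstdint>
  #include <vector>

  struct extent_csum_t {
    uint32_t csum_block_size = 0;  // bytes covered by each checksum entry;
                                   // for a compressed extent this could simply
                                   // be the whole (compressed) extent length
    std::vector<uint32_t> csums;   // one checksum per csum_block_size chunk
  };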
> > > > > > > > sage > > > > > > > > > Allen Samuels > > > > > Software Architect, Fellow, Systems and Software Solutions > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134 > > > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > > > > > > > > > > -----Original Message----- > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Igor > > > > > Fedotov > > > > > Sent: Tuesday, February 16, 2016 4:11 PM > > > > > To: Haomai Wang <haomaiwang@xxxxxxxxx> > > > > > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > > > > > Subject: Re: Adding compression support for bluestore. > > > > > > > > > > Hi Haomai, > > > > > Thanks for your comments. > > > > > Please find my response inline. > > > > > > > > > > On 2/16/2016 5:06 AM, Haomai Wang wrote: > > > > > > On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov > > > > > > <ifedotov@xxxxxxxxxxxx> > > > > > > wrote: > > > > > > > Hi guys, > > > > > > > Here is my preliminary overview of how one can add compression > > > > > > > support allowing random reads/writes for bluestore. > > > > > > > > > > > > > > Preface: > > > > > > > Bluestore keeps object content using a set of dispersed > > > > > > > extents aligned by 64K (configurable param). It also permits > > > > > > > gaps in object content i.e. it prevents storage space > > > > > > > allocation for object data regions unaffected by user writes. > > > > > > > A sort of following mapping is used for tracking stored > > > > > > > object content disposition (actual current implementation > > > > > > > may differ but representation below seems to be sufficient for our purposes): > > > > > > > Extent Map > > > > > > > { > > > > > > > < logical offset 0 -> extent 0 'physical' offset, extent 0 > > > > > > > size > ... > > > > > > > < logical offset N -> extent N 'physical' offset, extent N > > > > > > > size > } > > > > > > > > > > > > > > > > > > > > > Compression support approach: > > > > > > > The aim is to provide generic compression support allowing > > > > > > > random object read/write. > > > > > > > To do that compression engine to be placed (logically - > > > > > > > actual implementation may be discussed later) on top of > > > > > > > bluestore to "intercept" > > > > > > > read-write requests and modify them as needed. > > > > > > > The major idea is to split object content into fixed size > > > > > > > logical blocks ( MAX_BLOCK_SIZE, e.g. 1Mb). Blocks are > > > > > > > compressed independently. Due to compression each block can > > > > > > > potentially occupy smaller store space compared to its > > > > > > > original size. Each block is addressed using original data offset ( AKA 'logical offset' above ). > > > > > > > After compression is applied each block is written using the > > > > > > > existing bluestore infra. In fact single original write > > > > > > > request may affect multiple blocks thus it transforms into > > > > > > > multiple sub-write requests. > > > > > > > Block logical offset, compressed block data and compressed > > > > > > > data length are the parameters for injected sub-write requests. > > > > > > > As a result stored object content: > > > > > > > a) Has gaps > > > > > > > b) Uses less space if compression was beneficial enough. > > > > > > > > > > > > > > Overwrite request handling is pretty simple. Write request > > > > > > > data is split into fully and partially overlapping > > > > > > > blocks.
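As an aside, a minimal sketch of that split -- a hypothetical helper that cuts an incoming write into MAX_BLOCK_SIZE-aligned pieces and marks which pieces cover a whole block (and so need no read-modify-write):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct sub_write_t {
    uint64_t block_off;   // start of the logical block this piece falls into
    uint64_t off, len;    // the piece of the incoming write
    bool covers_block;    // true -> whole block overwritten, no read needed
  };

  std::vector<sub_write_t> split_write(uint64_t off, uint64_t len,
                                       uint64_t block_size) {
    std::vector<sub_write_t> out;
    const uint64_t end = off + len;
    for (uint64_t b = off - off % block_size; b < end; b += block_size) {
      uint64_t s = std::max(off, b);
      uint64_t e = std::min(end, b + block_size);
      out.push_back({b, s, e - s, s == b && e == b + block_size});
    }
    return out;
  }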
Fully overlapping blocks are compressed and written > > > > > > > to the store (given the extended write functionality > > > > > > > described below). For partially overlapping blocks ( no > > > > > > > more than 2 of them > > > > > > > - head and tail in general case) we need to retrieve > > > > > > > already stored blocks, decompress them, merge the existing > > > > > > > and received data into a block, compress it and save to the store using the new size. > > > > > > > The tricky thing for any written block is that it can be > > > > > > > both longer and shorter than the previously stored one. However > > > > > > > it always has an upper limit > > > > > > > (MAX_BLOCK_SIZE) since we can omit compression and use the > > > > > > > original block if compression ratio is poor. Thus the > > > > > > > corresponding bluestore extent for this block is limited too > > > > > > > and existing bluestore mapping doesn't > > > > > > > suffer: offsets are permanent and are equal to the original > > > > > > > ones provided by the caller. > > > > > > > The only extension required for bluestore interface is to > > > > > > > provide an ability to remove existing extents( specified by > > > > > > > logical offset, size). In other words we need write request > > > > > > > semantics extension ( rather by introducing an additional extended write method). > > > > > > > Currently > > > > > > > overwriting request can either increase allocated space or > > > > > > > leave it unaffected only. And it can have arbitrary > > > > > > > offset,size parameters pair. Extended one should be able to > > > > > > > squeeze store space ( e.g. by removing existing extents for > > > > > > > a block and allocating reduced set of new ones) as well. And > > > > > > > extended write should be applied to a specific block only, > > > > > > > i.e. logical offset to be aligned with block start offset > > > > > > > and size limited to MAX_BLOCK_SIZE. It seems this is pretty > > > > > > > simple to add - most of the functionality for extent > > > > > > > append/removal is already present. > > > > > > > > > > > > > > To provide reading and (over)writing the compression engine > > > > > > > needs to track an additional block mapping: > > > > > > > Block Map > > > > > > > { > > > > > > > < logical offset 0 -> compression method, compressed block 0 > > > > > > > size > ... > > > > > > > < logical offset N -> compression method, compressed block N > > > > > > > size > } Please note that despite the similarity with the > > > > > > > original bluestore extent map the difference is in record > > > > > > > granularity: 1Mb vs 64Kb. > > > > > > > Thus > > > > > > > each block mapping record might have multiple corresponding > > > > > > > extent mapping records. > > > > > > > > > > > > > > Below is a sample of mappings transform for a pair of overwrites.
> > > > > > > 1) Original mapping ( 3 Mb were written before, compress > > > > > > > ratio 2 for each > > > > > > > block) > > > > > > > Block Map > > > > > > > { > > > > > > > 0 -> zlib, 512Kb > > > > > > > 1Mb -> zlib, 512Kb > > > > > > > 2Mb -> zlib, 512Kb > > > > > > > } > > > > > > > Extent Map > > > > > > > { > > > > > > > 0 -> 0, 512Kb > > > > > > > 1Mb -> 512Kb, 512Kb > > > > > > > 2Mb -> 1Mb, 512Kb > > > > > > > } > > > > > > > 1.5Mb allocated ( [0, 1.5 Mb] range ) > > > > > > > > > > > > > > 1) Result mapping ( after overwriting 1Mb data at 512 Kb > > > > > > > offset, compress ratio 1 for both affected blocks) Block Map { > > > > > > > 0 -> none, 1Mb > > > > > > > 1Mb -> none, 1Mb > > > > > > > 2Mb -> zlib, 512Kb > > > > > > > } > > > > > > > Extent Map > > > > > > > { > > > > > > > 0 -> 1.5Mb, 1Mb > > > > > > > 1Mb -> 2.5Mb, 1Mb > > > > > > > 2Mb -> 1Mb, 512Kb > > > > > > > } > > > > > > > 2.5Mb allocated ( [1Mb, 3.5 Mb] range ) > > > > > > > > > > > > > > 2) Result mapping ( after (over)writing 3Mb data at 1Mb > > > > > > > offset, compress ratio 4 for all affected blocks) Block Map { > > > > > > > 0 -> none, 1Mb > > > > > > > 1Mb -> zlib, 256Kb > > > > > > > 2Mb -> zlib, 256Kb > > > > > > > 3Mb -> zlib, 256Kb > > > > > > > } > > > > > > > Extent Map > > > > > > > { > > > > > > > 0 -> 1.5Mb, 1Mb > > > > > > > 1Mb -> 0Mb, 256Kb > > > > > > > 2Mb -> 0.25Mb, 256Kb > > > > > > > 3Mb -> 0.5Mb, 256Kb > > > > > > > } > > > > > > > 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5 Mb, 2.5 Mb] ranges ) > > > > > > > > > > > > > Thanks, Igor! > > > > > > > > > > > > Maybe I'm missing something, is it compressed inline not offline? > > > > > That's about inline compression. > > > > > > If so, I guess we need to provide more flexible controls > > > > > > to the upper layer, like an explicit compression flag or compression unit. > > > > > Yes I agree. We need a sort of control for compression - on per > > > > > object or per pool basis... > > > > > But at the overview above I was more concerned about the algorithmic > > > > > aspect i.e. how to implement random read/write handling for compressed objects. > > > > > Compression management from the user side can be considered a bit later. > > > > > > > Any comments/suggestions are highly appreciated. > > > > > > > > > > > > > > Kind regards, > > > > > > > Igor. > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > > > ceph-devel" > > > > > > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More > > > > > > > majordomo info at > > > > > > > http://vger.kernel.org/majordomo-info.html > > > > > Thanks, > > > > > Igor > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe > > > > > ceph-devel" in the body of a message to > > > > > majordomo@xxxxxxxxxxxxxxx More majordomo info at > > > > > http://vger.kernel.org/majordomo-info.html > > > > > PLEASE NOTE: The information contained in this electronic mail > > > > > message is intended only for the use of the designated recipient(s) named above. > > > > > If the reader of this message is not the intended recipient, you > > > > > are hereby notified that you have received this message in error > > > > > and that any review, dissemination, distribution, or copying of > > > > > this message is strictly prohibited.
If you have received this > > > > > communication in error, please notify the sender by telephone or > > > > > e-mail (as shown above) immediately and destroy any and all > > > > > copies of this message in your possession (whether hard copies or electronically stored copies). > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo > > info at http://vger.kernel.org/majordomo-info.html