On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@xxxxxxxxxxxx> wrote:
> Hi guys,
> Here is my preliminary overview of how one can add compression support
> allowing random reads/writes for bluestore.
>
> Preface:
> Bluestore keeps object content using a set of dispersed extents aligned by
> 64K (configurable param). It also permits gaps in object content, i.e. it
> avoids allocating storage space for object data regions unaffected by user
> writes.
> A mapping of roughly the following form is used for tracking stored object
> content disposition (the actual current implementation may differ but the
> representation below seems to be sufficient for our purposes):
> Extent Map
> {
> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
> ...
> < logical offset N -> extent N 'physical' offset, extent N size >
> }
>
>
> Compression support approach:
> The aim is to provide generic compression support allowing random object
> read/write.
> To do that, a compression engine is placed (logically - the actual
> implementation may be discussed later) on top of bluestore to "intercept"
> read-write requests and modify them as needed.
> The major idea is to split object content into fixed size logical blocks
> (MAX_BLOCK_SIZE, e.g. 1Mb). Blocks are compressed independently. Due to
> compression each block can potentially occupy less store space compared
> to its original size. Each block is addressed using the original data offset
> (AKA 'logical offset' above). After compression is applied each block is
> written using the existing bluestore infra. In fact a single original write
> request may affect multiple blocks, thus it transforms into multiple
> sub-write requests. Block logical offset, compressed block data and
> compressed data length are the parameters for injected sub-write requests.
> As a result stored object content:
> a) Has gaps
> b) Uses less space if compression was beneficial enough.
>
> Overwrite request handling is pretty simple.
> Write request data is split
> into fully and partially overlapping blocks. Fully overlapping blocks are
> compressed and written to the store (given the extended write functionality
> described below). For partially overlapping blocks (no more than 2 of them,
> head and tail, in the general case) we need to retrieve the already stored
> blocks, decompress them, merge the existing and received data into a block,
> compress it and save it to the store using the new size.
> The tricky thing for any written block is that it can be either longer or
> shorter than the previously stored one. However it always has an upper limit
> (MAX_BLOCK_SIZE) since we can omit compression and use the original block if
> the compression ratio is poor. Thus the corresponding bluestore extent for
> this block is limited too and the existing bluestore mapping doesn't suffer:
> offsets are permanent and are equal to the original ones provided by the
> caller.
> The only extension required for the bluestore interface is the ability
> to remove existing extents (specified by logical offset, size). In other
> words we need to extend write request semantics (most likely by introducing
> an additional extended write method). Currently an overwrite request can
> only increase allocated space or leave it unaffected, and it can have an
> arbitrary offset,size parameter pair. The extended one should also be able
> to shrink store space (e.g. by removing the existing extents for a block and
> allocating a reduced set of new ones). An extended write should be
> applied to a specific block only, i.e. the logical offset is aligned with
> the block start offset and the size is limited to MAX_BLOCK_SIZE. It seems
> this is pretty simple to add - most of the functionality for extent
> append/removal is already present.
>
> To support reads and (over)writes, the compression engine needs to track
> an additional block mapping:
> Block Map
> {
> < logical offset 0 -> compression method, compressed block 0 size >
> ...
> < logical offset N -> compression method, compressed block N size >
> }
> Please note that despite the similarity with the original bluestore extent
> map the difference is in record granularity: 1Mb vs 64Kb. Thus each block
> mapping record might have multiple corresponding extent mapping records.
>
> Below is a sample of the mapping transformations for a pair of overwrites.
> 0) Original mapping ( 3 Mb were written before, compress ratio 2 for each
> block)
> Block Map
> {
> 0 -> zlib, 512Kb
> 1Mb -> zlib, 512Kb
> 2Mb -> zlib, 512Kb
> }
> Extent Map
> {
> 0 -> 0, 512Kb
> 1Mb -> 512Kb, 512Kb
> 2Mb -> 1Mb, 512Kb
> }
> 1.5Mb allocated ( [0, 1.5Mb] range )
>
> 1) Result mapping ( after overwriting 1Mb data at 512 Kb offset, compress
> ratio 1 for both affected blocks)
> Block Map
> {
> 0 -> none, 1Mb
> 1Mb -> none, 1Mb
> 2Mb -> zlib, 512Kb
> }
> Extent Map
> {
> 0 -> 1.5Mb, 1Mb
> 1Mb -> 2.5Mb, 1Mb
> 2Mb -> 1Mb, 512Kb
> }
> 2.5Mb allocated ( [1Mb, 3.5Mb] range )
>
> 2) Result mapping ( after (over)writing 3Mb data at 1Mb offset, compress
> ratio 4 for all affected blocks)
> Block Map
> {
> 0 -> none, 1Mb
> 1Mb -> zlib, 256Kb
> 2Mb -> zlib, 256Kb
> 3Mb -> zlib, 256Kb
> }
> Extent Map
> {
> 0 -> 1.5Mb, 1Mb
> 1Mb -> 0Mb, 256Kb
> 2Mb -> 0.25Mb, 256Kb
> 3Mb -> 0.5Mb, 256Kb
> }
> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
>

Thanks, Igor! Maybe I'm missing something, but is the data compressed inline
rather than offline? If so, I guess we need to expose more flexible controls
to the upper layers, such as an explicit compression flag or a compression
unit.

>
> Any comments/suggestions are highly appreciated.
>
> Kind regards,
> Igor.
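
For illustration, the block splitting and overwrite handling described in the
quoted proposal could be sketched in Python roughly as follows. This is only a
toy in-memory model under stated assumptions, not bluestore code:
`split_write`, `CompressedObject` and the dict standing in for the store are
hypothetical names, gaps are assumed to read back as zeros, and zlib is used
only because the sample mappings mention it.

```python
import zlib

MAX_BLOCK_SIZE = 1 << 20  # 1 MiB logical block, as in the proposal's example


def split_write(offset, length, block_size=MAX_BLOCK_SIZE):
    """Yield (block_offset, start, end) for every block a request touches.

    start/end are byte positions inside the block; a block is fully
    covered when start == 0 and end == block_size.
    """
    pos, stop = offset, offset + length
    while pos < stop:
        block_offset = pos - pos % block_size
        end = min(stop, block_offset + block_size)
        yield block_offset, pos - block_offset, end - block_offset
        pos = end


class CompressedObject:
    """Hypothetical stand-in for one object; not the bluestore interface."""

    def __init__(self, block_size=MAX_BLOCK_SIZE):
        self.block_size = block_size
        # Block map: logical block offset -> (method, stored bytes).
        self.blocks = {}

    def _load(self, block_offset):
        """Return the decompressed content of a block (empty for a gap)."""
        method, payload = self.blocks.get(block_offset, ("none", b""))
        return zlib.decompress(payload) if method == "zlib" else payload

    def _store(self, block_offset, raw):
        """Compress a block, keeping the raw bytes if compression is poor."""
        packed = zlib.compress(raw)
        if len(packed) < len(raw):
            self.blocks[block_offset] = ("zlib", packed)
        else:
            self.blocks[block_offset] = ("none", raw)

    def write(self, offset, data):
        consumed = 0
        spans = split_write(offset, len(data), self.block_size)
        for block_offset, start, end in spans:
            chunk = data[consumed:consumed + (end - start)]
            consumed += len(chunk)
            if start == 0 and end == self.block_size:
                raw = chunk  # full overwrite: nothing to read back
            else:
                # Partial overwrite: read, decompress, merge, recompress.
                buf = bytearray(self._load(block_offset).ljust(end, b"\0"))
                buf[start:end] = chunk
                raw = bytes(buf)
            self._store(block_offset, raw)

    def read(self, offset, length):
        out = bytearray()
        spans = split_write(offset, length, self.block_size)
        for block_offset, start, end in spans:
            raw = self._load(block_offset).ljust(end, b"\0")
            out += raw[start:end]
        return bytes(out)
```

Falling back to the raw bytes in `_store` is what keeps every stored block at
or below the block size, which is the upper limit the proposal relies on. The
sketch leaves out the extent map entirely; replacing a block's dict entry
stands in for the "remove existing extents, allocate new ones" extended write.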
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Best Regards,

Wheat