RE: Adding compression support for bluestore.

Allen Samuels <Allen.Samuels@xxxxxxxxxxx> · Tue, 29 Mar 2016 20:45:14 +0000

> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Tuesday, March 29, 2016 1:20 PM
> To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-
> devel@xxxxxxxxxxxxxxx>
> Subject: Re: Adding compression support for bluestore.
> 
> On Thu, 24 Mar 2016, Igor Fedotov wrote:
> > Sage, Allen et. al.
> >
> > Please find some follow-up on our discussion below.
> >
> > Your past and future comments are highly appreciated.
> >
> > WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES
> OVERVIEW.
> >
> > Used terminology:
> > Extent - basic allocation unit. Variable in size, maximum size is
> > limited by lblock length (see below), alignment: min_alloc_unit param
> > (configurable, expected range: 4-64 Kb .
> > Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
> > Alignment unspecified. Max size limited by max_logical_unit param
> > (configurable, expected range: 128-512 Kb)
> >
> > Compression to be applied on per-extent basis.
> > Multiple lblocks can refer specific region within a single extent.
> 
> This (and the what's below) sound right to me.  My main concern is around
> naming.  I don't much like "extent" vs "lblock" (which is which?).  Maybe
> extent and extent_ref?
> 
> Also, I don't think we need the size limits you mention above.  When
> compression is enabled, we'll limit the size of the disk extents by policy, but
> the structures themselves needn't enforce that.  Similarly, I don't think the
> lblocks (extent refs?  logical extents?) need a max size either.
> 
> Anyway, right now we have bluestore_extent_t.  I'd suggest maybe
> 
> 	bluestore_pextent_t and bluestore_lextent_t or
> 	bluestore_extent_t and bluestore_extent_ref_t
> 
> ?

I prefer the lextent and pextent variant. 

Can't we move all of these into a namespace, i.e., bluestore::lextent_t, bluestore::pextent_t, bluestore::onode_t, bluestore::bdev_label_t, etc.. That way the code within Bluestore itself doesn't have to keep redundantly repeating itself with super-long type names...

> 
> > POTENTIAL COMPRESSION APPLICATION POLICIES
> >
> > 1) Read/Merge/Write at initial commit phase. (RMW) General approach:
> > New write request triggers partially overlapped lblock(s)
> > reading/decompression followed by their merge into a set of new
> > lblocks. Then compression is (optionally) applied. Resulting lblocks
> > overwrite existing ones.
> > For non-overlapping/fully overlapped lblocks read/merge steps are
> > simply bypassed.
> > - Read, merge and final compression take place prior to write commit
> > ack that can impact write operation latency.
> >
> > 2) Deferred RMW for partial overlaps. (DRMW) General approach:
> > Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
> > For partially overlapped lblocks one should use Write-Ahead Log to
> > defer RMW procedure until write commit ack return.
> > - Write operation latency can still be high in some cases (
> > non-overlapped/fully overlapped writes).
> > - WAL can grow significantly.
> >
> > 3) Writing new lblocks over new extents. (LBlock Bedding?) General
> > approach:
> > Write request creates new lblock(s) that use freshly allocated extents.
> > Overlapped regions within existing lblocks are occluded.
> > Previously existing extents are preserved for some time (or while
> > being used) depending on the cleanup policy.
> > Compression to be performed before write commit ack return.
> > - Write operation latency is still affected by the compression.
> > - Store space usage is usually higher.
> >
> > 4) Background compression (BCOMP)
> > General approach:
> > Write request to be handled using any of the above policies (or their
> > combination) with no compression applied. Stored extents are
> > compressed by some background process independently from the client
> write flow.
> > Merging new uncompressed lblock with already compressed one can be
> > tricky here.
> > + Write operation latency isn't affected by the compression.
> > - Double disk write occurs
> >
> > To provide better user experience above-mentioned policies can be used
> > together depending on the write pattern.
> >
> > INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
> >
> > To track object content we need to introduce following 2 collections:
> >
> > 1) LBlock map:
> > That's a logical offset mapping to a region within an extent:
> > LOFFS -> {
> >   EXTENT_REF       - reference to an underlying extent, e.g. pointer for
> > in-memory representation or extent ID for "on-disk" one
> >   X_OFFS, X_LEN,   - region descriptor within an extent: relative offset and
> > region length
> >   LFLAGS           - some associated flags for the lblock. Any usage???
> > }
> >
> > 2) Extent collection:
> > Each entry describes an allocation unit within storage space.
> > Compression to be applied on per-extent basis thus extent's logical
> > volume can be greater than it's physical size.
> >
> > {
> >   P_OFFS            - physical block address
> >   SIZE              - actual stored data length
> >   EFLAGS            - flags associated with the extent
> >   COMPRESSION_ALG   - An applied compression algorithm id if any
> >   CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
> >   REFCOUNT          - Number of references to this entry
> > }
> 
> Yep (modulo naming).
> 
> > The possible container for this collection can be a mapping: id ->
> > extent. It looks like such mapping is required during on-disk to
> > in-memory representation transform as smart pointer seems to be enough
> for in-memory use.
> 
> Given the structures are small I'm not sure smart pointers are worth it..
> Maybe just a simple vector (or maybe flat_map) for the extents?  Lookup will
> be fast.
> 

Smart pointers don't work well in the code. The deallocation of the pextent is more than just freeing the memory when the lextent reference count goes to zero -- It also includes the updating of a transaction to mutate the KV store to match the deallocation. Thus the destructor needs a reference to the KeyValueDB::transaction, which isn't really clean and easy to arrange (you'll have to hide it in the object, or some other ugly hack). From a coding perspective, I think you'll just have manually managed reference counts with explicit deallocation calls that pass in the right parameters.

> > SAMPLE MAP TRANSFORMATION FOR LBLOCK BEDDING POLICY ( all values
> in Kb
> > )
> >
> > Config parameters:
> > min_alloc_unit = 4
> > max_logical_unit = 64
> >
> > --------------------------------------------------------
> > ****** Step 0 :
> > ->Write(0, 50), no compression
> > ->Write(100, 60), no compression
> >
> > Resulting maps:
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:   {EO1, 0, 50}
> > 100: {EO2, 0, 60}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> >
> >
> > Where POFFS_1, POFFS_2 - physical addresses for allocated extents.
> >
> > ****** Step 1
> > ->Write(25, 100), compressed
> >
> > Resulting maps:
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO1, 0, 25}
> > 25:    {EO3, 0, 64}   //compressed into 20K
> > 79:    {EO4, 0, 36}   //compressed into 15K
> > 125:   {EO2, 25, 35}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > EO4: { POFFS_4, 15, ZLIB, 1}   //totally allocated 16 Kb
> >
> > As one can see new entries at offset 25 & 79  have appeared and
> > previous entries have been altered (including the map key (100->125)
> > for the last entry).
> > No physical extents reallocation took place though - just new ones
> > (EO3 & EO4) have been allocated.
> > Please note that client accessible data for block EO2 are actually
> > stored at
> > P_OFFS_2 + X_OFF and have 35K only despite the fact that extent has 60K
> total.
> > The same for block EO1 - valid data length is 25K only.
> > Extent EO3 actually stores 20K of compressed data corresponding to 64K
> > raw one.
> > Extent EO4 actually stores 15K of compressed data corresponding to 36K
> > raw one.
> > Single 100K write has been splitted into 2 lblocks to address
> > max_logical_unit constraint
> 
> Hmm, as a matter of policy, we might want to force alignment of the extents
> to max_logical_unit.  I think that might reduce fragmentation over time.
> 
> > ****** Step 2
> > ->Write(70, 65), no compression
> >
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO1, 0, 25}
> > 25:    {EO3, 0, 45}
> > 70:    {EO5, 0, 65}
> > -125:   {EO4, 36, 0} -> to be removed as it's totally overwritten ( see X_LEN
> > = 0 )
> > 135:   {EO2, 35, 25}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > -EO4: { POFFS_4, 15, ZLIB, 0}  //totally allocated 16 Kb, can be
> > released as refcount = 0
> > EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> >
> > Entry at at offset 25 entry has been altered and entry at offset 125
> > to be removed. The latter can be done both immediately on map
> > alteration and by some background cleanup procedure.
> >
> >
> > ****** Step 3
> > ->Write(100, 60), compressed to 30K
> >
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO1, 0, 25}
> > 25:    {EO3, 0, 45}
> > 70:    {EO5, 0, 65}
> > 100:   {EO6, 0, 60}
> > -160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see X_LEN
> > = 0 )
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > -EO5: { POFFS_5, 65, NONE, 0}  //totally allocated 68 Kb, can be
> > released as refcount = 0
> > EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> >
> > Entry at offset 100 has been altered and entry at offset 160 to be removed.
> >
> > ****** Step 4
> > ->Write(0, 25), no compression
> >
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO7, 0, 25}
> > -25:     {EO1, 25, 0}   -> to be removed
> > 25:    {EO3, 0, 45}
> > 70:    {EO5, 0, 65}
> > 100:   {EO6, 0, 60}
> > -160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see X_LEN
> > = 0 )
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > -EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb, can be released as
> > refcount = 0
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> > EO7: { POFFS_7, 25, None, 1}   //totally allocated 38 Kb
> >
> > Entry at offset 0 has been overwritten and to be removed.
> >
> > IMPLMENTATION ROADMAP
> 
> .5) Code and review the new data structures.  Include fields and flags for
> both compressoin and checksums.
> 
> > 1) Refactor current Bluestore implementation to introduce the
> > suggested twin-structure design.
> > This will support raw data READ/WRITE without compression. Major
> > policy to implement is lblock bedding.
> > As an additional option DRMW to be implemented to provide a solution
> > equal to the current implementation. This might be useful for performance
> comparison.
> >
> > 2) Add basic compression support using lblock bedding policy.
> > This will lack most of management/statistics features too.
> >
> > 3) Add compression management/statistics. Design to be discussed.
> >
> > 4) Add check sum support. Goals and design to be discussed.
> 
> This sounds good to me!
> 
> FWIW, I think #1 is going to be the hard part.  Once we establish that the disk
> extents are somewhat immutable (because they are compressed or there is
> a coarse checksum or whatever) we'll have to restructure _do_write,
> _do_zero, _do_truncate, and _do_wal_op.  Those four are dicey.
> 
> sage
> 
> 
> 
> > 5) Add RMW/DRMW policies [OPTIONAL]
> >
> > 6) Add background task support for
> compression/defragmentation/cleanup.
> >
> >
> > Thanks,
> > Igor.
> >
> > On 21.03.2016 18:50, Sage Weil wrote:
> > > On Mon, 21 Mar 2016, Igor Fedotov wrote:
> > > > On 19.03.2016 6:14, Allen Samuels wrote:
> > > > > If we're going to both allow compression and delayed overwrite
> > > > > we simply have to handle the case where new data actually
> > > > > overlaps with previous data
> > > > > -- recursively. If I understand the current code, it handles
> > > > > exactly one layer of overlay which is always stored in KV store.
> > > > > We need to generalize this data structure. I'm going to outline
> > > > > a proposal, which If I get wrong, I beg forgiveness -- I'm not
> > > > > as familiar with this code as I would like, especially the
> > > > > ref-counted shared extent stuff. But I'm going to blindly dive
> > > > > in and assume that Sage will correct me when I go off the tracks
> > > > > -- and therefore end up learning how all of this stuff REALLY
> > > > > works.
> > > > >
> > > > > I propose that the current bluestore_extent_t and
> > > > > bluestore_overlay_t be essentially unified into a single
> > > > > structure with a typemark to distinguish between being in KV
> > > > > store or in raw block storage. Here's an example:
> > > > > (for
> > > > > this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O
> size).
> > > > >
> > > > > Struct bluestore_extent_t {
> > > > >      Uint64_t logical_size;			// size of data before
> > > > > any
> > > > > compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and !=
> 0)
> > > > >      Uint64_t physical_size;                              // size of
> > > > > data on
> > > > > physical media (yes, this is unneeded when location == KV, the
> > > > > serialize/deserialize could compress this out --  but this is an
> > > > > unneeded optimization
> > > > >      Uint64_t location:1;                                    // values
> > > > > (in
> > > > > ENUM form) are "KV" and "BLOCK"
> > > > >      Uint64_t compression_alg:4;                  // compression
> > > > > algorithm...
> > > > >      Uint64_t otherflags:xx;                             // round it
> > > > > out.
> > > > >      Uint64_t media_address;                        // forms Key when
> > > > > location == KV block address when location == BLOCK
> > > > >      Vector<uint32_t> checksums;              // Media checksums. See
> > > > > commentary on this below.
> > > > > };
> > > > >
> > > > > This allows any amount of compressed or uncompressed data to be
> > > > > identified in either a KV key or a block store.
> > > > >
> > > > As promised please find a competing proposal for extent map
> > > > structure. It can be used for handling unaligned overlapping
> > > > writes of both compressed/uncompressed data. It seems it's
> > > > applicable for any compression policy but my primary intention was
> > > > to allow overwrites that use totally different extents without the
> > > > touch to the existing(overwritten) ones.
> > > > I.e.
> > > > that's what Sage explained this way some time ago:
> > > >
> > > > "b) we could just leave the overwritten extents alone and
> > > > structure the block_map so that they are occluded.  This will
> > > > 'leak' space for some write patterns, but that might be okay given
> > > > that we can come back later and clean it up, or refine our strategy to be
> smarter."
> > > >
> > > > Nevertheless the corresponding infrastructure seems to be
> > > > applicable for different use cases too.
> > > >
> > > > At first let's consider simple raw data overwrite case. No
> > > > compression, checksums, flags at this point for the sake of simplicity.
> > > > Block map entry to be defined as follows:
> > > > OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN> where EXT_OFFS,
> EXT_LEN
> > > > - allocated extent offset and size, AKA physical address and size.
> > > > X_OFFS - relative offset within the block where valid (not
> > > > overwritten) data starts. Full data offset = OFFS + X_OFFS X_LEN -
> > > > valid data size.
> > > > Invariant: Block length == X_OFFS + X_LEN
> > > >
> > > > Let's consider sample block map transform:
> > > > --------------------------------------------------------
> > > > ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
> > > > ->Write(0,50)
> > > > ->Write(100, 50)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 50}
> > > > 100: {EO2, 50, 0, 50}
> > > >
> > > > Where EO1, EO2 - physical addresses for allocated extents.
> > > > Two new entries have been inserted.
> > > >
> > > > ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
> > > > ->Write(25,100)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 25}
> > > > 25:    {EO3, 100, 0, 100}
> > > > 125: {EO2, 50, 25, 25}
> > > >
> > > > As one can see new entry at offset 25 has appeared and previous
> > > > entries have been altered (including the map key (100->125) for
> > > > the last entry). No physical extents reallocation took place
> > > > though - just a new one at E03 has been allocated.
> > > > Please note that client accessible data for block EO2 are actually
> > > > stored at
> > > > EO2 + X_OFF(=25) and have 25K only despite the fact that extent
> > > > has 50K total.
> > > > The same for block EO1 - valid data length = 25K only.
> > > >
> > > >
> > > > ****** Step 2 ( overwrite that partially overlaps existing blocks once
> > > > again):
> > > > ->Write(70, 65)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 25}
> > > > 25:    {EO3, 100, 0, 45}
> > > > 70:    {EO4, 65, 0, 65}
> > > > 135: {EO2, 50, 35, 15}
> > > >
> > > > Yet another new entry. Overlapped block entries at 25 & 125 were
> altered.
> > > >
> > > > ****** Step 3 ( overwrite that partially overlaps one block and totally
> > > > overwrite the last one):
> > > > ->Write(100, 60)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 25}
> > > > 25:    {EO3, 100, 0, 45}
> > > > 70:    {EO4, 65, 0, 35}
> > > > 100: {EO5, 60, 0, 60}
> > > > -140: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
> > > > X_LEN = 0 )
> > > >
> > > > Entry for EO4 have been altered and entry EO2 to be removed. The
> latter
> > > > can be
> > > > done both immediately on map alteration and by some background
> cleanup
> > > > procedure.
> > > >
> > > > ****** Step 4 ( overwrite that totally overlap the first block):
> > > > ->Write(0, 25)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO6, 25, 0, 25}
> > > > - 0:      {EO1, 50, 25, 0}  -> to be removed
> > > > 25:    {EO3, 100, 0, 45}
> > > > 70:    {EO4, 65, 0, 35}
> > > > 100: {EO5, 60, 0, 60}
> > > >
> > > > Entry for EO1 has been overwritten and to be removed.
> > > >
> > > > --------------------------------------------------------------------------------------
> > > >
> > > > Extending this block map for compression is trivial - we need to
> introduce
> > > > compression algorithm flag to the map. And vary EXT_LEN (and actual
> > > > physical
> > > > allocation) depending on the actual compression ratio.
> > > > E.g. with ratio=3 (60K reduced to 20K) the record from the last step turn
> > > > into
> > > > :
> > > > 100: {EO5, 20, 0, 60}
> > > >
> > > > Other compression aspects handled by the corresponding policies ( e.g.
> > > > when
> > > > perform the compression ( immediately, lazily or totally in background )
> > > > or
> > > > how to merge neighboring compressed blocks ) probably don't impact
> the
> > > > structure of the map entry - they just shuffle the entries.
> > > This is much simpler!  There is one case we need to address that I don't
> > > see above, though.  Consider,
> > >
> > > - write 0~1048576, and compress it
> > > - write 16384~4096
> > >
> > > When we split the large extent into two pieces, the resulting extent map,
> > > as per above, would be something like
> > >
> > > 0:      {EO1, 1048576, 0, 4096, zlib}
> > > 4096:   {E02, 16384, 0, 4096, uncompressed}
> > > 16384:	{E01, 1048576, 20480, 1028096, zlib}
> > >
> > > ...which is fine, except that it's the *same* compressed extent, which
> > > means the code that decides that the physical extent is no longer
> > > referenced and can be released needs to ensure that no other extents in
> > > the map reference it.  I think that's an O(n) pass across the map when
> > > releasing.
> > >
> > > Also, if we add in checksums, then we'd be duplicating them in the two
> > > instances that reference the raw extent.
> > >
> > > I wonder if it makes sense to break this into two structures.. one that
> > > lists the raw extents, and another that maps them into the logical space.
> > > So that there is one record for {E01, 1048576, zlib, checksums}, and then
> > > the block map is more like
> > >
> > > 0:      {E0, 0, 4096}
> > > 4096:   {E1, 0, 4096}
> > > 16384:	{E0, 20480, 1028096}
> > >
> > > and
> > >
> > > 0: E01, 1048576, 0, 4096, zlib, checksums
> > > 1: E02, 16384, 0, 4096, uncompressed, checksums
> > >
> > > ?
> > >
> > > sage
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html