> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Wednesday, March 16, 2016 2:28 PM
> To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel
> <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: Adding compression support for bluestore.
>
> On Wed, 16 Mar 2016, Igor Fedotov wrote:
> > On 15.03.2016 20:12, Sage Weil wrote:
> > > My current thinking is that we do something like:
> > >
> > > - add a bluestore_extent_t flag for FLAG_COMPRESSED
> > > - add uncompressed_length and compression_alg fields
> > > (- add a checksum field while we are at it, I guess)
> > >
> > > - in _do_write, when we are writing a new extent, we need to
> > > compress it in memory (up to the max compression block), and feed
> > > that size into _do_allocate so we know how much disk space to
> > > allocate.  this is probably reasonably tricky to do, and handles
> > > just the simplest case (writing a new extent to a new object, or
> > > appending to an existing one, and writing the new data
> > > compressed).
> > The current _do_allocate interface and responsibilities will
> > probably need to change quite a bit here.
>
> sounds good so far
>
> > > - define the general (partial) overwrite strategy.  I would like
> > > for this to be part of the WAL strategy.  That is, we do the
> > > read/modify/write as deferred work for the partial regions that
> > > overlap existing extents.  Then _do_wal_op would read the
> > > compressed extent, merge it with the new piece, and write out the
> > > new (compressed) extents.  The problem is that right now the WAL
> > > path *just* does IO--it doesn't do any kv metadata updates, which
> > > would be required here to do the final allocation (we won't know
> > > how big the resulting extent will be until we decompress the old
> > > thing, merge it with the new thing, and recompress).
> > >
> > > But, we need to address this anyway to support CRCs (where we
> > > will similarly do a read/modify/write, calculate a new checksum,
> > > and need to update the onode).  I think the answer here is just
> > > that _do_wal_op updates some in-memory state attached to the wal
> > > operation that gets applied when the wal entry is cleaned up in
> > > _kv_sync_thread (wal_cleaning list).
> > >
> > > Calling into the allocator in the WAL path will be more
> > > complicated than just updating the checksum in the onode, but I
> > > think it's doable.
> > Could you please name the issues with calling the allocator in the
> > WAL path? Proper locking? What else?
>
> I think this bit isn't so bad... we need to add another field to the
> in-memory wal_op struct that records the space allocated in the WAL
> stage, and make sure that gets committed by the kv thread for all of
> the wal_cleaning txc's.
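To make the wal_op piece concrete, here is a rough sketch of what I
think Sage is describing. This is just my reading of it; the field and
type names below are guesses, not actual BlueStore code:

    // Sketch only -- names are guesses, not actual BlueStore code.
    struct wal_op_t {
      // ... existing fields: op code, extents to read/write, data ...

      // New: results of the allocation performed during the WAL apply
      // stage.  _do_wal_op fills these in once the read/merge/recompress
      // step knows the final compressed size of the result.
      vector<bluestore_extent_t> allocated;     // space claimed in WAL stage
      vector<bluestore_extent_t> released;      // old space to free
      bluestore_onode_t          onode_update;  // new block_map/csum state
    };

    // _kv_sync_thread then walks the wal_cleaning list and, in the same
    // kv transaction that retires each wal entry, persists onode_update
    // and commits the allocated/released extents to the freelist.

If that is the shape of it, the allocator call itself doesn't look any
scarier than what the kv thread already commits today.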
> > A potential issue with using WAL for compressed block overwrites
> > is a significant WAL data volume increase. IIUC, currently a WAL
> > record can have up to 2*bluestore_min_alloc_size (i.e. 128K) of
> > client data per single write request - overlapped head and tail.
> > In case of compressed blocks this will be up to
> > 2*bluestore_max_compressed_block (i.e. 8MB), as you can't simply
> > overwrite fully overlapped extents - one has to operate on
> > compression blocks now...
> >
> > Seems attractive otherwise...
>
> I think the way to address this is to make
> bluestore_max_compressed_block *much* smaller.  Like, 4x or 8x
> min_alloc_size, but no more.  That gives us a smallish rounding
> error of "lost" efficiency, but keeps the size of extents we have to
> read+decompress in the overwrite or small read cases reasonable.

Yes, this is generally what people do. It's very hard to have a large
compression window without having the CPU times balloon up.

> The tradeoff is the onode_t's block_map gets bigger... but for a
> ~4MB object it's still only 5-10 records, which sounds fine to me.
>
> > > The alternative is that we either
> > >
> > > a) do the read side of the overwrite in the first phase of the
> > > op, before we commit it.  That will mean a higher commit latency
> > > and will slow down the pipeline, but would avoid the double-write
> > > of the overlap/wal regions.  Or,
> > This is probably the simplest approach without hidden caveats,
> > apart from the latency increase.
> >
> > > b) we could just leave the overwritten extents alone and
> > > structure the block_map so that they are occluded.  This will
> > > 'leak' space for some write patterns, but that might be okay
> > > given that we can come back later and clean it up, or refine our
> > > strategy to be smarter.
> > Just to clarify that I understand the idea properly: are you
> > suggesting to simply write the new block out to a new extent and
> > update the block map (and read procedure) to use either that new
> > extent or the remains of the overwritten extents, depending on the
> > read offset? And overwritten extents are preserved intact until
> > they are fully hidden or some background cleanup procedure merges
> > them?
> > If so, I can see the following pros and cons:
> > + write is faster
> > - compressed data read is potentially slower, as you might need to
> >   decompress more compressed blocks
> > - space usage is higher
> > - need for a garbage collector, i.e. additional complexity
> >
> > Thus the question is which use patterns are in the foreground and
> > should be made the most effective.
> > IMO read performance and space saving are more important for the
> > cases where compression is needed.
> >
> > > What do you think?
> > >
> > > It would be nice to choose a simpler strategy for the first pass
> > > that handles a subset of write patterns (i.e., sequential writes,
> > > possibly unaligned) that is still a step in the direction of the
> > > more robust strategy we expect to implement after that.
> > >
> > I'd probably agree, but... I don't see a good way to implement
> > compression for specific write patterns only.
> > We would need to either ensure that these patterns are used
> > exclusively (append-only / sequential-only flags?) or provide some
> > means to fall back to regular mode when an inappropriate write
> > occurs.
> > I don't think either is good and/or easy enough.
>
> Well, if we simply don't implement a garbage collector, then for
> sequential+aligned writes we don't end up with stuff that needs
> garbage collection.  Even the sequential case might be doable if we
> make it possible to fill the extent with a sequence of compressed
> strings (as long as we haven't reached the compressed length, try to
> restart the decompression stream).
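The sequence-of-compressed-strings trick should work with any of the
usual stream compressors. A minimal sketch of the read side, assuming
a hypothetical decompress_one() helper that inflates exactly one
compressed stream starting at a given offset and returns the number of
compressed bytes it consumed (0 on error):

    // Sketch only: read an extent holding back-to-back compressed
    // streams by restarting decompression at each stream boundary.
    int decompress_extent(bufferlist& raw,        // on-disk bytes
                          uint32_t uncompressed_length,
                          bufferlist* out)
    {
      size_t pos = 0;
      while (pos < raw.length() &&
             out->length() < uncompressed_length) {
        bufferlist chunk;
        size_t consumed = decompress_one(raw, pos, &chunk);
        if (consumed == 0)
          return -EIO;             // corrupt or truncated stream
        out->claim_append(chunk);  // append the inflated piece
        pos += consumed;           // restart at the next stream
      }
      return 0;
    }

An append then just compresses the new tail as its own stream, lays it
down after the last one, and bumps both lengths in the extent record.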
> > In this aspect my original proposal to have the compression engine
> > more or less segregated from bluestore seems more attractive -
> > there is no need to refactor bluestore internals in this case. One
> > can easily start using compression or drop it and fall back to the
> > current code state. No significant modifications in run-time data
> > structures and algorithms...
>
> It sounds good in theory, but when I try to sort out how it would
> actually work, it seems like you have to either expose all of the
> block_map metadata up to this layer, at which point you may as well
> do it down in BlueStore and have the option of deferred WAL work, or
> you do something really simple with fixed compression block sizes
> and get a weak final result.  Not to mention the EC problems
> (although some of that will go away when EC overwrites come
> along)...
>
> sage
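Coming back to the first bullet in Sage's list: just so we're all
picturing the same on-disk change, here is roughly how I read the
bluestore_extent_t proposal. Again a sketch only; the names, widths,
and the flag bit are my guesses:

    // Rough sketch, not actual code -- names and widths are guesses.
    struct bluestore_extent_t {
      uint64_t offset;              // disk offset (existing)
      uint32_t length;              // on-disk (compressed) length (existing)
      uint32_t flags;               // existing flags word; add:
      static const uint32_t FLAG_COMPRESSED = 1 << 3;  // any unused bit

      // New fields, meaningful only when FLAG_COMPRESSED is set:
      uint32_t uncompressed_length; // logical length this extent covers
      uint8_t  compression_alg;     // algorithm enum (zlib, snappy, ...)
      uint32_t csum;                // the checksum "while we are at it"
    };

Keeping both lengths in the record lets the read path size its buffers
before decompressing, and the csum lets the WAL read/modify/write
verify the old block before merging new data into it.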