Re: Adding compression support for bluestore.

This time without html (thanks gmail)!

On 17 March 2016 at 09:43, Blair Bethwaite <blair.bethwaite@xxxxxxxxx> wrote:
> Hi Igor, Allen, Sage,
>
> Apologies for the interjection into the technical back-and-forth here, but I
> want to ask a question / make a request from the user/operator perspective
> (possibly relevant to other advanced bluestore features too)...
>
> Can a feature like this expose metrics (e.g., compression ratio) back up to
> higher layers such as rados, which could then be used to automate use of the
> feature? As a user/operator, implicit compression support in the backend is
> exciting, but it's something I'd want rados/librbd to be capable of toggling
> on/off automatically based on a threshold (e.g., librbd could toggle
> compression off at the image level if the first n rados objects
> written/edited since turning compression on are compressed less than c%).
> This sort of thing would help to avoid unnecessary overheads and would cater
> to mixed use-cases (e.g. cloud provider block storage) where in general the
> operator wants compression on but has no idea what users are doing with
> their internal filesystems. It would also mesh nicely with any future
> "distributed" compression implemented on the librbd client side (which would
> again likely be an rbd toggle).
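(Purely to illustrate the kind of threshold policy meant here -- the names and
structure below are invented for the example, not an existing librbd
interface:)

  #include <cstdint>

  // Sketch only: client-side policy that keeps compression on for an image
  // only while the first N objects written show a good enough compression
  // ratio.  None of these names exist in librbd.
  struct CompressionProbe {
    uint64_t objects_seen = 0;
    uint64_t raw_bytes = 0;     // logical bytes written
    uint64_t stored_bytes = 0;  // bytes actually stored, as reported back
  };

  bool keep_compression(const CompressionProbe& p,
                        uint64_t probe_objects,  // the "n" above
                        double min_savings)      // the "c%" above, e.g. 0.2
  {
    if (p.objects_seen < probe_objects)
      return true;  // not enough data yet to decide
    double savings = p.raw_bytes
        ? 1.0 - double(p.stored_bytes) / double(p.raw_bytes)
        : 0.0;
    // below the threshold, librbd would toggle compression off for the image
    return savings >= min_savings;
  }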
>
> Cheers,
>
> On 17 March 2016 at 06:41, Allen Samuels <Allen.Samuels@xxxxxxxxxxx> wrote:
>>
>> > -----Original Message-----
>> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>> > Sent: Wednesday, March 16, 2016 2:28 PM
>> > To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
>> > Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-
>> > devel@xxxxxxxxxxxxxxx>
>> > Subject: Re: Adding compression support for bluestore.
>> >
>> > On Wed, 16 Mar 2016, Igor Fedotov wrote:
>> > > On 15.03.2016 20:12, Sage Weil wrote:
>> > > > My current thinking is that we do something like:
>> > > >
>> > > > - add a bluestore_extent_t flag for FLAG_COMPRESSED
>> > > > - add uncompressed_length and compression_alg fields
>> > > > (- add a checksum field while we are at it, I guess)
>> > > >
>> > > > - in _do_write, when we are writing a new extent, we need to
>> > > > compress it in memory (up to the max compression block), and feed
>> > > > that size into _do_allocate so we know how much disk space to
>> > > > allocate.  This is probably reasonably tricky to do, and handles
>> > > > just the simplest case (writing a new extent to a new object, or
>> > > > appending to an existing one, and writing the new data compressed).
>> > > > The current _do_allocate interface and responsibilities will
>> > > > probably need to change quite a bit here.
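(For concreteness, a rough sketch of what such an extent record could carry --
the field names are guesses from the list above, not the actual
bluestore_extent_t:)

  #include <cstdint>

  // Sketch only: an extent record extended along the lines described above.
  struct bluestore_extent_sketch_t {
    enum {
      FLAG_COMPRESSED = 1,             // payload on disk is compressed
    };
    uint64_t offset = 0;               // physical offset on the block device
    uint32_t length = 0;               // allocated (compressed) length on disk
    uint32_t flags = 0;
    // meaningful only when FLAG_COMPRESSED is set:
    uint32_t uncompressed_length = 0;  // logical length before compression
    uint8_t  compression_alg = 0;      // e.g. an enum for zlib/snappy/...
    uint32_t csum = 0;                 // the "while we are at it" checksum
  };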
>> > > sounds good so far
>> > > > - define the general (partial) overwrite strategy.  I would like for
>> > > > this to be part of the WAL strategy.  That is, we do the
>> > > > read/modify/write as deferred work for the partial regions that
>> > > > overlap existing extents.
>> > > > Then _do_wal_op would read the compressed extent, merge it with the
>> > > > new piece, and write out the new (compressed) extents.  The problem
>> > > > is that right now the WAL path *just* does IO--it doesn't do any kv
>> > > > metadata updates, which would be required here to do the final
>> > > > allocation (we won't know how big the resulting extent will be until
>> > > > we decompress the old thing, merge it with the new thing, and
>> > > > recompress).
>> > > >
>> > > > But, we need to address this anyway to support CRCs (where we will
>> > > > similarly do a read/modify/write, calculate a new checksum, and need
>> > > > to update the onode).  I think the answer here is just that the
>> > > > _do_wal_op updates some in-memory-state attached to the wal
>> > > > operation that gets applied when the wal entry is cleaned up in
>> > > > _kv_sync_thread (wal_cleaning list).
>> > > >
>> > > > Calling into the allocator in the WAL path will be more complicated
>> > > > than just updating the checksum in the onode, but I think it's
>> > > > doable.
>> > > Could you please name the issues with calling the allocator in the WAL
>> > > path? Proper locking? What else?
>> >
>> > I think this bit isn't so bad... we need to add another field to the
>> > in-memory wal_op struct that includes space allocated in the WAL stage,
>> > and make sure that gets committed by the kv thread for all of the
>> > wal_cleaning txc's.
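(As a sketch of what that extra field could carry -- the real wal_op/txc types
will differ:)

  #include <cstdint>
  #include <utility>
  #include <vector>

  // Sketch only: state attached to an in-flight wal op so that allocations
  // made during _do_wal_op can be committed later by the kv thread when the
  // txc reaches the wal_cleaning list.
  struct wal_op_alloc_state_sketch {
    // (disk offset, length) pairs newly allocated while applying the wal op
    std::vector<std::pair<uint64_t, uint32_t>> allocated;
    // space freed because the overwrite superseded old compressed extents
    std::vector<std::pair<uint64_t, uint32_t>> released;
    // plus the updated onode extent map / checksums, to be encoded into the
    // same kv transaction that cleans up the wal entry
  };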
>> >
>> > > A potential issue with using the WAL for compressed block overwrites is
>> > > a significant increase in WAL data volume. IIUC, currently a WAL record
>> > > can carry up to 2*bluestore_min_alloc_size (i.e. 128K) of client data
>> > > per single write request - the overlapped head and tail.
>> > > In the case of compressed blocks this grows to up to
>> > > 2*bluestore_max_compressed_block (i.e. 8MB), since you can't simply
>> > > overwrite fully overlapped extents - one has to operate on whole
>> > > compression blocks now...
>> > >
>> > > Seems attractive otherwise...
>> >
>> > I think the way to address this is to make bluestore_max_compressed_block
>> > *much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That gives
>> > us a smallish rounding error of "lost" efficiency, but keeps the size of
>> > extents we have to read+decompress in the overwrite or small read cases
>> > reasonable.
>> >
>>
>> Yes, this is generally what people do.  It's very hard to have a large
>> compression window without having the CPU time balloon.
>>
>> > The tradeoff is the onode_t's block_map gets bigger... but for a ~4MB
>> > object it's still only 5-10 records, which sounds fine to me.
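(For example, with bluestore_min_alloc_size = 64K as implied above and a max
compressed block of 8x that, i.e. 512K, a ~4MB object maps to at most
4096K / 512K = 8 extent records, consistent with the 5-10 figure.)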
>> >
>> > > > The alternative is that we either
>> > > >
>> > > > a) do the read side of the overwrite in the first phase of the op,
>> > > > before we commit it.  That will mean a higher commit latency and
>> > > > will slow down the pipeline, but would avoid the double-write of the
>> > > > overlap/wal regions.  Or,
>> > > This is probably the simplest approach, with no hidden caveats other
>> > > than the latency increase.
>> > > >
>> > > > b) we could just leave the overwritten extents alone and structure
>> > > > the block_map so that they are occluded.  This will 'leak' space for
>> > > > some write patterns, but that might be okay given that we can come
>> > > > back later and clean it up, or refine our strategy to be smarter.
>> > > Just to clarify that I understand the idea properly: are you suggesting
>> > > simply writing the new block out to a new extent and updating the block
>> > > map (and the read procedure) to use either that new extent or the
>> > > remains of the overwritten extents, depending on the read offset? And
>> > > the overwritten extents are preserved intact until they are fully
>> > > hidden or some background cleanup procedure merges them.
>> > > If so, I can see the following pros and cons:
>> > > + writes are faster
>> > > - compressed data reads are potentially slower, as you might need to
>> > > decompress more compressed blocks.
>> > > - space usage is higher
>> > > - a garbage collector is needed, i.e. additional complexity
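(To make the occlusion idea concrete on the read side, a purely illustrative
sketch of how a lookup might pick the newest extent covering an offset -- none
of this is existing BlueStore code:)

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Sketch only: resolve a read offset against a block map in which newer
  // extents may partially occlude older ones (no garbage collection yet).
  struct MappedExtent {
    uint64_t logical_off;  // offset within the object
    uint32_t length;       // logical length covered
    uint64_t write_seq;    // newer writes carry a higher sequence number
    // ... reference to the (possibly compressed) on-disk extent ...
  };

  // Return the index of the newest extent covering 'off', or -1 for a hole.
  int resolve_read(const std::vector<MappedExtent>& map, uint64_t off) {
    int best = -1;
    for (size_t i = 0; i < map.size(); ++i) {
      const MappedExtent& e = map[i];
      if (off >= e.logical_off && off < e.logical_off + e.length &&
          (best < 0 || e.write_seq > map[best].write_seq))
        best = int(i);
    }
    return best;
  }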
>> > >
>> > > Thus the question is which usage patterns are most common and should be
>> > > handled most efficiently.
>> > > IMO read performance and space savings are more important for the cases
>> > > where compression is needed.
>> > >
>> > > > What do you think?
>> > > >
>> > > > It would be nice to choose a simpler strategy for the first pass
>> > > > that handles a subset of write patterns (i.e., sequential writes,
>> > > > possibly unaligned) that is still a step in the direction of the more
>> > > > robust strategy we expect to implement after that.
>> > > >
>> > > I'd probably agree, but... I don't see a good way to implement
>> > > compression for specific write patterns only.
>> > > We would need to either ensure that these patterns are used exclusively
>> > > (append-only / sequential-only flags?) or provide some means to fall
>> > > back to regular mode when an inappropriate write occurs.
>> > > I don't think either option is good and/or easy enough.
>> >
>> > Well, if we simply don't implement a garbage collector, then for
>> > sequential+aligned writes we don't end up with stuff that needs garbage
>> > collection.  Even the sequential case might be doable if we make it
>> > possible to fill the extent with a sequence of compressed strings (as
>> > long as we haven't reached the compressed length, try to restart the
>> > decompression stream).
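(i.e., on read, something along these lines -- sketch only; decompress_one()
is a hypothetical helper, not an existing API:)

  #include <cstddef>
  #include <string>

  // Hypothetical helper: decompress a single chunk starting at 'p' and report
  // how many compressed bytes it consumed.
  std::string decompress_one(const char* p, size_t avail, size_t* consumed);

  // Sketch only: an extent filled by successive appends, each stored as its
  // own compressed chunk; on read, keep restarting decompression until the
  // recorded compressed length has been consumed.
  std::string decompress_extent(const char* buf, size_t compressed_len) {
    std::string out;
    size_t pos = 0;
    while (pos < compressed_len) {
      size_t consumed = 0;
      out += decompress_one(buf + pos, compressed_len - pos, &consumed);
      pos += consumed;  // the next append's chunk starts right after this one
    }
    return out;
  }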
>> >
>> > > In this respect my original proposal to have the compression engine
>> > > more or less segregated from bluestore seems more attractive - there is
>> > > no need to refactor bluestore internals in this case. One can easily
>> > > start using compression or drop it and fall back to the current code
>> > > state. No significant modifications to run-time data structures and
>> > > algorithms....
>> >
>> > It sounds good in theory, but when I try to sort out how it would
>> > actually work, it seems like you have to either expose all of the
>> > block_map metadata up to this layer, at which point you may as well do it
>> > down in BlueStore and have the option of deferred WAL work, or you do
>> > something really simple with fixed compression block sizes and get a weak
>> > final result.  Not to mention the EC problems (although some of that will
>> > go away when EC overwrites come along)...
>> >
>> > sage
>
>
>
>
> --
> Cheers,
> ~Blairo



-- 
Cheers,
~Blairo


