Re: Adding compression support for bluestore.

Igor Fedotov <ifedotov@xxxxxxxxxxxx> · Thu, 17 Mar 2016 18:21:28 +0300

Blair, Allen,

I'd totally agree that we need to address these compression management 
aspects as well.
Will try to sort out that soon.

Thanks a lot for you valuable comments.

Igor

On 17.03.2016 6:21, Allen Samuels wrote:
No apology needed.

We've been totally focused on discussing the mechanism of compression and really haven't started talking about policy or statistics. We certainly can't be complete without addressing the kinds of issues  that you raise.

All of the proposed compression architectures allow the ability to selectively enable/disable compression (including presumably the selection of specific algorithm and parameters) but there's been no discussion of the specific ways to enable same. I've always imagined a default per-pool compression setting that could be overridden on a per-RADOS operation basis. This would allow the clients maximum flexibility (RGW trivially can tell us when it's already compressed the data, CephFS could have per-directory metadata, etc.) in controlling compression, etc. Details are TBD.

w.r.t. statistics, BlueStore will have high-precision compression information at the end of each write operation. No reason why this can't be reflected back up the RADOS operation chain for dynamic control (as you describe). I would like to see this information be accumulated and aggregated in order to provide static metrics also. Things like compression ratios per-pool, etc.

Clearly the implementation of compression is incomplete until these are addressed.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

-----Original Message-----
From: Blair Bethwaite [mailto:blair.bethwaite@xxxxxxxxx]
Sent: Wednesday, March 16, 2016 5:57 PM
To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>; Allen Samuels
<Allen.Samuels@xxxxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>
Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: Adding compression support for bluestore.

This time without html (thanks gmail)!

On 17 March 2016 at 09:43, Blair Bethwaite <blair.bethwaite@xxxxxxxxx>
wrote:
Hi Igor, Allen, Sage,

Apologies for the interjection into the technical back-and-forth here,
but I want to ask a question / make a request from the user/operator
perspective (possibly relevant to other advanced bluestore features too)...

Can a feature like this expose metrics (e.g., compression ratio) back
up to higher layers such as rados that could then be used to automate
use of the feature? As a user/operator implicit compression support in
the backend is exciting, but it's something I'd want rados/librbd
capable of toggling on/off automatically based on a threshold (e.g.,
librbd could toggle compression off at the image level if the first n
rados objects written/edited since turning compression on are
compressed less than c%) - this sort of thing would obviously help to
avoid unnecessary overheads and would cater to mixed use-cases (e.g.
cloud provider block storage) where in general the operator wants
compression on but has no idea what users are doing with their
internal filesystems, it'd also mesh nicely with any future
"distributed"-compression implemented at the librbd client-side (which
would again likely be an rbd toggle).
Cheers,

On 17 March 2016 at 06:41, Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
wrote:
-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Wednesday, March 16, 2016 2:28 PM
To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-
devel@xxxxxxxxxxxxxxx>
Subject: Re: Adding compression support for bluestore.

On Wed, 16 Mar 2016, Igor Fedotov wrote:
On 15.03.2016 20:12, Sage Weil wrote:
My current thinking is that we do something like:

- add a bluestore_extent_t flag for FLAG_COMPRESSED
- add uncompressed_length and compression_alg fields
(- add a checksum field we are at it, I guess)

- in _do_write, when we are writing a new extent, we need to
compress it in memory (up to the max compression block), and
feed that size into _do_allocate so we know how much disk space
to allocate.  this is probably reasonably tricky to do, and
handles just the simplest case (writing a new extent to a new
object, or appending to an existing one, and writing the new data
compressed).
The current _do_allocate interface and responsibilities will
probably need
to change quite a bit here.
sounds good so far
- define the general (partial) overwrite strategy.  I would
like for this to be part of the WAL strategy.  That is, we do
the read/modify/write as deferred work for the partial regions
that overlap
existing extents.
Then _do_wal_op would read the compressed extent, merge it with
the new piece, and write out the new (compressed) extents.  The
problem is that right now the WAL path *just* does IO--it
doesn't do any kv metadata updates, which would be required
here to do the final allocation (we won't know how big the
resulting extent will be until we decompress the old thing,
merge it with the new thing, and
recompress).
But, we need to address this anyway to support CRCs (where we
will similarly do a read/modify/write, calculate a new
checksum, and need to update the onode).  I think the answer
here is just that the _do_wal_op updates some in-memory-state
attached to the wal operation that gets applied when the wal
entry is cleaned up in _kv_sync_thread (wal_cleaning list).

Calling into the allocator in the WAL path will be more
complicated than just updating the checksum in the onode, but I
think it's doable.
Could you please name the issues for calling allocator in WAL path?
Proper locking? What else?
I think this bit isn't so bad... we need to add another field to
the in-memory wal_op struct that includes space allocated in the
WAL stage, and make sure that gets committed by the kv thread for
all of the wal_cleaning txc's.

A potential issue with using WAL for compressed block overwrites
is significant WAL data volume increase. IIUC currently WAL
record can have up to 2*bluestore_min_alloc_size (i.e. 128K)
client data per single write request - overlapped head and tail.
In case of compressed blocks this will be up to
2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
overwrite fully overlapped extents - one should operate
compression
blocks now...
Seems attractive otherwise...
I think the way to address this is to make
bluestore_max_compressed_block
*much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That
gives us a smallish rounding error of "lost" efficiency, but keeps
the size of extents we have to read+decompress in the overwrite or
small read cases reasonable.

Yes, this is generally what people do.  It's very hard to have a
large compression window without having the CPU times balloon up.

The tradeoff is the onode_t's block_map gets bigger... but for a
~4MB it's still only 5-10 records, which sounds fine to me.

The alternative is that we either

a) do the read side of the overwrite in the first phase of the
op, before we commit it.  That will mean a higher commit
latency and will slow down the pipeline, but would avoid the
double-write of the overlap/wal regions.  Or,
This is probably the simplest approach without hidden caveats but
latency increase.
b) we could just leave the overwritten extents alone and
structure the block_map so that they are occluded.  This will
'leak' space for some write patterns, but that might be okay
given that we can come back later and clean it up, or refine our
strategy to be smarter.
Just to clarify I understand the idea properly. Are you
suggesting to simply write out new block to a new extent and
update block map (and read procedure) to use that new extent or
remains of the overwritten extents depending on the read offset?
And overwritten extents are preserved intact until they are fully
hidden or some background cleanup
procedure merge them.
If so I can see following pros and cons:
+ write is faster
- compressed data read is potentially slower as you might need to
decompress more compressed blocks.
- space usage is higher
- need for garbage collector i.e. additional complexity

Thus the question is what use patterns are at foreground and
should be the most effective.
IMO read performance and space saving are more important for the
cases where compression is needed.

What do you think?

It would be nice to choose a simpler strategy for the first
pass that handles a subset of write patterns (i.e., sequential
writes, possibly
unaligned) that is still a step in the direction of the more
robust strategy we expect to implement after that.

I'd probably agree but.... I don't see a good way how one can
implement compression for specific write patterns only.
We need to either ensure that these patterns are used exclusively
( append only / sequential only flags? ) or provide some means to
fall back to regular mode when inappropriate write occurs.
Don't think both are good and/or easy enough.
Well, if we simply don't implement a garbage collector, then for
sequential+aligned writes we don't end up with stuff that needs
sequential+garbage
collection.  Even the sequential case might be doable if we make it
possible to fill the extent with a sequence of compressed strings
(as long as we haven't reached the compressed length, try to
restart the decompression stream).

In this aspect my original proposal to have compression engine
more or less segregated from the bluestore seems more attractive
- there is no need to refactor bluestore internals in this case.
One can easily start using compression or drop it and fall back
to the current code state. No significant modifications in
run-time data structures and
algorithms....

It sounds like in theory, but when I try to sort out how it would
actually work, it seems like you have to either expose all of the
block_map metadata up to this layer, at which point you may as well
do it down in BlueStore and have the option of deferred WAL work,
or you do something really simple with fixed compression block
sizes and get a weak final result.  Not to mention the EC problems
(although some of that will go away when EC overwrites come
along)...

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx More
majordomo
info at  http://vger.kernel.org/majordomo-info.html

--
Cheers,
~Blairo

--
Cheers,
~Blairo

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html