Re: Adding Data-At-Rest compression support to Ceph

On 23.09.2015 21:03, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:

>> The idea of making the primary responsible for object compression
>> really concerns me. It means for instance that a single random access
>> will likely require access to multiple objects, and breaks many of the
>> optimizations we have right now or in the pipeline (for instance:
>> direct client access).
>> Could you please elaborate on why access to multiple objects is
>> required for a single random access?
> It sounds to me like you were planning to take an incoming object
> write, compress it, and then chunk it. If you do that, the symbols
> ("abcdefgh = a", "ijklmnop = b", etc) for the compression are likely
> to reside in the first object and need to be fetched for each read in
> other objects.
Gregory,
do you mean a compressor dictionary by symbols like "abcdefgh = a" here? And your assumption is that such a dictionary is built on the first write, saved, and reused by all subsequent reads, right? I think that's not the case: it's better to compress each write independently. Then there is no need to access a "dictionary" object (i.e. the first object holding these symbols) on every read operation; a read uses the compressed block data only.
Yes, this may reduce the total compression ratio, but I think that's acceptable.
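To illustrate what I mean by independent compression, here is a rough sketch (the block size, names, and the use of zlib as a stand-in compressor are all mine, purely illustrative, not a description of actual Ceph code). Each block carries its own compression state, so decompressing one block never requires reading another:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical compression unit, not an actual Ceph constant

def compress_blocks(data: bytes):
    """Compress each fixed-size block independently: no shared dictionary,
    so a read touches only the one compressed block covering its offset."""
    blocks = []
    for off in range(0, len(data), BLOCK_SIZE):
        raw = data[off:off + BLOCK_SIZE]
        comp = zlib.compress(raw)
        if len(comp) < len(raw):
            blocks.append((True, comp))   # stored compressed
        else:
            blocks.append((False, raw))   # incompressible: store as-is
    return blocks

def read_block(compressed: bool, payload: bytes) -> bytes:
    """Decompress a single block; no other block is needed."""
    return zlib.decompress(payload) if compressed else payload
```

The per-block fallback to raw storage also means incompressible writes cost nothing extra on the read path.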
>> In my opinion we need to access exactly the same object set as before: in
>> an EC pool each appended block is split into multiple shards that go to
>> the respective OSDs. In the general case one has to retrieve a set of
>> adjacent shards from several OSDs on a single read request.
> Usually we just need to get the object info from the primary and then
> read whichever object has the data for the requested region. If the
> region spans a stripe boundary we might need to get two, but often we
> don't...
With the independent block compression mentioned above the scenario is the same. The only thing we need in order to find the proper compressed block is a mapping from the original data offsets to the compressed ones. We can store that mapping as object metadata; thus a read needs only the object metadata in addition to the data itself.
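A sketch of what such per-object metadata might look like (the extent layout and numbers are hypothetical, chosen only to show the lookup): the map alone translates a logical read offset to the stored compressed extent, with no need to read any other object first.

```python
import bisect

# Hypothetical per-object metadata: extents sorted by logical offset,
# each recording where its compressed bytes live in the stored object.
# (logical_off, logical_len, physical_off, physical_len)
extent_map = [
    (0,      65536, 0,     21000),
    (65536,  65536, 21000, 18500),
    (131072, 65536, 39500, 30200),
]

def locate(logical_off: int):
    """Map a logical read offset to the compressed extent holding it.
    The extent map is the only extra thing a read has to fetch."""
    starts = [e[0] for e in extent_map]
    i = bisect.bisect_right(starts, logical_off) - 1
    l_off, l_len, p_off, p_len = extent_map[i]
    if not (l_off <= logical_off < l_off + l_len):
        raise ValueError("offset beyond end of object")
    return p_off, p_len
```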
>> In the case of compression the only difference is in the data range that
>> the compressed shard set occupies. I.e. we simply need to translate the
>> requested data range to the actually stored one and retrieve that data
>> from the OSDs. What am I missing?
>> And apparently only the EC pool will support
>> compression, which is frustrating for all the replicated pool users
>> out there...
>> In my opinion replicated pool users should consider EC pools first if
>> they care about space savings; they automatically gain 50% space savings
>> that way. Compression brings even more savings, but that's rather a
>> second step along the same path.
> EC pools have important limitations that replicated pools don't, like
> not working for object classes or allowing random overwrites. You can
> stick a replicated cache pool in front but that comes with another
> whole can of worms. Anybody with a large enough proportion of active
> data won't find that solution suitable but might still want to reduce
> space required where they can, like with local compression.
Well, I agree that having compression support for both replicated and EC pools would be better. But random access (and probably other advanced features) requires much more complex data handling, which also brings additional overhead; actually, I suppose EC pools have those limitations for exactly these reasons. Thus my original idea was to simplify the compression implementation on one side and make it in line with EC usage on the other. The latter makes sense since compression and EC have pretty much the same motivation: trading CPU for space.

And just for the sake of my education, could you please mention or point out the existing issues with cache + EC pool usage? How widely are EC pools used in production at all, or are they rather an experimental/secondary option?
>> Is there some reason we don't just want to apply compression across an
>> OSD store? Perhaps doing it on the filesystem level is the wrong way
>> (for reasons named above) but there are other mechanisms like inline
>> block device compression that I think are supposed to work pretty
>> well.
>> If I understand the idea of inline block device compression correctly, it
>> has some drawbacks similar to the FS compression approach. To mention a few:
>> * Less flexibility - per-device compression only, no way to have per-pool
>> compression. No control over the compression process.
> What would the use case be here? I can imagine not wanting to slow
> down your cache pools with it or something (although realistically I
> don't think that's a concern unless the sheer CPU usage is a problem
> with frequent writes), but those would be on separate OSDs/volumes
> anyway.
Well, I can imagine the need to have compression enabled for some specific backing pools (e.g. with seldom-accessed or highly compressible data) and disabled for others, e.g. where the original data is incompressible (either already compressed or encrypted). Potentially we could even have an option to control compression on a per-object basis and provide hints for clients to enable it for specific use cases. Another feature that might be useful is the ability to disable/re-enable compression during the OSD life cycle, e.g. when an administrator realizes it's not appropriate for his use case. I doubt that's easy to do when compression is performed at the device level.
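Something along these lines is what I have in mind (the mode names, the hint values, and the function are all hypothetical, just to make the per-pool + per-object idea concrete):

```python
from enum import Enum

class CompressHint(Enum):
    NONE = 0
    COMPRESSIBLE = 1      # client expects the data to compress well
    INCOMPRESSIBLE = 2    # already compressed or encrypted payload

def should_compress(pool_mode: str, hint: CompressHint) -> bool:
    """Hypothetical per-pool policy combined with an optional client hint."""
    if pool_mode == "none":        # compression disabled for this pool
        return False
    if pool_mode == "force":       # always compress, ignore hints
        return True
    if pool_mode == "passive":     # compress only when the client asks
        return hint is CompressHint.COMPRESSIBLE
    if pool_mode == "aggressive":  # compress unless the client opts out
        return hint is not CompressHint.INCOMPRESSIBLE
    raise ValueError(f"unknown pool mode: {pool_mode}")
```

A per-pool mode like this could also be flipped during the OSD life cycle without touching the device layer, which is exactly the flexibility a block-device approach lacks.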

> Plus block device compression is also able to include all the *other*
> stuff that doesn't fit inside the object proper (xattrs and omap).
Yes, that's a good point, but I suppose nothing prevents us from compressing metadata ourselves too.
>> * Potentially higher overhead when operating - there is no way to bypass
>> processing of incompressible data, e.g. shards with erasure codes.
> My information theory intuition has never been very good, but I don't
> think the coded chunks are any less compressible than the data they're
> coding for, in general...
Yes, my bad. I played with EC a bit - the generated chunks are pretty regular. I expected something completely random, like encrypted data.
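A toy demonstration of why (using XOR parity as a stand-in for an EC coding chunk; real erasure codes use Galois-field arithmetic, but the point about regularity is the same):

```python
import zlib

# Two repetitive data chunks, as typical storage payloads often are.
a = (b"hello world, " * 700)[:8192]
b = (b"log line 42\n" * 700)[:8192]

# Toy "coding chunk": byte-wise XOR parity of the two data chunks.
parity = bytes(x ^ y for x, y in zip(a, b))

ratio = len(zlib.compress(parity)) / len(parity)
# Unlike encrypted (random-looking) bytes, the XOR of two repetitive
# streams is itself repetitive, so the parity chunk compresses well.
```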

> ...I should note that I'm under the impression that transparent
> compression already exists at some level which can be stacked with
> regular filesystems, but I'm not finding it now, so maybe I'm
> misinformed and the tradeoffs are a little different than I thought.
I found some mentions of an RBD device that performs inline compression, but the pretty limited information available on the net makes me think that solution is far from production usage.

> But I still don't like the idea of doing it on a primary just for EC
> pools – I think if we were going to take that approach it'd be easier
> to compress somewhere before it reaches the EC/replicated split?
As I mentioned above, the main reasons that pushed me to merge compression with EC pools are the similar data-handling issues and the similar trade-off (space for CPU) they provide.
Moving compression anywhere else raises many, many complications.

Anyway, I will try to put together a summary of the suggested approaches and their pros and cons.

Thanks,
Igor

PS. Gregory, I highly appreciate your feedback.



