Re: Adding Data-At-Rest compression support to Ceph

On 23.09.2015 17:05, Gregory Farnum wrote:
> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>>> Hi Sage,
>>> thanks a lot for your feedback.
>>>
>>> Regarding the issues with offset mapping and stripe size exposure:
>>> what about the idea of applying compression in the two-tier
>>> (cache + backing storage) model only?
>> I'm not sure we win anything by making it a two-tier only thing... simply
>> making it a feature of the EC pool means we can also address EC pool users
>> like radosgw.

>>> I doubt the single-tier mode is widely used for EC pools since there is
>>> no random write support there, so this might be an acceptable limitation.
>>> At the same time it seems that appends caused by cached object flushes
>>> have a fixed block size (8 MB by default), and the object is totally
>>> rewritten on the next flush, if any. This makes the offset mapping less
>>> tricky.
>>> Decompression should be applied in either model though, as shutting down
>>> the cache tier and then accessing the compressed data is probably a valid
>>> use case.
>> Yeah, we need to handle random reads either way, so I think the offset
>> mapping is going to be needed anyway.
> The idea of making the primary responsible for object compression
> really concerns me. It means for instance that a single random access
> will likely require access to multiple objects, and breaks many of the
> optimizations we have right now or in the pipeline (for instance:
> direct client access).
Could you please elaborate on why a single random access would require
access to multiple objects? In my opinion we need to access exactly the
same object set as before: in an EC pool each appended block is split into
multiple shards that go to the respective OSDs, and in the general case one
has to retrieve a set of adjacent shards from several OSDs for a single
read request. With compression the only difference is the data range that
the compressed shard set occupies, i.e. we simply need to translate the
requested data range to the one actually stored and retrieve that data from
the OSDs. What am I missing?
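
To make that translation step a bit more concrete, here is a rough sketch
(made-up names, not actual Ceph code) of how a logical read range could be
mapped onto the compressed block records kept per object; the primary would
then fetch just those stored extents, decompress the covering blocks and
return the requested slice:

// Hypothetical sketch: translate a logical read range into the stored
// (compressed) extents that actually have to be fetched. Not Ceph code,
// just an illustration of the per-append-block mapping metadata.
#include <cstdint>
#include <vector>

struct BlockRecord {
  uint64_t logical_off;  // offset in the uncompressed object
  uint64_t logical_len;  // uncompressed length of the append block
  uint64_t stored_off;   // offset of the compressed block as stored
  uint64_t stored_len;   // compressed length (what we actually read)
};

struct StoredExtent {
  uint64_t off;
  uint64_t len;
};

// Returns the stored extents covering the logical range [off, off + len).
// A block is compressed as a unit, so even a partial overlap requires
// reading (and decompressing) the whole compressed block.
std::vector<StoredExtent>
map_read(const std::vector<BlockRecord>& blocks,  // sorted by logical_off
         uint64_t off, uint64_t len)
{
  std::vector<StoredExtent> out;
  const uint64_t end = off + len;
  for (const auto& b : blocks) {
    if (b.logical_off + b.logical_len <= off)
      continue;                      // block ends before the range
    if (b.logical_off >= end)
      break;                         // block starts past the range
    out.push_back({b.stored_off, b.stored_len});
  }
  return out;
}

The same object set is touched as today; only the byte ranges read from the
shards change.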
> And apparently only the EC pool will support
> compression, which is frustrating for all the replicated pool users
> out there...
In my opinion replicated pool users who care about space savings should
consider using an EC pool first: they automatically gain roughly 50% space
saving that way. Compression brings even more saving, but that is rather
the second step along this path.
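
Just to spell out the arithmetic behind that number (my own illustrative
profile choice, 3x replication versus EC 4+2):

// Rough space-overhead arithmetic, illustrative only (no benchmark data):
// 3x replication stores 3.0x the user data, EC k=4/m=2 stores (k+m)/k = 1.5x,
// so switching saves 1 - 1.5/3.0 = 50% of raw capacity before any compression.
#include <cstdio>

int main() {
  const double replicas = 3.0;
  const int k = 4, m = 2;                        // EC data / coding chunks
  const double ec_overhead = double(k + m) / k;  // 1.5x
  const double saving = 1.0 - ec_overhead / replicas;
  std::printf("raw capacity saving: %.0f%%\n", saving * 100.0);  // prints 50%
}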
> Is there some reason we don't just want to apply compression across an
> OSD store? Perhaps doing it on the filesystem level is the wrong way
> (for reasons named above) but there are other mechanisms like inline
> block device compression that I think are supposed to work pretty
> well.
If I understand the idea of inline block device compression correctly, it
has some drawbacks similar to the FS compression approach. To name a few:

* Less flexibility - per-device compression only, with no way to have
  per-pool compression and no control over the compression process.
* Potentially higher overhead during normal operation - there is no way to
  bypass processing of non-compressible data, e.g. shards holding erasure
  codes.
* Potentially higher overhead for recovery on OSD death - one needs to
  decompress the data at the surviving OSDs and compress it again at the
  new OSD. That is not necessary if compression takes place prior to EC,
  as in the sketch below.
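
To illustrate the last point, a minimal sketch (made-up interfaces, not the
actual Ceph, zlib or jerasure APIs) of compressing an append block before it
is erasure-coded, with a bypass for data that does not compress well:

// Hypothetical "compress, then erasure-code" step on the primary.
// The compressor is a stand-in; the point is the ordering and the bypass
// for incompressible data, neither of which block-device compression offers.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

using Buffer = std::vector<uint8_t>;

struct StoredBlock {
  bool compressed;  // recorded in block metadata so reads know what to do
  Buffer payload;   // compressed bytes, or the original data if bypassed
};

StoredBlock prepare_append(const Buffer& data,
                           const std::function<Buffer(const Buffer&)>& compress)
{
  Buffer c = compress(data);
  // Bypass: if compression saves less than ~10%, store the data as-is.
  // EC shards are then computed from whatever we actually store, so a lost
  // shard is rebuilt by the EC math alone, with no decompress/recompress.
  if (c.size() >= data.size() - data.size() / 10)
    return {false, data};
  return {true, std::move(c)};
}

int main() {
  Buffer data(4096, 0);  // highly compressible demo payload (all zeros)
  // Stand-in compressor: trivial run-length coding, just for the demo.
  auto rle = [](const Buffer& in) {
    Buffer out;
    for (size_t i = 0; i < in.size();) {
      size_t j = i;
      while (j < in.size() && in[j] == in[i] && j - i < 255) ++j;
      out.push_back(uint8_t(j - i));
      out.push_back(in[i]);
      i = j;
    }
    return out;
  };
  StoredBlock b = prepare_append(data, rle);
  std::printf("compressed=%d stored=%zu of %zu bytes\n",
              int(b.compressed), b.payload.size(), data.size());
  // The EC layer would now split b.payload into k data + m coding shards.
  return 0;
}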
> The only thing that doesn't get us that I can see mentioned here
> is the over-the-wire compression — and Haomai already has patches for
> that, which should be a lot easier to validate and will work at all
> levels of the stack!
> -Greg
