Hi folks,
Here is a brief summary of the potential compression implementation options.
I think we should choose the desired approach before starting to work on
the compression feature.
Comments, additions and fixes are welcome.
Compression At Client - compression/decompression to be performed at the
client level (preferably Rados) before sending / after receiving data
to/from Ceph.
Pros:
* The Ceph cluster isn't loaded with an additional computational burden.
* All Ceph cluster components and data transfers benefit from the
reduced data volume.
* Compression is transparent to Ceph cluster components.
Cons:
* Weak clients may lack the CPU resources to handle their own traffic.
* Any read/write access requires at least two sequential requests to the
Ceph cluster to get data: the first one to retrieve the "original to
compressed" offset mapping for the desired data block, the second one to
get the compressed data block (a rough sketch of this read path is given
after this list).
* Random write access handling is tricky (see the notes below). Even
more requests to the cluster per single user request might be needed in
this case.
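
To make the read path above a bit more concrete, here is a minimal C++
sketch of what a client-side read of a compressed object could look like.
The helpers (fetch_offset_map, fetch_compressed_bytes, decompress) are
made-up placeholders for whatever real Rados operations and codec we would
end up using - they are not existing librados APIs.

// Hypothetical sketch only; none of the helpers below are real librados calls.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One entry of the "original to compressed" offset mapping.
struct CompressedExtent {
  uint64_t orig_off;   // offset in the uncompressed object
  uint64_t orig_len;   // uncompressed length of this block
  uint64_t comp_off;   // offset of the compressed block in storage
  uint64_t comp_len;   // compressed length (varies with compression ratio)
};

// Placeholder stubs standing in for requests to the cluster.
std::map<uint64_t, CompressedExtent> fetch_offset_map(const std::string&) {
  return {};   // request #1: read the per-object offset mapping
}
std::vector<char> fetch_compressed_bytes(const std::string&, uint64_t, uint64_t) {
  return {};   // request #2..N: read the compressed block(s)
}
std::vector<char> decompress(const std::vector<char>& in) {
  return in;   // stub: would run the real codec here
}

// Reading [off, off + len) takes at least two sequential round trips:
// first the mapping, then the compressed block(s) covering the range.
std::vector<char> read_range(const std::string& oid, uint64_t off, uint64_t len) {
  auto index = fetch_offset_map(oid);
  std::vector<char> out;
  for (const auto& kv : index) {
    const CompressedExtent& ext = kv.second;
    if (ext.orig_off + ext.orig_len <= off || ext.orig_off >= off + len)
      continue;   // extent doesn't intersect the requested range
    auto plain = decompress(fetch_compressed_bytes(oid, ext.comp_off, ext.comp_len));
    // Keep only the part of the decompressed block inside [off, off + len).
    uint64_t first = std::max(off, ext.orig_off) - ext.orig_off;
    uint64_t last  = std::min(off + len, ext.orig_off + ext.orig_len) - ext.orig_off;
    out.insert(out.end(),
               plain.begin() + static_cast<std::ptrdiff_t>(first),
               plain.begin() + static_cast<std::ptrdiff_t>(last));
  }
  return out;
}

Note that even a read fully covered by a single compressed block still
costs two sequential round trips: one for the mapping, one for the data.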
Compression At Replicated Pool - compression to be performed at the
primary Ceph entities at the replicated pool level, prior to data
replication.
Pros:
* Clients benefit from offloading compression to cluster CPU resources.
* Compression for a specific data block is performed at a single point
only - thus total CPU utilization for the Ceph cluster is lower.
* Underlying Ceph components and data transfers benefit from the reduced
data volume.
Cons:
* Clients that use EC pools directly lack compression unless it's
implemented there too.
* In a two-tier model, data compression at the cache tier may be
inappropriate for performance reasons. Compression at the cache tier also
prevents cache removal when/if needed.
* Random write access handling is tricky (see notes below).
Compression At Erasure Coded Pool - compression to be performed at the
primary Ceph entities at the EC pool level, prior to erasure coding.
Pros:
* Clients benefit from offloading compression to cluster CPU resources.
* Erasure coding "inflates" the processed data block (by up to ~50%).
Thus doing compression prior to that reduces CPU utilization (see the
rough arithmetic after this section).
* Natural combination with the EC machinery. Compression and EC have
similar purposes - saving storage space at the cost of CPU usage. One can
reuse EC infrastructure and design solutions.
* No need for random write access support - EC pools don't provide that
on their own, so we can reuse the same approach to resolve the issue when
needed. Implementation becomes much easier.
* Underlying Ceph components and data transfers benefit from
reduced data volume.
Cons:
* Limited applicability - clients that don’t use EC pools lack
compression.
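
For reference, here is the rough arithmetic behind the "up to ~50%"
inflation figure as a tiny sketch. The 4+2 EC profile and the object size
are example values only; the actual inflation is (k + m) / k for whatever
profile is in use.

#include <cstdio>

int main() {
  // Example EC profile and object size; real numbers depend on the profile.
  const double k = 4, m = 2;                       // 4 data + 2 coding chunks
  const double object_bytes = 4.0 * 1024 * 1024;   // 4 MiB object
  const double inflation = (k + m) / k;            // 1.5x for this profile

  // Compress before EC: the compressor only ever sees the original object.
  const double compressed_before_ec = object_bytes;
  // Compress after EC (i.e. further down the stack): the compressor has to
  // churn through the already inflated chunk set.
  const double compressed_after_ec = object_bytes * inflation;

  std::printf("bytes through the compressor, before EC: %.0f\n",
              compressed_before_ec);
  std::printf("bytes through the compressor, after EC : %.0f (+%.0f%%)\n",
              compressed_after_ec, (inflation - 1.0) * 100.0);
  return 0;
}

Compressing before erasure coding also means the EC encoder itself runs
over the (hopefully smaller) compressed data, which compounds the saving.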
Compression At Ceph FileStore entity - compression to be performed by the
Ceph FileStore component prior to saving object data to the underlying
file system.
Pros:
* Clients benefit from offloading compression to cluster CPU resources.
Cons:
* Random write access is tricky (see the notes below).
* From the cluster's perspective, compression is performed either on each
replicated block or on a block "inflated" by erasure coding. Thus total
Ceph cluster CPU utilization for compression becomes considerably higher
(roughly a threefold increase for 3x replicated pools and ~50% for EC
pools; a rough sketch of the replication arithmetic is given after this
list).
* No benefit from reduced data transfers over the network.
* A recovery procedure caused by an OSD going down triggers decompression
and recompression of the complete data set when an EC pool is used. This
might considerably increase CPU utilization for the recovery process.
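
As a back-of-the-envelope illustration of the threefold-increase point,
here is a tiny sketch comparing where the compression work lands. The 3x
replication factor and the object size are example values (the usual pool
default), not fixed assumptions.

#include <cstdio>

int main() {
  const double object_bytes = 4.0 * 1024 * 1024;   // 4 MiB object (example)
  const int replicas = 3;                          // typical replicated pool size

  // Pool-level compression (at the primary, before replication): the object
  // is compressed once and the replicas receive already-compressed data.
  const double pool_level = object_bytes;

  // FileStore-level compression: every OSD holding a replica compresses its
  // own full copy independently.
  const double filestore_level = object_bytes * replicas;

  std::printf("pool level     : %.0f bytes compressed per write\n", pool_level);
  std::printf("FileStore level: %.0f bytes compressed per write (x%d)\n",
              filestore_level, replicas);
  return 0;
}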
Compression Externally at File System - compression to be performed at
the FileStore node by means of the underlying file system.
Pros:
* Compression is (mostly) transparent to Ceph.
* Clients benefit from offloading compression to cluster CPU resources.
Cons:
* File system "lock-in" - only the BTRFS file system can be used for
now, and its production readiness is questionable.
* Limited flexibility - compression is a partition/mount-point property.
It is hard to get finer granularity - per-pool or per-object - and there
is no way to disable compression.
* From the cluster's perspective, compression is performed either on each
replicated block or on a block "inflated" by erasure coding. Thus total
Ceph cluster CPU utilization for compression becomes considerably higher
(roughly a threefold increase for 3x replicated pools and ~50% for EC
pools).
* No benefit from reduced data transfers over the network.
* A recovery procedure caused by an OSD going down triggers decompression
and recompression of the complete data set when an EC pool is used. This
might considerably increase CPU utilization for the recovery process.
Compression Externally at Block Device - compression to be performed at
the FileStore node by means of an underlying block device that supports
inline data compression.
Pros:
* Compression is transparent to Ceph.
* Clients benefit from offloading compression to cluster CPU resources.
Cons:
* A production-quality solution seems to be absent.
* Limited flexibility - compression is a partition/mount-point property.
It is hard to get finer granularity - per-pool or per-object - and there
is no way to disable compression.
* From the cluster's perspective, compression is performed either on each
replicated block or on a block "inflated" by erasure coding. Thus total
Ceph cluster CPU utilization for compression becomes considerably higher
(roughly a threefold increase for 3x replicated pools and ~50% for EC
pools).
* No benefit from reduced data transfers over the network.
* A recovery procedure caused by an OSD going down triggers decompression
and recompression of the complete data set when an EC pool is used. This
might considerably increase CPU utilization for the recovery process.
Notes:
Probably the most troublesome issue brought by introducing compression
is random write access handling. A brief overview of it is as follows:
The compressing entity processes the original data blocks for a specific
object and eventually saves a set of new compressed blocks to storage.
Since different blocks can have different compression ratios, the new
blocks vary in size. When a new write request from the client covers a
data range that overlaps existing data, the resulting compressed block
has to be saved somewhere. Again, due to a different compression ratio,
the new block may not fit into the space allocated for the previous one.
Moreover, if the new write request isn't aligned with the original one,
we might face the case where the previous block is invalidated only
partially. Thus the flat and sequential object data layout model doesn't
work any more.
Instead one needs to introduce a trickier scheme to store, access and
overwrite object content. One can find more details on both the issue
and a potential implementation approach here (sections I & II):
http://users.ics.forth.gr/~bilas/pdffiles/makatos-snapi10.pdf
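
To give a rough idea of what such a scheme involves, below is a very
small C++ sketch of a per-object index that maps logical (uncompressed)
ranges to variable-sized compressed extents, with newer writes shadowing
the partially invalidated older ones. All names and the layout are made
up for illustration - this isn't tied to any existing Ceph structure, and
appending the compressed data itself and garbage-collecting dead extents
are left out.

#include <cstdint>
#include <map>

struct Extent {
  uint64_t comp_off;   // where the compressed block lives in storage
  uint64_t comp_len;   // compressed size (varies with compression ratio)
  uint64_t orig_len;   // uncompressed size of the block
  uint64_t seq;        // write sequence number: higher wins on overlap
};

class ObjectIndex {
public:
  // Record a new compressed block covering [orig_off, orig_off + orig_len).
  // Older extents under that range are not rewritten in place (the new block
  // may well be larger); they are simply shadowed by the higher sequence
  // number until some background compaction rewrites them.
  void record_write(uint64_t orig_off, const Extent& e) {
    extents_.emplace(orig_off, e);
  }

  // Find the newest extent covering a given logical offset, if any.
  const Extent* lookup(uint64_t orig_off) const {
    const Extent* best = nullptr;
    for (const auto& kv : extents_) {
      const uint64_t off = kv.first;
      const Extent& e = kv.second;
      if (off <= orig_off && orig_off < off + e.orig_len &&
          (!best || e.seq > best->seq))
        best = &e;
    }
    return best;
  }

private:
  // Overlapping ranges from different writes coexist until compacted.
  std::multimap<uint64_t, Extent> extents_;
};

// Example: an initial 64 KiB block, then a 4 KiB overwrite in the middle.
int main() {
  ObjectIndex idx;
  idx.record_write(0,    {0, 30000, 65536, /*seq=*/1});     // original block
  idx.record_write(8192, {30000, 2500, 4096, /*seq=*/2});   // partial overwrite
  const Extent* e = idx.lookup(9000);   // resolved to the newer, smaller extent
  return (e && e->seq == 2) ? 0 : 1;
}

Something along these lines (plus compaction of the shadowed extents) is
roughly what any of the compression points above would have to maintain
once random overwrites are in the picture.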
Thanks,
Igor.