Hi folks,
Here is a brief summary of the potential compression implementation options.
I think we should choose the desired approach before starting to work on
the compression feature.
Comments, additions and fixes are welcome.
Compression At Client - compression/decompression to be performed at the
client level (preferably Rados) before sending / after receiving data
to/from Ceph.
Pros:
* The Ceph cluster isn't loaded with an additional computational burden.
* All Ceph cluster components and data transfers benefit from the
reduced data volume.
* Compression is transparent to Ceph cluster components.
Cons:
* Weak clients may lack the CPU resources to handle their own traffic.
* Any read/write access requires at least two sequential requests to the
Ceph cluster to get data: the first one to retrieve the "original to
compressed" offset mapping for the desired data block, the second one to
get the compressed data block (a rough sketch of this read path is given
after this list).
* Random write access handling is tricky (see the notes below). Even
more requests to the cluster per single user request might be needed in
this case.
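
To make the read path above a bit more concrete, here is a minimal C++
sketch of what a client-side read of a compressed object could look like.
The helpers (fetch_offset_map, fetch_compressed_bytes, decompress) are
made-up placeholders for whatever real Rados operations and codec we would
end up using - they are not existing librados APIs.

// Hypothetical sketch only; none of the helpers below are real librados calls.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One entry of the "original to compressed" offset mapping.
struct CompressedExtent {
  uint64_t orig_off;   // offset in the uncompressed object
  uint64_t orig_len;   // uncompressed length of this block
  uint64_t comp_off;   // offset of the compressed block in storage
  uint64_t comp_len;   // compressed length (varies with compression ratio)
};

// Placeholder stubs standing in for requests to the cluster.
std::map<uint64_t, CompressedExtent> fetch_offset_map(const std::string&) {
  return {};   // request #1: read the per-object offset mapping
}
std::vector<char> fetch_compressed_bytes(const std::string&, uint64_t, uint64_t) {
  return {};   // request #2..N: read the compressed block(s)
}
std::vector<char> decompress(const std::vector<char>& in) {
  return in;   // stub: would run the real codec here
}

// Reading [off, off + len) takes at least two sequential round trips:
// first the mapping, then the compressed block(s) covering the range.
std::vector<char> read_range(const std::string& oid, uint64_t off, uint64_t len) {
  auto index = fetch_offset_map(oid);
  std::vector<char> out;
  for (const auto& kv : index) {
    const CompressedExtent& ext = kv.second;
    if (ext.orig_off + ext.orig_len <= off || ext.orig_off >= off + len)
      continue;   // extent doesn't intersect the requested range
    auto plain = decompress(fetch_compressed_bytes(oid, ext.comp_off, ext.comp_len));
    // Keep only the part of the decompressed block inside [off, off + len).
    uint64_t first = std::max(off, ext.orig_off) - ext.orig_off;
    uint64_t last  = std::min(off + len, ext.orig_off + ext.orig_len) - ext.orig_off;
    out.insert(out.end(),
               plain.begin() + static_cast<std::ptrdiff_t>(first),
               plain.begin() + static_cast<std::ptrdiff_t>(last));
  }
  return out;
}

Note that even a read fully covered by a single compressed block still
costs two sequential round trips: one for the mapping, one for the data.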
Compression At Replicated Pool - compression to be performed at the
primary Ceph entities at the replicated pool level, prior to data
replication.
Pros:
* Clients benefit from offloading compression to cluster CPU resources.
* Compression for a specific data block is performed at a single point
only - thus total CPU utilization for the Ceph cluster is lower.
* Underlying Ceph components and data transfers benefit from the reduced
data volume.
Cons:
* Clients that use EC pools directly lack compression unless it's
implemented there too.
* In a two-tier model, data compression at the cache tier may be
inappropriate for performance reasons. Compression at the cache tier also
prevents cache removal when/if needed.
* Random write access handling is tricky (see notes below).
Compression At Erasure Coded Pool - compression to be performed at the
primary Ceph entities at the EC pool level, prior to erasure coding.
Pros:
* Clients benefit from offloading compression to cluster CPU resources.
* Erasure coding "inflates" the processed data block (by up to ~50%).
Thus doing compression prior to that reduces CPU utilization (see the
rough arithmetic after this section).
* Natural combination with the EC machinery. Compression and EC have
similar purposes - saving storage space at the cost of CPU usage. One can
reuse EC infrastructure and design solutions.
* No need for random write access support - EC pools don't provide that
on their own, so we can reuse the same approach to resolve the issue when
needed. Implementation becomes much easier.
* Underlying Ceph components and data transfers benefit from
reduced data volume.
Cons:
* Limited applicability - clients that don’t use EC pools lack
compression.
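
For reference, here is the rough arithmetic behind the "up to ~50%"
inflation figure as a tiny sketch. The 4+2 EC profile and the object size
are example values only; the actual inflation is (k + m) / k for whatever
profile is in use.

#include <cstdio>

int main() {
  // Example EC profile and object size; real numbers depend on the profile.
  const double k = 4, m = 2;                       // 4 data + 2 coding chunks
  const double object_bytes = 4.0 * 1024 * 1024;   // 4 MiB object
  const double inflation = (k + m) / k;            // 1.5x for this profile

  // Compress before EC: the compressor only ever sees the original object.
  const double compressed_before_ec = object_bytes;
  // Compress after EC (i.e. further down the stack): the compressor has to
  // churn through the already inflated chunk set.
  const double compressed_after_ec = object_bytes * inflation;

  std::printf("bytes through the compressor, before EC: %.0f\n",
              compressed_before_ec);
  std::printf("bytes through the compressor, after EC : %.0f (+%.0f%%)\n",
              compressed_after_ec, (inflation - 1.0) * 100.0);
  return 0;
}

Compressing before erasure coding also means the EC encoder itself runs
over the (hopefully smaller) compressed data, which compounds the saving.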
Compression At Ceph FileStore entity - compression to be performed by the
Ceph FileStore component prior to saving object data to the underlying
file system.
Pros:
* Clients benefit from offloading compression to cluster CPU resources.
Cons:
* Random write access is tricky (see the notes below).
* From the cluster's perspective, compression is performed either on each
replicated block or on a block "inflated" by erasure coding. Thus total
Ceph cluster CPU utilization for compression becomes considerably higher
(roughly a threefold increase for 3x replicated pools and ~50% for EC
pools; a rough sketch of the replication arithmetic is given after this
list).
* No benefit from reduced data transfers over the network.
* A recovery procedure caused by an OSD going down triggers decompression
and recompression of the complete data set when an EC pool is used. This
might considerably increase CPU utilization for the recovery process.
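
As a back-of-the-envelope illustration of the threefold-increase point,
here is a tiny sketch comparing where the compression work lands. The 3x
replication factor and the object size are example values (the usual pool
default), not fixed assumptions.

#include <cstdio>

int main() {
  const double object_bytes = 4.0 * 1024 * 1024;   // 4 MiB object (example)
  const int replicas = 3;                          // typical replicated pool size

  // Pool-level compression (at the primary, before replication): the object
  // is compressed once and the replicas receive already-compressed data.
  const double pool_level = object_bytes;

  // FileStore-level compression: every OSD holding a replica compresses its
  // own full copy independently.
  const double filestore_level = object_bytes * replicas;

  std::printf("pool level     : %.0f bytes compressed per write\n", pool_level);
  std::printf("FileStore level: %.0f bytes compressed per write (x%d)\n",
              filestore_level, replicas);
  return 0;
}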
Compression Externally at File System - compression to be performed at
the FileStore node by means of the underlying file system.
Pros:
* Compression is (mostly) transparent to Ceph.
* Clients benefit from offloading compression to cluster CPU resources.
Cons:
* File system "lock-in" - only the BTRFS file system can be used for
now, and its production readiness is questionable.
* Limited flexibility - compression is a partition/mount-point property.
It is hard to get finer granularity - per-pool or per-object - and there
is no way to disable compression.
* From the cluster's perspective, compression is performed either on each
replicated block or on a block "inflated" by erasure coding. Thus total
Ceph cluster CPU utilization for compression becomes considerably higher
(roughly a threefold increase for 3x replicated pools and ~50% for EC
pools).
* No benefit from reduced data transfers over the network.
* A recovery procedure caused by an OSD going down triggers decompression
and recompression of the complete data set when an EC pool is used. This
might considerably increase CPU utilization for the recovery process.
Compression Externally at Block Device - compression to be performed at
the FileStore node by means of an underlying block device that supports
inline data compression.
Pros:
* Compression is transparent to Ceph.
* Clients benefit from offloading compression to cluster CPU resources.
Cons:
* A production-quality solution seems to be absent.
* Limited flexibility - compression is a partition/mount-point property.
It is hard to get finer granularity - per-pool or per-object - and there
is no way to disable compression.
* From the cluster's perspective, compression is performed either on each
replicated block or on a block "inflated" by erasure coding. Thus total
Ceph cluster CPU utilization for compression becomes considerably higher
(roughly a threefold increase for 3x replicated pools and ~50% for EC
pools).
* No benefit from reduced data transfers over the network.
* A recovery procedure caused by an OSD going down triggers decompression
and recompression of the complete data set when an EC pool is used. This
might considerably increase CPU utilization for the recovery process.
Notes:
Probably the most troublesome issue brought by introducing compression
is random write access handling. A brief overview of it is as follows:
The compressing entity processes the original data blocks for a specific
object and eventually saves a set of new compressed blocks to storage.
Since different blocks can have different compression ratios, the new
blocks vary in size. When a new write request from the client covers a
data range that overlaps existing data, the resulting compressed block
has to be saved somewhere. Again, due to a different compression ratio,
the new block may not fit into the space allocated for the previous one.
Moreover, if the new write request isn't aligned with the original one,
we might face the case where the previous block is invalidated only
partially. Thus the flat and sequential object data layout model doesn't
work any more.
Instead one needs to introduce a trickier scheme to store, access and
overwrite object content. One can find more details on both the issue
and a potential implementation approach here (sections I & II):
http://users.ics.forth.gr/~bilas/pdffiles/makatos-snapi10.pdf
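
To give a rough idea of what such a scheme involves, below is a very
small C++ sketch of a per-object index that maps logical (uncompressed)
ranges to variable-sized compressed extents, with newer writes shadowing
the partially invalidated older ones. All names and the layout are made
up for illustration - this isn't tied to any existing Ceph structure, and
appending the compressed data itself and garbage-collecting dead extents
are left out.

#include <cstdint>
#include <map>

struct Extent {
  uint64_t comp_off;   // where the compressed block lives in storage
  uint64_t comp_len;   // compressed size (varies with compression ratio)
  uint64_t orig_len;   // uncompressed size of the block
  uint64_t seq;        // write sequence number: higher wins on overlap
};

class ObjectIndex {
public:
  // Record a new compressed block covering [orig_off, orig_off + orig_len).
  // Older extents under that range are not rewritten in place (the new block
  // may well be larger); they are simply shadowed by the higher sequence
  // number until some background compaction rewrites them.
  void record_write(uint64_t orig_off, const Extent& e) {
    extents_.emplace(orig_off, e);
  }

  // Find the newest extent covering a given logical offset, if any.
  const Extent* lookup(uint64_t orig_off) const {
    const Extent* best = nullptr;
    for (const auto& kv : extents_) {
      const uint64_t off = kv.first;
      const Extent& e = kv.second;
      if (off <= orig_off && orig_off < off + e.orig_len &&
          (!best || e.seq > best->seq))
        best = &e;
    }
    return best;
  }

private:
  // Overlapping ranges from different writes coexist until compacted.
  std::multimap<uint64_t, Extent> extents_;
};

// Example: an initial 64 KiB block, then a 4 KiB overwrite in the middle.
int main() {
  ObjectIndex idx;
  idx.record_write(0,    {0, 30000, 65536, /*seq=*/1});     // original block
  idx.record_write(8192, {30000, 2500, 4096, /*seq=*/2});   // partial overwrite
  const Extent* e = idx.lookup(9000);   // resolved to the newer, smaller extent
  return (e && e->seq == 2) ? 0 : 1;
}

Something along these lines (plus compaction of the shadowed extents) is
roughly what any of the compression points above would have to maintain
once random overwrites are in the picture.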
Thanks,
Igor.