On Wed, 23 Sep 2015, Igor Fedotov wrote:
> Hi Sage,
> thanks a lot for your feedback.
>
> Regarding the issues with offset mapping and stripe size exposure:
> what about the idea of applying compression in the two-tier
> (cache + backing storage) model only?

I'm not sure we win anything by making it a two-tier-only thing... simply
making it a feature of the EC pool means we can also address EC pool users
like radosgw.

> I doubt the single-tier setup is widely used for EC pools since there is
> no random write support in that mode, so this might be an acceptable
> limitation.
> At the same time it seems that appends caused by cached object flushes
> have a fixed block size (8 MB by default), and the object is completely
> rewritten on the next flush, if any. This makes offset mapping less
> tricky.
> Decompression should be applied in any model though, as cache tier
> shutdown and subsequent access to the compressed data is probably a
> valid use case.

Yeah, we need to handle random reads either way, so I think the offset
mapping is going to be needed anyway. And I don't think there is any real
difference from the EC pool's perspective between a direct user like
radosgw and the cache tier writing objects--in both cases it's doing
appends and deletes.

sage
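To make that offset mapping concrete, here is a minimal sketch of the kind
of per-object metadata it implies -- hypothetical names only, not actual
Ceph code: each compressed append records which stored extent covers which
logical range, and a random read first locates the covering extent, then
decompresses it and copies out the requested bytes.

  // Illustrative sketch only -- not Ceph code; all names are hypothetical.
  // Maps logical (uncompressed) object offsets to compressed extents.
  #include <cstdint>
  #include <map>

  struct CompressedExtent {
    uint64_t stored_offset;   // where the compressed blob lives on disk
    uint32_t stored_length;   // compressed length
    uint32_t logical_length;  // uncompressed length this blob covers
    uint8_t  algorithm;       // 0 = none (stored raw), 1 = zlib, 2 = snappy
  };

  // Keyed by logical offset; each append adds one entry.
  using ExtentMap = std::map<uint64_t, CompressedExtent>;

  // Find the extent covering a logical offset (nullptr if none, e.g. a hole).
  const CompressedExtent* find_extent(const ExtentMap& m,
                                      uint64_t logical_off) {
    auto it = m.upper_bound(logical_off);
    if (it == m.begin())
      return nullptr;
    --it;  // last entry starting at or before logical_off
    uint64_t end = it->first + it->second.logical_length;
    return logical_off < end ? &it->second : nullptr;
  }

Such a map could presumably live in the object's xattrs or omap; a read of
[off, off+len) would walk the covering extents in order.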
> Thanks,
> Igor
>
> On 22.09.2015 22:11, Sage Weil wrote:
> > On Tue, 22 Sep 2015, Igor Fedotov wrote:
> > > Hi guys,
> > >
> > > There has been some talk about adding compression support to Ceph;
> > > let me share some thoughts and proposals on that too.
> > >
> > > First of all I'd like to consider several major implementation
> > > options separately. IMHO this makes sense since they have different
> > > applicability, value and implementation specifics. Besides, smaller
> > > pieces are easier both to understand and to implement.
> > >
> > > * Data-At-Rest Compression. This is about compressing the bulk of the
> > > data kept by the Ceph backing tier. The main reason for this is to
> > > reduce data storage costs. A similar approach was introduced by the
> > > Erasure Coding pool implementation - cluster capacity increases (i.e.
> > > storage cost goes down) at the expense of additional computation.
> > > This is especially effective when combined with a high-performance
> > > cache tier.
> > > * Intermediate Data Compression. This case is about applying
> > > compression to intermediate data like system journals, caches, etc.
> > > The intention is to improve the utilization of expensive storage
> > > resources (e.g. solid state drives or RAM). At the same time, the
> > > idea of applying compression (a feature that undoubtedly introduces
> > > additional overhead) to crucial heavy-duty components probably looks
> > > contradictory.
> > > * Exchange Data Compression. This one would be applied to messages
> > > transported between clients and storage cluster components as well as
> > > to internal cluster traffic. The rationale might be the desire to
> > > improve cluster run-time characteristics, e.g. when data bandwidth is
> > > limited by network or storage device throughput. The potential
> > > drawback is overburdening the client - client computation resources
> > > might become a bottleneck since they take on most of the
> > > compression/decompression work.
> > >
> > > Obviously it would be great to have support for all of the above
> > > cases, e.g. object compression takes place at the client and cluster
> > > components handle that naturally during the object life-cycle.
> > > Unfortunately, significant complexities arise along this path. Most
> > > of them are related to partial object access, both reading and
> > > writing. It looks like huge development (redesign, refactoring and
> > > new code) and testing efforts would be required, and it's hard to
> > > estimate the value of such aggregated support at the moment.
> > > Thus the approach I'm suggesting is to make progress incrementally
> > > and consider the cases separately. For now my proposal is to add
> > > Data-At-Rest compression to erasure-coded pools, as the most definite
> > > case from both the implementation and the value points of view.
> > >
> > > How we can do that:
> > >
> > > The Ceph cluster architecture suggests a two-tier storage model for
> > > production use. A cache tier built on high-performance, expensive
> > > storage devices provides performance. A storage tier with low-cost,
> > > less-efficient devices provides cost-effectiveness and capacity. The
> > > cache tier is supposed to use ordinary data replication while the
> > > storage tier can use erasure coding (EC) to keep data effectively and
> > > reliably. EC provides lower storage costs at the same reliability
> > > compared to the replication approach, at the expense of additional
> > > computation. Thus Ceph already trades computation effort for
> > > capacity, and Data-At-Rest compression is about exactly the same
> > > trade-off. Moreover, one can tie EC and Data-At-Rest compression
> > > together to achieve even better storage efficiency.
> > >
> > > There are two possible ways of adding Data-At-Rest compression:
> > > * Use the data compression built into a file system underneath Ceph.
> > > * Add compression to the Ceph OSD.
> > >
> > > At first glance option 1 looks pretty attractive, but there are some
> > > drawbacks to this approach:
> > > * File system lock-in. BTRFS is the only file system supporting
> > > transparent compression among the ones recommended for Ceph usage.
> > > Moreover, AFAIK it's still not recommended for production use, see:
> > > http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
> > > * Limited flexibility - one can only use the compression methods and
> > > policies supported by the FS.
> > > * Data compression depends on volume or mount point properties (and
> > > is bound to the OSD). Without additional support Ceph lacks the
> > > ability to have different compression policies for different pools
> > > residing on the same OSD.
> > > * File compression control isn't standardized across file systems. If
> > > (or when) a new compression-equipped file system appears, Ceph might
> > > require corresponding changes to handle it properly.
> > >
> > > Having compression in the OSD eliminates these drawbacks.
> > > As mentioned above, the purpose of Data-At-Rest compression is much
> > > the same as that of erasure coding, and it looks quite easy to add
> > > compression support to EC pools. This way one can get even more
> > > storage space in exchange for higher CPU load. Additional pros for
> > > combining compression and erasure coding are:
> > > * Both EC and compression have complexities with partial writes. EC
> > > pools don't support partial writes (data append only) and the
> > > solution for that is inserting a cache tier, so we can transparently
> > > reuse the same approach for compression.
> > > * Compression becomes a pool property, so Ceph users have direct
> > > control over which pools compression is applied to.
> > > * Original write performance isn't impacted by compression in the
> > > two-tier model - write data goes to the cache uncompressed and there
> > > is no corresponding compression latency. The actual compression
> > > happens in the background when the backing storage is populated.
> > > * There is an additional benefit in network bandwidth savings when
> > > the primary OSD performs the compression, as the resulting object
> > > shards sent for replication are smaller.
> > > * Data-at-rest compression can also bring an additional performance
> > > improvement for HDD-based storage. Reducing the amount of data
> > > written to slow media can provide a net performance improvement even
> > > taking the compression overhead into account.

> > I think this approach makes a lot of sense. The tricky bit will be
> > storing the additional metadata that maps logical offsets to
> > compressed offsets.
> >
> > > Some implementation notes:
> > >
> > > The suggested approach is to perform data compression prior to
> > > erasure coding, to reduce the amount of data passed to coding and to
> > > avoid the need to introduce additional means to disable compression
> > > of EC-generated chunks.
> >
> > At first glance, the compress-before-ec approach sounds attractive: the
> > complex EC striping stuff doesn't need to change, and we just need to
> > map logical offsets to compressed offsets before doing the EC
> > read/reconstruct as we normally would. The problem is with appends:
> > the EC stripe size is exposed to the user and they write in those
> > increments. So if we compress before we pass it to EC, then we need to
> > have variable stripe sizes for each write (depending on how well it
> > compressed). The upshot here is that if we end up supporting variable
> > EC stripe sizes we *could* allow librados appends of any size (not
> > just the stripe size as we currently do). I'm not sure how
> > important/useful that is...
> >
> > On the other hand, ec-before-compression still means we need to map
> > coded stripe offsets to compressed offsets... and you're right that it
> > puts a bit more data through the EC transform.
> >
> > Either way, it will be a reasonably complex change.
> >
> > > Data-At-Rest compression should support a plugin architecture to
> > > enable multiple compression backends.
> >
> > Haomai has started some simple compression infrastructure to support
> > compression over the wire; see
> >
> > 	https://github.com/ceph/ceph/pull/5116
> >
> > We should reuse or extend the plugin interface there to cover both
> > users.
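For the sake of discussion, a compressor plugin interface along these
lines could cover both the over-the-wire user and the data-at-rest user.
This is only a rough sketch with made-up names - it is not the interface
from the pull request above, and a real implementation would presumably
operate on bufferlists rather than plain byte vectors:

  // Rough sketch of a pluggable compressor interface -- hypothetical
  // names, not the interface from the PR above. Backends (zlib, snappy,
  // ...) would be selected per pool and shared by both users.
  #include <cstdint>
  #include <string>
  #include <vector>

  class Compressor {
   public:
    virtual ~Compressor() {}
    virtual std::string name() const = 0;   // e.g. "zlib", "snappy"
    // Return 0 on success, < 0 on error. A backend may also report that
    // the result is not worth keeping so the caller can store the data raw.
    virtual int compress(const std::vector<uint8_t>& in,
                         std::vector<uint8_t>* out) = 0;
    virtual int decompress(const std::vector<uint8_t>& in,
                           std::vector<uint8_t>* out) = 0;
  };

Loading these through a plugin registry, much as the erasure code plugins
are loaded today, would keep the backend choice a per-pool configuration
detail.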
> > > The compression engine should mark stored objects with tags that
> > > indicate whether compression took place and which algorithm was
> > > used. To avoid (or reduce) backing-storage CPU overload caused by
> > > compression/decompression (e.g. this can happen during massive
> > > reads) we can introduce additional means to detect such situations
> > > and temporarily disable compression for the current write requests.
> > > Since there is a way to mark objects as compressed/uncompressed this
> > > causes almost no issues for later handling.
> > > Hardware compression support, e.g. Intel QuickAssist, can be an
> > > additional help with this issue.
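As a rough illustration of that tag-plus-bypass idea (hypothetical names
again, and the question of where the tag is persisted - xattr, omap, or
elsewhere - is left open):

  // Sketch only: a per-object compression tag plus a crude load-based
  // bypass; names are hypothetical.
  #include <atomic>
  #include <cstdint>

  enum class CompressionAlg : uint8_t { NONE = 0, ZLIB = 1, SNAPPY = 2 };

  // Persisted with the object so reads know whether/how to decompress,
  // and so writes made while compression was temporarily disabled still
  // read back correctly later.
  struct CompressionTag {
    CompressionAlg alg = CompressionAlg::NONE;
    uint64_t logical_length = 0;   // uncompressed size, for sanity checks
  };

  // Skip compression for new writes while measured CPU usage (updated
  // elsewhere, e.g. by a monitoring thread) is above a threshold. Mixing
  // compressed and raw objects is fine because every object carries a tag.
  struct CompressionGovernor {
    std::atomic<unsigned> cpu_percent{0};
    unsigned max_cpu_percent = 80;       // tunable

    bool should_compress() const {
      return cpu_percent.load() < max_cpu_percent;
    }
  };

A hardware offload such as QuickAssist would then simply be another
backend behind the same plugin interface, with the tag recording which one
was used.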
> > Great to see this moving forward!
> >
> > sage