Re: Request for Info: bluestore_compression_mode?

Hi Mark.

> I actually had no idea that you needed both the yaml option
> and the pool option configured

I guess you are referring to cephadm deployments, which I'm not using. In the Ceph config database, both options must be enabled, irrespective of how that happens (I separate the application Ceph from deployment systems, which may or may not have their own logic). There was a longer thread I started some years ago where someone posted a matrix of how the two mode settings interact and what the resulting mode is.

Our applications are CephFS data pools and RBD data pools, all EC pools. This places heavy requirements on the compression method in order not to kill IOPS performance completely. I don't know what your long-term goal here is: just simplifying some internals, or achieving better storage utilisation. However, something like compression of entire files will probably kill performance to such a degree that it becomes useless.

I am not sure you can get much better results out of such changes; it looks like you could spend a lot of time on it and gain little. Maybe I can instead draw your attention to a problem that might lead to much more valuable improvements for both pool performance and disk utilisation. This is going to be a bit longer, as I need to go into details. I hope you find the time to read on.

Ceph has a couple of problems with its EC implementation. One problem that I have never seen discussed so far is the inability of its data stores to perform tail merging. I have opened a feature request (https://tracker.ceph.com/issues/56949) that describes the symptom and only requests a way to account for the excess usage. In the last sentence I mention that tail merging would be the real deal.

The example given there shows how extreme the problem can become. Here is the situation as of today, while my benchmark is running:

status usage: Filesystem                                 Size  Used Avail Use% Mounted on
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/data  2.0T  276G  1.8T  14% /mnt/cephfs
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/      2.5T  2.1T  419G  84% /mnt/adm/cephfs

Only /data contains any data. The first line shows ceph.dir.rbytes=276G, while the second line shows the pool usage of 2.1T. The discrepancy caused by small files is more than a factor of 7. Compression is enabled, but it won't gain much here because most files are below compression_min_blob_size.
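
To put a number on where such a discrepancy can come from, here is a back-of-the-envelope sketch in Python. The EC profile, min_alloc_size and file size in it are assumed values for illustration only, not measurements from this cluster:

# Small-file overhead on an EC pool: every shard is rounded up to one allocation unit.
# All values below are assumptions for illustration.
k, m = 8, 2                      # assumed EC profile
min_alloc = 64 * 1024            # assumed bluestore_min_alloc_size (old HDD default)
file_size = 10 * 1024            # an assumed 10K file

ideal = file_size * (k + m) / k  # what EC coding alone would cost for this file
actual = (k + m) * min_alloc     # each of the k+m shards occupies at least one alloc unit
print(f"ideal EC usage : {ideal / 1024:.1f}K")
print(f"actual usage   : {actual / 1024:.0f}K (factor {actual / ideal:.0f} overhead)")

The exact factor depends on the file size distribution, but for anything well below the stripe width the on-disk usage is dominated entirely by the allocation unit.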

I know that the "solution" chosen for this (and for the EC overwrite amplification problem) was bluestore_min_alloc_size=4K for all device types, which comes with its own problems due to the huge RocksDB required and was therefore postponed in Octopus. I wonder how this will work on our 18TB hard drives. I am personally not convinced that this is a path to success and, while it reduces the impact of not having tail merging, it does not really remove the need for it. Even with a k=8 EC profile, 8*4K=32K is quite a large unit of atomicity, and on geo-replicated EC pools even larger values of k are the standard.
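
For reference, the same arithmetic for a few profile sizes (the k values are assumed; the point is only that the unit of atomicity grows linearly with k even at a 4K min_alloc_size):

# Smallest overwrite granularity of an EC pool: one allocation unit per data shard.
min_alloc = 4 * 1024                 # the proposed 4K min_alloc_size
for k in (4, 8, 16, 20):             # assumed profiles; larger k e.g. on geo-replicated pools
    print(f"k={k:2d}: unit of atomicity = {k * min_alloc // 1024}K")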

Are there any discussions and/or ideas on how to address this in different ways?

There was also a discussion about de-duplication. Is there any news in this direction?

The following is speculative, based on incomplete knowledge:

An idea I would consider worthwhile is to separate physical disk allocation from logical object allocation and to use full read-modify-write (copy-on-write) cycles for EC overwrites. The blob allocation size on disk should be tailored to the optimal IO size of the device, which is often larger than 4K or even 64K. All PG operations such as rebalance and recovery should always operate on entire blobs, regardless of what is stored in them.

Object allocation then happens on a second layer where, whenever possible, entire blobs are allocated (just like today, one address per blob). For small objects (or tails), however, a dedicated set of such blobs should allow addressing smaller chunks: such blobs would offer 2, 4, 8, ... equal-sized chunks for object allocation. This would require organising blobs by the level of sub-addressing they offer, possibly guided by PG stats about actual allocation sizes.
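
To make this a bit more concrete, here is a minimal sketch in Python of what such a two-layer allocator could look like. All names, sizes and the chunking policy are hypothetical illustrations of the concept, not anything taken from the actual bluestore code:

BLOB_SIZE = 1 << 20  # assumed 1M physical blob, tailored to the device's optimal IO size

class Blob:
    def __init__(self, blob_id, chunks):
        self.blob_id = blob_id
        self.chunk_size = BLOB_SIZE // chunks    # 1, 2, 4, 8, ... equal-sized chunks
        self.free = list(range(chunks))          # free chunk indices

    def alloc_chunk(self):
        return None if not self.free else (self.blob_id, self.free.pop(0), self.chunk_size)

class TwoLayerAllocator:
    def __init__(self):
        self.next_id = 0
        self.sub_blobs = {}   # chunks-per-blob -> blobs reserved for small allocations

    def _new_blob(self, chunks):
        blob = Blob(self.next_id, chunks)
        self.next_id += 1
        return blob

    def alloc(self, size):
        """Return (blob_id, chunk_index, chunk_size) for one logical allocation."""
        if size > BLOB_SIZE // 2:
            # large object (or the middle of one): whole blobs, one address per blob
            return self._new_blob(1).alloc_chunk()
        # small object or tail: smallest power-of-two chunk that still fits
        chunks = 1 << ((BLOB_SIZE // max(size, 1)).bit_length() - 1)
        pool = self.sub_blobs.setdefault(chunks, [])
        if not pool or not pool[-1].free:
            pool.append(self._new_blob(chunks))
        return pool[-1].alloc_chunk()

alloc = TwoLayerAllocator()
print(alloc.alloc(900 * 1024))   # large object -> its own whole blob
print(alloc.alloc(10 * 1024))    # small tail   -> a 16K sub-chunk of a shared blob
print(alloc.alloc(10 * 1024))    # next tail    -> the next chunk of the same blob

The design choice illustrated here is that the physical layer never needs to know what is stored inside a blob; only the logical layer tracks sub-chunk occupancy.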

What do we gain with this? Rebalance and recovery would profit dramatically. They would no longer operate on the object level, which is a huge pain with many small objects; instead, they would operate on the blob level, that is, on big chunks. All objects (shards) in a single blob would be moved/recovered in a single operation instead of object by object. This is particularly important for pure meta-data objects, for example the backtrace objects created by CephFS on the primary data pool. I found on our cluster that the bottleneck for recovery is small objects (even on SSD!), not the amount of data. Removing this bottleneck would be a huge improvement.

I guess that this idea goes to the fundamentals of Ceph, namely that RADOS operates on the object level, which means that splitting data storage into two logical layers instead of one would require a fundamental change to RADOS and cannot be done on the OSD level alone (the objects that RADOS sees may no longer be the same objects that a user sees). This change might be so fundamental that there is no upgrade path without data migration.

On the other hand, a development dead end requiring such an upgrade path is likely to come anyway.

I hope this was not a total waste of time.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 09 August 2022 16:56:19
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Request for Info: bluestore_compression_mode?

Hi Frank,


Thank you very much for the reply!  If you don't mind me asking, what's
the use case?  We're trying to determine if we might be able to do
compression at a higher level than blob with the eventual goal of
simplifying the underlying data structures.  I actually had no idea that
you needed both the yaml option and the pool option configured (I
figured the pool option just overrode the yaml).  That's definitely
confusing!


Not sure what the right path is here or if we should even make any
significant changes at this point, but we figured that the first step
was to figure out if people are using it and how.


Mark


On 8/9/22 04:11, Frank Schilder wrote:
> Hi Mark,
>
> we are using per-pool aggressive compression mode on any EC data pool. We need it per pool as we also have un-compressed replicated meta data pools sharing the same OSDs. Currently, one needs to enable both for data compression, the bluestore option to enable compression on an OSD and the pool option to enable compression for a pool. Only when both options are active simultaneously is data actually compressed, which led to quite a bit of confusion in the past. I think per-pool compression should be sufficient and imply compression without further tweaks on the OSD side. I don't know what the objective with per-OSD bluestore compression was. We just enabled bluestore compression globally, since the pool option selects the data for compression and it's the logical way to select and enforce compression (per data type).
>
> Just an enable/disable setting for pools would be sufficient (enabled=aggressive, and always treat bluestore_compression=aggressive implicitly). On the bluestore side the usual compression_blob_size/algorithm options will probably remain necessary, although one might better set them via a mask as in "ceph config set osd/class:hdd compression_min_blob_size XYZ" or better allow combination of masks as in "ceph config set osd/class:hdd,store:blue compression_min_blob_size XYZ" to prepare the config interface for future data stores.
>
> I don't think the compression mode "passive" makes much sense, as I have never heard of client software providing a meaningful hint. I think it's better treated as an administrator's choice after testing performance, and then enabled should simply mean "always compress" and disabled "never compress".
>
> I believe currently there is an interdependence with min_alloc_size on the OSD data store, which makes tuning a bit of a pain. It would be great if physical allocation parameters and logical allocation sizes could be decoupled somewhat. If they need to be coupled, then at least make it possible to read important creation-time settings at run-time. At the moment it is necessary to restart an OSD and grep the log to find the min_alloc_size of an OSD that is actually used by the OSD. Also, with upgraded clusters it is more likely to have OSDs with different min_alloc_sizes in a pool, so it would be great if settings like this one have no/not so much influence on whether or not compression works as expected.
>
> Summary:
>
> - pool enable/disable flag for always/never compress
> - data store flags for compression performance tuning
> - make OSD create- and tune parameters as orthogonal as possible
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Mark Nelson <mnelson@xxxxxxxxxx>
> Sent: 08 August 2022 20:30:49
> To: ceph-users@xxxxxxx
> Subject:  Request for Info: bluestore_compression_mode?
>
> Hi Folks,
>
>
> We are trying to get a sense for how many people are using
> bluestore_compression_mode or the per-pool compression_mode options
> (these were introduced early in bluestore's life, but afaik may not
> widely be used).  We might be able to reduce complexity in bluestore's
> blob code if we could do compression in some other fashion, so we are
> trying to get a sense of whether or not it's something worth looking
> into more.
>
>
> Thanks,
>
> Mark
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



