Re: Request for Info: bluestore_compression_mode?

On 8/10/22 10:08, Frank Schilder wrote:
Hi Mark.

I actually had no idea that you needed both the yaml option
and the pool option configured
I guess you are referring to cephadm deployments, which I'm not using. In the ceph config database, both options must be enabled irrespective of how that happens (I separate the application ceph from deployment systems, which may or may not have their own logic). There was a longer thread I started some years ago where someone posted a matrix of how the two mode settings interact and what the resulting mode is.

I might have misunderstood what you were saying.  I was in fact referring to the yaml config and pool options.  I was under the impression that the pool setting overrode whatever was in the yaml and you didn't need to sort of chain them to be enabled in both places.  Am I mistaken?
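
For reference, the two knobs I have in mind are roughly the following (written from memory, so please correct me if I have the names wrong):

    ceph config set osd bluestore_compression_mode aggressive    # OSD/yaml side
    ceph osd pool set <pool> compression_mode aggressive         # pool side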


Our applications are ceph fs data pools and rbd data pools, all EC pools. This places heavy requirements on the compression method in order not to kill IOPS performance completely. I don't know whether your long-term goal with this is just to simplify some internals or to achieve better storage utilisation. However, something like compressing entire files will probably kill performance to such a degree that it becomes useless.

Mostly the conversation started out with how we could clean up some of the internals to be less complex.  That led to a discussion of how compression was implemented along with blobs early in bluestore's life as part of a big write-path overhaul, which in turn led to questions about whether people actually use it and whether it's useful out in the field, hence this conversation. :)  The general thought was to ask whether we might be able to handle compression more cleanly higher up in the stack (with the trade-off of losing blob-level granularity), but we want to be very careful since, as you say below, it could be a lot of effort for little gain other than making bluestore simpler (which is a gain, to be sure).


I am not sure you can get much better results out of such changes; it looks like you could spend a lot of time on it and gain little. Maybe I can draw your attention to a problem that might lead to much more valuable improvements for both pool performance and disk utilisation. This is going to be a bit longer, as I need to go into details. I hope you find the time to read on.

Ceph has a couple of problems with its EC implementation. One problem that I have never seen discussed so far is the inability of its data stores to perform tail merging. I have opened a feature request (https://tracker.ceph.com/issues/56949) that describes the symptom and only requests a way to account for the excess usage. In the last sentence I mention that tail merging would be the real deal.

The example given there shows how extreme the problem can materialise. Here is the situation as of today while running my benchmark:

status usage: Filesystem                                 Size  Used Avail Use% Mounted on
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/data  2.0T  276G  1.8T  14% /mnt/cephfs
status usage: 10.41.24.13,10.41.24.14,10.41.24.15:/      2.5T  2.1T  419G  84% /mnt/adm/cephfs

Only /data contains any data. The first line shows ceph.dir.rbytes=276G while the second line shows the pool usage of 2.1T. The discrepancy due to small files is more than a factor of 7. Compression is enabled, but you won't gain much here because most files are below compression_min_blob_size.

I know that the "solution" to this (and to the EC overwrite amplification problem) was chosen to be bluestore_min_alloc_size=4K for all types of devices, which comes with its own problems due to the huge RocksDB required and was therefore postponed in octopus. I wonder how this will work on our 18TB hard drives. I personally am not convinced that this is a path to success and, while it reduces the problem of not having tail merging, it does not really remove the need for it. Even with a k=8 EC profile, 4*8=32K is quite a large unit of atomicity. On geo-replicated EC pools even larger values of k are the standard.

Yep, 4K is way better than what we had before with the 64K min_alloc_size, but the seldom talked about reality is that if you primarily have small (say <8-16K) objects, you might want to look at whether or not you are actually gaining anything with EC vs replication with the current implementation.
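
To put a rough number on it (back-of-the-envelope, with illustrative values and ignoring stripe-unit details, padding, and zero-chunk handling): a 4K file on a k=8, m=2 EC pool is stored as 10 shards, and each shard's allocation is rounded up to min_alloc_size. With the old 64K min_alloc_size that's at least 10 * 64K = 640K of raw space for 4K of data; even at 4K min_alloc_size it's still 10 * 4K = 40K raw, i.e. roughly 10x instead of the nominal 1.25x overhead of the profile, while 3x replication of the same file would cost only about 3 * 4K = 12K raw.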


Are there any discussions and/or ideas on how to address this, possibly in different ways?

There was also a discussion about de-duplication. Is there any news in this direction?

I haven't seen a lot of movement specifically on the EC and de-dup fronts, but it's possible that someone is working on them and I'm not in the loop.  Going to punt on these.



The following is speculative, based on incomplete knowledge:

An idea I would consider worthwhile is a separation of physical disk allocation from logical object allocation and using full read-modify-copy-on-write cycles for EC overwrites. The blob allocation size on disk should be tailored to accommodate the optimal IO size of the device, which is often larger than 4K or even 64K. All PG operations like rebalance and recovery should always operate on entire blobs regardless of what is stored in them.

Object allocation then happens on a second layer where, whenever possible, entire blobs are allocated (just like today, one address per blob). However, for small objects (or tails) a dedicated set of such blobs should allow addressing smaller chunks. Such blobs should offer 2, 4, 8, ... equal-sized chunks for object allocation. This would require organising blobs into levels according to the sub-address granularity they offer, possibly guided by PG stats about actual allocation sizes. A very rough sketch of what I mean follows below.
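
To make the idea a bit more concrete, here is a very rough sketch in Python-ish pseudo-code (all names, sizes and the cap at 8 chunks are made up for illustration; persistence, shared blobs and crash safety are ignored entirely):

BLOB_SIZE = 1024 * 1024   # physical allocation unit, tuned to the device's optimal IO size

class Blob:
    def __init__(self, blob_id, chunks):
        self.blob_id = blob_id          # one physical address per blob
        self.chunks = chunks            # number of equal-sized sub-chunks (power of two)
        self.free = set(range(chunks))  # free sub-chunk indices

    def chunk_size(self):
        return BLOB_SIZE // self.chunks

class TwoLayerAllocator:
    def __init__(self):
        self.next_id = 0
        self.partial = {}   # blobs with free sub-chunks, keyed by their chunk count

    def allocate(self, nbytes):
        """Return (blob, chunk_index) for an object or tail of nbytes."""
        # pick the finest power-of-two split whose chunks still fit nbytes
        chunks = 1
        while chunks < 8 and BLOB_SIZE // (chunks * 2) >= nbytes:
            chunks *= 2
        pool = self.partial.setdefault(chunks, [])
        if not pool:
            pool.append(Blob(self.next_id, chunks))
            self.next_id += 1
        blob = pool[-1]
        idx = blob.free.pop()
        if not blob.free:
            pool.pop()          # blob is now full, drop it from the partial list
        return blob, idx

alloc = TwoLayerAllocator()
blob, idx = alloc.allocate(4096)      # small tail -> 1/8th of a blob
print(blob.blob_id, blob.chunks, idx)
blob, idx = alloc.allocate(900000)    # large object -> a whole blob
print(blob.blob_id, blob.chunks, idx)

Rebalance and recovery would then always move whole blobs (BLOB_SIZE at a time) and never care how many small objects live inside one.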

It sounds kind of heavy to me in some ways.  Maybe not for disks though where contiguous disk access is the biggest concern.  Would you still have the current extent model sitting underneath these? How would shared blobs fit in with the sub-allocatable units? Does the freespace/allocation strategy change?



What do we gain with this? Rebalance and recovery will profit dramatically. They no longer operate on the object level, which is a huge pain for many small objects; instead they operate on the blob level, that is, on big chunks. All objects (shards) in a single blob will be moved/recovered in a single operation instead of object by object. This is particularly important for pure meta-data objects, for example the backtrace objects created by ceph fs on the primary data pool. I found on our cluster that the bottleneck for recovery is small objects (even on SSD!), not the amount of data. If this bottleneck could be removed, it would be a huge improvement.

Can you tell me a little bit more about what you are seeing with the cephfs backtrace objects?  Also, would you mind talking a bit more about what you saw with recovery performance on SSD?  Did increasing recovery parallelization help at all?  I don't want to get too sidetracked from the primary topic, but user feedback on this kind of stuff is always useful.
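
(By parallelization I mean the usual recovery knobs, roughly something like

    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 8

with values picked purely for illustration; I'm mostly curious whether pushing those moved the needle at all for you on SSDs.)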



I guess this idea goes to the fundamentals of ceph, namely that rados operates on the object level, which means that splitting data storage into two logical layers instead of one would require a fundamental change to rados and cannot be done at the OSD level alone (the objects that rados sees may no longer be the same objects that a user sees). This change might be so fundamental that there is no upgrade path without data migration.

On the other hand, a development dead end requiring such an upgrade path is likely to come anyway.

I hope this was not a total waste of time.

I don't think so.  It's interesting to hear what people are struggling with and ideas to make it better.  Rados level protocol changes (especially one like that) along with the implementation changes would require extremely heavy lifting though.  I suspect we'd shoot for easier performance wins first where we can get them.



Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 09 August 2022 16:56:19
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Request for Info: bluestore_compression_mode?

Hi Frank,


Thank you very much for the reply!  If you don't mind me asking, what's
the use case?  We're trying to determine if we might be able to do
compression at a higher level than blob with the eventual goal of
simplifying the underlying data structures.  I actually had no idea that
you needed both the yaml option and the pool option configured (I
figured the pool option just overrode the yaml).  That's definitely
confusing!


Not sure what the right path is here or if we should even make any
significant changes at this point, but we figured that the first step
was to figure out if people are using it and how.


Mark


On 8/9/22 04:11, Frank Schilder wrote:
Hi Mark,

we are using per-pool aggressive compression mode on all EC data pools. We need it per pool because we also have un-compressed replicated meta-data pools sharing the same OSDs. Currently, one needs to enable two things for data compression: the bluestore option to enable compression on an OSD, and the pool option to enable compression for a pool. Only when both options are active simultaneously is data actually compressed, which has led to quite a bit of confusion in the past. I think per-pool compression should be sufficient and should imply compression without further tweaks on the OSD side. I don't know what the objective of per-OSD bluestore compression was. We just enabled bluestore compression globally, since the pool option selects the data for compression and it's the logical way to select and enforce compression (per data type).

Just an enable/disable setting for pools would be sufficient (enabled=aggressive, and always treat bluestore_compression=aggressive implicitly). On the bluestore side the usual compression_blob_size/algorithm options will probably remain necessary, although it might be better to set them via a mask as in "ceph config set osd/class:hdd compression_min_blob_size XYZ", or, better yet, to allow a combination of masks as in "ceph config set osd/class:hdd,store:blue compression_min_blob_size XYZ" to prepare the config interface for future data stores.

I don't think the compression mode "passive" makes much sense, as I have never heard of client software providing a meaningful hint. I think it's better treated as an administrator's choice after testing performance; enabled should then simply mean "always compress" and disabled "never compress".

I believe there is currently an interdependence with min_alloc_size on the OSD data store, which makes tuning a bit of a pain. It would be great if physical allocation parameters and logical allocation sizes could be decoupled somewhat. If they need to be coupled, then at least make it possible to read important creation-time settings at run time. At the moment it is necessary to restart an OSD and grep the log to find the min_alloc_size the OSD is actually using. Also, upgraded clusters are quite likely to have OSDs with different min_alloc_sizes in the same pool, so it would be great if settings like this one had little or no influence on whether compression works as expected.
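
(For example, after restarting an OSD I end up with something like

    grep -i min_alloc_size /var/log/ceph/ceph-osd.12.log

to see the value that is actually in use; the OSD id and log path are just examples from our setup.)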

Summary:

- pool enable/disable flag for always/never compress
- data store flags for compression performance tuning
- make OSD creation-time and tuning parameters as orthogonal as possible

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 08 August 2022 20:30:49
To: ceph-users@xxxxxxx
Subject:  Request for Info: bluestore_compression_mode?

Hi Folks,


We are trying to get a sense for how many people are using
bluestore_compression_mode or the per-pool compression_mode options
(these were introduced early in bluestore's life, but afaik may not
widely be used).  We might be able to reduce complexity in bluestore's
blob code if we could do compression in some other fashion, so we are
trying to get a sense of whether or not it's something worth looking
into more.


Thanks,

Mark
