Hi Mark,

please find below a detailed report with data and observations from our production system. The ceph version is mimic-latest, so some of the ways of configuring compression or interpreting settings may have changed since then; as far as I know it is still pretty much the same.

The first thing I would like to mention is that you will not be able to get rid of OSD blob compression. It is the only way to preserve reasonable performance for applications like ceph fs and RBD. Application-level full-file or full-object compression would probably be acceptable for upload/download-only applications; for anything else it would degrade performance to unacceptable levels.

The second thing is that the current bluestore compression could be much more effective if the problem of small objects were addressed. This might happen at the application level, for example by implementing tail merging for ceph fs. It would bring dramatic improvements, because the largest amount of over-allocation does not come from uncompressed data but from the sheer number of small objects (even if they are compressed). I mentioned this in our earlier conversation and will include you in a new thread specifically about my observations in this direction.

As one of the indicators you asked for of ineffective compression and huge over-allocation due to small objects on ceph fs, please see the output of ceph df below. The pool con-fs2-meta2 is the primary data pool of an FS whose file system root is assigned to another data pool, con-fs2-data2 - the so-called 3-pool FS layout. As you can see, con-fs2-meta2 contains 50% of all FS objects, yet they are all of size 0. One could call them "perfectly compressed", but each of them still requires a min_alloc_size*replication_factor allocation on disk (in our case 16K*4 on the meta2 pool and 64K*11 on the data2 pool!). Together with the hundreds of millions of small files on the file system, each of which also requires such a minimum allocation, this adds up to a huge waste of raw capacity. I'm just lucky I don't have the con-fs2-meta2 objects in the main pool; it is also a huge pain for recovery. I'm pretty sure the same holds for RGW with small objects. The only application that does not have this problem is RBD with its fixed, uniform object size.

Application-level compression will not change this; it requires merging small objects into large ones. I consider this to be the major factor for excess raw usage at the moment, and improvements of a few percent from better compression will have only very small effects on a global scale. Looking at the stat numbers below from a real-life HPC system, you can estimate how much one could at best get out of more/better compression. For example, on our main bulk data pool, compressed allocated is only 13% of the total allocation. Even if compression could shrink this to size 0, the overall gain would be at most 13%. On the other hand, the actual compression rate of 2.9 is actually quite good. If *all* data were merged into blobs of a minimum size that allowed saving this amount of allocation by compression, one could improve usable capacity by a factor of about 2.5 (250%!) with the current implementation of compression. Consequently, my personal opinion is that it is not worth spending much time on better compression as long as the small-object min_alloc_size problem is not addressed first.
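To put rough numbers on the two claims above, here is the back-of-the-envelope arithmetic in python (a sketch only; the object count and pool settings are the ones from the report further down, and bluestore metadata overhead is ignored):

    # Raw capacity consumed by the size-0 objects in con-fs2-meta2 alone.
    # Numbers are taken from the ceph df output and pool settings below;
    # bluestore metadata overhead is ignored, so this is only an estimate.
    objects_meta2 = 408_055_310        # objects in con-fs2-meta2, all size 0
    min_alloc_ssd = 16 * 1024          # bluestore_min_alloc_size_ssd
    replicas      = 4                  # meta2 pool is replicated with size 4
    raw_lost = objects_meta2 * min_alloc_ssd * replicas
    print(f"raw allocation for size-0 objects: {raw_lost / 2**40:.1f} TiB")

    # Ceiling for what better compression alone could gain on the bulk pool:
    # only ~13% of the allocation on these OSDs is compressed data, so even
    # compressing that to nothing frees at most ~13%.
    allocated            = 5_433_725_812_736   # bluestore_allocated (HDD OSD below)
    compressed_allocated =   699_385_708_544   # bluestore_compressed_allocated
    print(f"compressed share of allocation: {compressed_allocated / allocated:.1%}")

So the size-0 objects alone pin roughly 24 TiB of raw capacity, while better compression on the bulk pool tops out at around 13%.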
A simplification of the interplay of the current parameters, and removal of the essentially redundant ones, could be interesting simply for making the configuration of compression easier. As you can see in this report, I also got it wrong. The current way is a bit too complicated.

Observations on our production cluster
======================================

It seems that the way compression is configured is a bit more complicated/messy than I thought. In our earlier conversation I gave a matrix using all combinations of the three bluestore- and pool-level compression_mode options: none, passive and aggressive. Apparently, the possibility "not set" adds yet another row and column to that table. I thought "not set" was equal to "none", but it isn't - with counter-intuitive results. Due to this, I have a pool compressed that I didn't want compressed. Well, I can fix that. There is a very strange observation with raw usage on this pool though, reported at the very end of the stats report below.

A second strange observation is that data is compressed on SSD OSDs even though bluestore_compression_min_blob_size_ssd is smaller than bluestore_min_alloc_size_ssd. According to the info I got, such a setting should result in no compression at all (well, no allocation savings), because the compressed blob size is always smaller than min_alloc_size and therefore still causes a full allocation of min_alloc_size. Yet, a tiny amount less allocated than stored is reported and I'm wondering what is happening here.

Pools
=====

POOLS:
    NAME                   ID  USED     %USED  MAX AVAIL  OBJECTS
    sr-rbd-meta-one         1  90 GiB    0.45  20 TiB         33332
    sr-rbd-data-one         2  71 TiB   55.52  57 TiB      25603605
    sr-rbd-one-stretch      3  222 GiB   1.09  20 TiB         68813
    con-rbd-meta-hpc-one    7  52 KiB    0     1.1 TiB           61
    con-rbd-data-hpc-one    8  36 GiB    0     4.9 PiB         9418
    sr-rbd-data-one-hdd    11  121 TiB  42.07  167 TiB     32346929
    con-fs2-meta1          12  463 MiB   0.05  854 GiB     40314740
    con-fs2-meta2          13  0 B       0     854 GiB    408055310
    con-fs2-data           14  1.1 PiB  17.79  4.9 PiB    407608732
    con-fs2-data-ec-ssd    17  274 GiB   9.10  2.7 TiB      3649114
    ms-rbd-one             18  378 GiB   1.85  20 TiB        159631
    con-fs2-data2          19  1.3 PiB  23.18  4.5 PiB    589024561
    sr-rbd-data-one-perf   20  3.1 TiB  51.88  2.9 TiB       806440

For the effect of compression, the ceph fs layout is important:

+---------------------+----------+-------+-------+
| Pool                | type     | used  | avail |
+---------------------+----------+-------+-------+
| con-fs2-meta1       | metadata |  492M |  853G |
| con-fs2-meta2       | data     |     0 |  853G |
| con-fs2-data        | data     | 1086T | 5021T |
| con-fs2-data-ec-ssd | data     |  273G | 2731G |
| con-fs2-data2       | data     | 1377T | 4565T |
+---------------------+----------+-------+-------+

We have both the fs metadata pool and the primary data pool on replicated pools on SSD. The data pool con-fs2-data2 is attached to the root of the file system. The data pool con-fs2-data used to be the root and is no longer attached to any fs path; we changed the bulk data pool from EC 8+2 (con-fs2-data) to 8+3 (con-fs2-data2), so con-fs2-data contains all "old" files on the 8+2 pool. The small pool con-fs2-data-ec-ssd is attached to an apps path with heavily accessed small files.

Whether or not the primary data pool of an FS is a separate data pool has a large influence on how effective compression can be: it holds a huge number of objects that will never be compressed due to their size=0. In general, due to the absence of tail merging, file systems with many small files will suffer from massive over-allocation as well as from many blobs being too small for compression.
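Since the min_alloc_size / compression_min_blob_size interplay keeps tripping me up (see the "strange observation" above), here is the mental model I am currently working with, written out as a small python sketch. This is my own simplified understanding, not the actual bluestore logic - the real code also applies compression_required_ratio and other per-blob checks - so treat it purely as an illustration:

    # My simplified mental model of when compression saves allocation.
    # NOT the real bluestore implementation - just an illustration of why
    # a compressed blob can still end up occupying a full min_alloc_size.

    def alloc_size(nbytes: int, min_alloc: int) -> int:
        """Round an on-disk extent up to a multiple of min_alloc_size."""
        return -(-nbytes // min_alloc) * min_alloc   # ceiling division

    def allocation_saved(raw_len: int, compressed_len: int, min_alloc: int) -> int:
        """Allocation saved by storing the compressed instead of the raw blob."""
        return alloc_size(raw_len, min_alloc) - alloc_size(compressed_len, min_alloc)

    # A 16K blob compressed to 6K on SSD (min_alloc_size_ssd = 16K): both round
    # up to 16K, so compression succeeds but saves nothing - my expectation.
    print(allocation_saved(16 * 1024, 6 * 1024, 16 * 1024))    # 0

    # A 32K blob compressed to 10K on the same OSD: 32K vs 16K allocation,
    # so 16K is saved - which would explain the small savings I do observe.
    print(allocation_saved(32 * 1024, 10 * 1024, 16 * 1024))   # 16384

If this model is roughly right, the tiny savings on the SSD OSDs would simply come from blobs larger than one allocation unit, not from the too-small min_blob_size somehow taking effect.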
Pools with compression enabled
==============================

Keys in the output below:
    n   : pool_name
    cm  : options.compression_mode
    sz  : size
    msz : min_size
    ec  : erasure_code_profile

EC RBD data pools
-----------------
{"n":"sr-rbd-data-one","cm":"aggressive","sz":8,"msz":6,"ec":"sr-ec-6-2-hdd"}
{"n":"con-rbd-data-hpc-one","cm":"aggressive","sz":10,"msz":9,"ec":"con-ec-8-2-hdd"}
{"n":"sr-rbd-data-one-hdd","cm":"aggressive","sz":8,"msz":7,"ec":"sr-ec-6-2-hdd"}

Replicated RBD pools
--------------------
{"n":"sr-rbd-one-stretch","cm":"aggressive","sz":3,"msz":2,"ec":""}
{"n":"ms-rbd-one","cm":"aggressive","sz":3,"msz":2,"ec":""}

EC FS data pools
----------------
{"n":"con-fs2-data","cm":"aggressive","sz":10,"msz":9,"ec":"con-ec-8-2-hdd"}
{"n":"con-fs2-data-ec-ssd","cm":"aggressive","sz":10,"msz":9,"ec":"con-ec-8-2-ssd"}
{"n":"con-fs2-data2","cm":"aggressive","sz":11,"msz":9,"ec":"con-ec-8-3-hdd"}

Relevant OSD settings
=====================

bluestore_compression_mode = aggressive
bluestore_compression_min_blob_size_hdd = 262144
bluestore_min_alloc_size_hdd = 65536
bluestore_compression_min_blob_size_ssd = 8192    *** Dang! ***
bluestore_min_alloc_size_ssd = 16384

Just noticed that I forgot to set bluestore_compression_min_blob_size_ssd to a value that is a multiple of bluestore_min_alloc_size_ssd. I wanted to use 65536; the expected result on SSD pools is now no compression at all :( There was a ticket on these defaults and they were set to useful values starting with nautilus. Will look into that at some point.

Some compression stats
======================

HDD OSDs in the FS bulk data pool(s)
------------------------------------

These are the most relevant for us, as they contain the bulk data. I picked stats from 2 hosts, 1 OSD each; they should be representative for all OSDs. The disks are 18TB with 160 and 153 PGs:

    "compress_success_count": 72044,
    "compress_rejected_count": 231282,
    "bluestore_allocated": 5433725812736,
    "bluestore_stored": 5788240735652,
    "bluestore_compressed": 483906706661,
    "bluestore_compressed_allocated": 699385708544,
    "bluestore_compressed_original": 1510040834048,
    "bluestore_extent_compress": 125924,

    "compress_success_count": 68618,
    "compress_rejected_count": 221980,
    "bluestore_allocated": 5101829226496,
    "bluestore_stored": 5427515391325,
    "bluestore_compressed": 451595891951,
    "bluestore_compressed_allocated": 652442533888,
    "bluestore_compressed_original": 1407811862528,
    "bluestore_extent_compress": 121594,

The success rate is not very high, almost certainly due to the many small files on the system. Some people are also using compressed data stores, which also leads to rejected blobs. Tail merging could probably improve on that a lot. Also, not having the FS backtrace objects on this pool prevents a lot of (near-)empty allocations.
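In case you want to check these counters on other OSDs: they show up in "ceph daemon osd.N perf dump", and a small python helper along these lines pulls out the relevant ones (a sketch only; it simply searches a saved dump for the counter names quoted above):

    # Sketch: extract the compression-related counters from a saved
    # "ceph daemon osd.N perf dump" and compute the allocation actually
    # saved by compression on that OSD.
    import json, sys

    wanted = ("compress_success_count", "compress_rejected_count",
              "bluestore_allocated", "bluestore_stored", "bluestore_compressed",
              "bluestore_compressed_allocated", "bluestore_compressed_original",
              "bluestore_extent_compress")

    with open(sys.argv[1]) as f:              # e.g. perf-dump.osd123.json
        perf = json.load(f)

    # Search all perf-counter sections so this does not depend on which
    # section of the dump the counters live in.
    counters = {k: v for section in perf.values() if isinstance(section, dict)
                for k, v in section.items() if k in wanted}

    for k in wanted:
        print(f"{k}: {counters.get(k)}")

    saved = (counters["bluestore_compressed_original"]
             - counters["bluestore_compressed_allocated"])
    print(f"allocation saved by compression: {saved / 2**30:.1f} GiB")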
SSD OSDs in the small SSD FS data pool
--------------------------------------

Hmm, contrary to expectation, compression actually seems to happen here. Maybe my interpretation of min_alloc_size and compression_min_blob_size is wrong? Might be another reason to simplify the compression parameters and explain better how they actually work.

Again OSDs picked from 2 hosts, 1 OSD each, 87 and 118 PGs:

    "compress_success_count": 374638,
    "compress_rejected_count": 19386,
    "bluestore_allocated": 6909100032,
    "bluestore_stored": 4087158731,
    "bluestore_compressed": 306412527,
    "bluestore_compressed_allocated": 468615168,
    "bluestore_compressed_original": 937230336,
    "bluestore_extent_compress": 952258,

    "compress_success_count": 387489,
    "compress_rejected_count": 21764,
    "bluestore_allocated": 11573510144,
    "bluestore_stored": 6847593045,
    "bluestore_compressed": 552832088,
    "bluestore_compressed_allocated": 844693504,
    "bluestore_compressed_original": 1689387008,
    "bluestore_extent_compress": 950922,

SSD OSDs in the RBD pool, rep and EC collocated on the same OSDs
----------------------------------------------------------------

OSDs picked from 2 hosts, 1 OSD each, 161 and 178 PGs:

    "compress_success_count": 38835730,
    "compress_rejected_count": 58506800,
    "bluestore_allocated": 1064052097024,
    "bluestore_stored": 1322947371131,
    "bluestore_compressed": 68165358846,
    "bluestore_compressed_allocated": 114503401472,
    "bluestore_compressed_original": 289775203840,
    "bluestore_extent_compress": 61265761,

    "compress_success_count": 76647709,
    "compress_rejected_count": 85273926,
    "bluestore_allocated": 1081196380160,
    "bluestore_stored": 1399985201821,
    "bluestore_compressed": 83058256649,
    "bluestore_compressed_allocated": 139784241152,
    "bluestore_compressed_original": 350485362688,
    "bluestore_extent_compress": 86168422,

SSD OSD in a rep RBD pool without compression
---------------------------------------------

This is a pool with accidentally enabled compression. It has SSDs in a special device class exclusively to itself, hence the collective OSD compression matches the pool data compression 1:1. The stats are:

    "compress_success_count": 41071482,
    "compress_rejected_count": 1895058,
    "bluestore_allocated": 171709562880,
    "bluestore_stored": 304405529094,
    "bluestore_compressed": 30506295169,
    "bluestore_compressed_allocated": 132702666752,
    "bluestore_compressed_original": 265405333504,
    "bluestore_extent_compress": 48908699,

The compression mode (cm) is unset (NULL):

{"n":"sr-rbd-data-one-perf","cm":null,"sz":3,"msz":2,"ec":""}

What is really strange here is that raw allocation matches the size of the uncompressed data times the replication factor. With the quite high resulting compression rate, raw used should be much smaller than that. What is going on here?
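As a quick sanity check, here is the plain arithmetic on the counters of that OSD (nothing assumed beyond the numbers above and the pool size of 3):

    # The puzzle on sr-rbd-data-one-perf in numbers (OSD counters above).
    allocated            = 171_709_562_880    # bluestore_allocated
    stored               = 304_405_529_094    # bluestore_stored
    compressed_allocated = 132_702_666_752    # bluestore_compressed_allocated
    compressed_original  = 265_405_333_504    # bluestore_compressed_original

    # On the OSD itself compression clearly saves space ...
    print(f"allocated / stored: {allocated / stored:.2f}")                  # ~0.56
    saved = compressed_original - compressed_allocated
    print(f"saved on this OSD : {saved / 2**30:.0f} GiB")

    # ... so with size=3 I would expect pool raw usage well below 3x stored,
    # yet raw usage is reported as if nothing were saved at all.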
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 12 August 2022 14:06
To: Mark Nelson; ceph-users@xxxxxxx
Subject: Re: Request for Info: bluestore_compression_mode?

Hi Mark,

ha ha ha, this is a brilliant misunderstanding :) I was under the impression that since mimic all ceph developers were instructed never to mention the ceph.conf file again and only ever talk about the ceph config data base instead. The only allowed options in a config file are the monitor addresses (well, a few more, but the idea is a minimal config file). And that's what my config file looks like.

OK, I think we do mean the same thing. There are currently 2 sets of compression options, the bluestore and the pool options. All have 3 values and, depending on which combination of values is active for a PG, a certain result becomes the final one. I believe the actual compression option applied to data is defined by a matrix like this:

bluestore opt | pool option
              |  n  |  p  |  a
      n       |  n  |  n  |  n
      p       |  n  |  p  |  p
      a       |  n  |  p  |  a

I think the bluestore option is redundant. I set these on OSD level:

bluestore_compression_min_blob_size_hdd 262144
bluestore_compression_mode aggressive

I honestly don't see any use of the bluestore options, and neither do I see any use case for mode=passive. Simplifying this matrix to a simple per-pool compression on/off flag, with an option to choose the algorithm per pool as well, seems a good idea and might even be a low-hanging fruit.

I need to collect some performance data from our OSDs for answering your questions about higher-level compression possibilities. I was a bit busy today with other stuff. I should have something for you next week.

Best regards and a nice weekend,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 11 August 2022 16:51:03
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: Request for Info: bluestore_compression_mode?

On 8/11/22 03:20, Frank Schilder wrote:
> Hi Mark,
>
> I'm preparing a response with some data from our production system and will also open a new thread on the tail merging topic. Both topics are quite large in themselves. Just a quick question for understanding:
>
>> I was in fact referring to the yaml config and pool options ...
> I don't know of a yaml file in ceph. Do you mean the ceph-adm spec file?

oh! We switched the conf parsing over to using yaml templates:

https://github.com/ceph/ceph/blob/main/src/common/options/global.yaml.in#L4428-L4452

sorry for being unclear here, I just meant the bluestore_compression_mode option you specify in the ceph.conf file.

Mark

>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Mark Nelson <mnelson@xxxxxxxxxx>
> Sent: 10 August 2022 22:28
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Request for Info: bluestore_compression_mode?
>
> On 8/10/22 10:08, Frank Schilder wrote:
>> Hi Mark.
>>
>>> I actually had no idea that you needed both the yaml option
>>> and the pool option configured
> [...]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx