I remember that I found the part which said "if something goes wrong,
monitors will fail" rather discouraging :-)

/Z

On Tue, 16 Apr 2024 at 18:59, Eugen Block <eblock@xxxxxx> wrote:

> Sorry, I meant extra-entrypoint-arguments:
>
> https://www.spinics.net/lists/ceph-users/msg79251.html
>
> Zitat von Eugen Block <eblock@xxxxxx>:
>
> > You can use the extra container arguments I pointed out a few months
> > ago. Those work in my test clusters, although I haven't enabled that
> > in production yet. But it shouldn't make a difference if it's a test
> > cluster or not. 😉
> >
> > Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >
> >> Hi,
> >>
> >>> Did you notice any downsides with your compression settings so far?
> >>
> >> None, at least on our systems, except that I haven't found a way to
> >> make the settings persist.
> >>
> >>> Do you have all mons on compression now?
> >>
> >> I have 3 out of 5 monitors with compression and 2 without it. The 2
> >> monitors with uncompressed RocksDB have much larger disks which do not
> >> suffer from writes as much as the other 3. I keep them uncompressed
> >> "just in case", i.e. for the unlikely event that the 3 monitors with
> >> compressed RocksDB fail or have issues specifically because of the
> >> compression. I have to say that this hasn't happened yet, and this
> >> precaution may be unnecessary.
> >>
> >>> Did release updates go through without issues?
> >>
> >> In our case, container updates overwrite the monitors' configurations
> >> and reset the RocksDB options, so each updated monitor runs with no
> >> RocksDB compression until the option is added back manually. Other
> >> than that, I have not encountered any issues related to compression
> >> during the updates.
> >>
> >>> Do you know if this also works with Reef (we see massive writes
> >>> there as well)?
> >>
> >> Unfortunately, I can't comment on Reef as we're still using Pacific.
> >>
> >> /Z
> >>
> >> On Tue, 16 Apr 2024 at 18:08, Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx> wrote:
> >>
> >>> Hi Zakhar, hello List,
> >>>
> >>> I just wanted to follow up on this and ask a few questions:
> >>>
> >>> Did you notice any downsides with your compression settings so far?
> >>> Do you have all mons on compression now?
> >>> Did release updates go through without issues?
> >>> Do you know if this also works with Reef (we see massive writes
> >>> there as well)?
> >>>
> >>> Can you briefly tabulate the commands you used to persistently set
> >>> the compression options?
> >>>
> >>> Thanks so much,
> >>>
> >>> Dietmar
> >>>
> >>> On 10/18/23 06:14, Zakhar Kirpichenko wrote:
> >>>> Many thanks for this, Eugen! I very much appreciate yours and
> >>>> Mykola's efforts and insight!
> >>>>
> >>>> Another thing I noticed was a reduction of the RocksDB store after
> >>>> reducing the total PG number by 30%, from 590-600 MB:
> >>>>
> >>>> 65M 3675511.sst
> >>>> 65M 3675512.sst
> >>>> 65M 3675513.sst
> >>>> 65M 3675514.sst
> >>>> 65M 3675515.sst
> >>>> 65M 3675516.sst
> >>>> 65M 3675517.sst
> >>>> 65M 3675518.sst
> >>>> 62M 3675519.sst
> >>>>
> >>>> to about half of the original size:
> >>>>
> >>>> -rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
> >>>> -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
> >>>> -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
> >>>> -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
> >>>>
> >>>> Then, when I restarted the monitors one by one before adding
> >>>> compression, the RocksDB store shrank even further. I am not sure
> >>>> why and what exactly was automatically removed from the store:
> >>>>
> >>>> -rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
> >>>> -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
> >>>> -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
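> >>>>
> >>>> My guess is that the on-restart shrinkage is simply RocksDB
> >>>> compacting the store at startup. For what it's worth, a compaction
> >>>> can presumably also be triggered without a restart; this is only a
> >>>> sketch (mon name and fsid are placeholders, and I haven't verified
> >>>> that compaction explains the whole reduction):
> >>>>
> >>>> # ask one monitor to compact its RocksDB store
> >>>> ceph tell mon.ceph04 compact
> >>>> # check the store size afterwards
> >>>> du -sh /var/lib/ceph/<fsid>/mon.<name>/store.db
> >>>> # or compact automatically at every mon start
> >>>> ceph config set mon mon_compact_on_start true
> >>>>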
> >>>> Then I enabled LZ4 and LZ4HC compression in our small production
> >>>> cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors:
> >>>> compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
> >>>> I specifically went for LZ4 and LZ4HC because of the balance between
> >>>> compression/decompression speed and impact on CPU usage. The
> >>>> compression doesn't seem to affect the cluster in any negative way,
> >>>> and the 3 monitors with compression are operating normally. The
> >>>> effect of the compression on RocksDB store size and disk writes is
> >>>> quite noticeable:
> >>>>
> >>>> Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and
> >>>> ~530 MB of writes over 5 minutes:
> >>>>
> >>>> -rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
> >>>> -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
> >>>> -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
> >>>>
> >>>> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> >>>> 155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
> >>>> 2471602 be/4 167  6.05 M 473.24 M 0.00 % 0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>> 2471633 be/4 167 188.00 K 40.91 M 0.00 % 0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> >>>> 2471603 be/4 167  16.00 K 24.16 M 0.00 % 0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
> >>>>
> >>>> Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB
> >>>> of writes over 5 minutes:
> >>>>
> >>>> -rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
> >>>> -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
> >>>>
> >>>> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> >>>> 60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
> >>>> 2052031 be/4 167 1040.00 K 83.48 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>> 2052062 be/4 167    0.00 B 40.79 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> >>>> 2052032 be/4 167   16.00 K  4.68 M 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
> >>>> 2052052 be/4 167   44.00 K  0.00 B 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
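> >>>>
> >>>> For reference, this is roughly how the compression option can be
> >>>> applied; treat it as a sketch rather than a recipe. The config path,
> >>>> unit name and placement are illustrative, the non-compression values
> >>>> should be the Pacific defaults, and I haven't tried the service-spec
> >>>> variant (cephadm's extra_entrypoint_args, the extra-entrypoint-arguments
> >>>> approach mentioned elsewhere in this thread) myself:
> >>>>
> >>>> # per monitor (not update-proof): append the option to the mon's local
> >>>> # config, which for cephadm lives on the host at
> >>>> # /var/lib/ceph/<fsid>/mon.<name>/config, then restart that mon
> >>>> cat >> /var/lib/ceph/<fsid>/mon.<name>/config <<'EOF'
> >>>> [mon]
> >>>> mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression,level_compaction_dynamic_level_bytes=true
> >>>> EOF
> >>>> systemctl restart ceph-<fsid>@mon.<name>.service
> >>>>
> >>>> # persistent variant: pass the option to ceph-mon via the mon service
> >>>> # spec (keep your real placement; the count here is only an example)
> >>>> cat > mon-spec.yaml <<'EOF'
> >>>> service_type: mon
> >>>> placement:
> >>>>   count: 5
> >>>> extra_entrypoint_args:
> >>>>   - "--mon-rocksdb-options=write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression,level_compaction_dynamic_level_bytes=true"
> >>>> EOF
> >>>> ceph orch apply -i mon-spec.yaml
> >>>>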
> >>>> I haven't noticed a major CPU impact. Unfortunately I didn't
> >>>> specifically measure CPU time for the monitors, but overall the CPU
> >>>> impact of monitor store compression on our systems isn't noticeable.
> >>>> This may be different for larger clusters with larger RocksDB
> >>>> datasets. Perhaps compression=kLZ4Compression could then be enabled
> >>>> by default and bottommost_compression=kLZ4HCCompression could remain
> >>>> optional; in theory this should result in lower but much faster
> >>>> compression.
> >>>>
> >>>> I hope this helps. My plan is to keep the monitors with the current
> >>>> settings, i.e. 3 with compression + 2 without compression, until the
> >>>> next minor release of Pacific to see whether the monitors with a
> >>>> compressed RocksDB store can be upgraded without issues.
> >>>>
> >>>> /Z
> >>>>
> >>>> On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:
> >>>>
> >>>>> Hi Zakhar,
> >>>>>
> >>>>> I took a closer look into what the MONs really do (again with
> >>>>> Mykola's help) and why manual compaction is triggered so frequently.
> >>>>> With debug_paxos=20 I noticed that paxosservice and paxos triggered
> >>>>> manual compactions. So I played with these values:
> >>>>>
> >>>>> paxos_service_trim_max = 1000 (default 500)
> >>>>> paxos_service_trim_min = 500 (default 250)
> >>>>> paxos_trim_max = 1000 (default 500)
> >>>>> paxos_trim_min = 500 (default 250)
> >>>>>
> >>>>> This reduced the amount of writes by a factor of 3 or 4; the iotop
> >>>>> values fluctuate a bit, of course. As Mykola suggested, I created a
> >>>>> tracker issue [1] to increase the default values since they don't
> >>>>> seem suitable for a production environment. Although I haven't
> >>>>> tested this in production yet, I'll ask one of our customers to do
> >>>>> so in their secondary cluster (for rbd mirroring), where they also
> >>>>> suffer from large mon stores and heavy writes to the mon store. Your
> >>>>> findings about the compaction were quite helpful as well; we'll test
> >>>>> that too. Igor mentioned that the default bluestore_rocksdb config
> >>>>> for OSDs will enable compression because of positive test results.
> >>>>> If we can confirm that compression works well for MONs too,
> >>>>> compression could be enabled by default as well.
> >>>>>
> >>>>> Regards,
> >>>>> Eugen
> >>>>>
> >>>>> [1] https://tracker.ceph.com/issues/63229
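> >>>>>
> >>>>> In case someone wants to try the same: these trim settings are
> >>>>> ordinary config options, so they should be adjustable along these
> >>>>> lines (a sketch with the values from above, not yet verified in
> >>>>> production on my side):
> >>>>>
> >>>>> ceph config set mon paxos_service_trim_max 1000
> >>>>> ceph config set mon paxos_service_trim_min 500
> >>>>> ceph config set mon paxos_trim_max 1000
> >>>>> ceph config set mon paxos_trim_min 500
> >>>>>
> >>>>> and then watching iotop and the mon store size to confirm the effect.
> >>>>>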
> >>>>> Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>>>>
> >>>>>> With the help of community members, I managed to enable RocksDB
> >>>>>> compression for a test monitor, and it seems to be working well.
> >>>>>>
> >>>>>> Monitor w/o compression writes about 750 MB to disk in 5 minutes:
> >>>>>>
> >>>>>> 4854 be/4 167 4.97 M 755.02 M 0.00 % 0.24 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>>>>
> >>>>>> Monitor with LZ4 compression writes about 1/4 of that over the same
> >>>>>> time period:
> >>>>>>
> >>>>>> 2034728 be/4 167 172.00 K 199.27 M 0.00 % 0.06 % ceph-mon -n mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>>>>
> >>>>>> This is caused by the apparent difference in store.db sizes.
> >>>>>>
> >>>>>> Mon store.db w/o compression:
> >>>>>>
> >>>>>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
> >>>>>> total 257196
> >>>>>> drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> >>>>>> drwx------ 3 167 167     4096 Aug 31 05:22 ..
> >>>>>> -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
> >>>>>> -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
> >>>>>> -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
> >>>>>> -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
> >>>>>>
> >>>>>> Mon store.db with compression:
> >>>>>>
> >>>>>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
> >>>>>> total 91188
> >>>>>> drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> >>>>>> drwx------ 3 167 167     4096 Oct 16 13:35 ..
> >>>>>> -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
> >>>>>> -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
> >>>>>>
> >>>>>> There are no apparent downsides thus far. If everything works well,
> >>>>>> I will try adding compression to the other monitors.
> >>>>>>
> >>>>>> /Z
> >>>>>>
> >>>>>> On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>>> The issue persists, although to a lesser extent. Any comments from
> >>>>>>> the Ceph team please?
> >>>>>>>
> >>>>>>> /Z
> >>>>>>>
> >>>>>>> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
> >>>>>>>>
> >>>>>>>> Please point me to relevant Ceph documentation, i.e. a
> >>>>>>>> description of how the various Ceph monitor and RocksDB tunables
> >>>>>>>> affect the operation of monitors, and I'll gladly look into it.
> >>>>>>>>
> >>>>>>>>> Please point me to such recommendations, if they're on
> >>>>>>>>> docs.ceph.com I'll get them updated.
> >>>>>>>>
> >>>>>>>> These are the recommendations we used when we built our Pacific
> >>>>>>>> cluster:
> >>>>>>>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
> >>>>>>>>
> >>>>>>>> Our drives are four times larger than recommended by this guide.
> >>>>>>>> The drives are rated for < 0.5 DWPD, which is more than
> >>>>>>>> sufficient for boot drives and storage of rarely modified files.
> >>>>>>>> It is not documented or suggested anywhere that monitor processes
> >>>>>>>> write several hundred gigabytes of data per day, exceeding the
> >>>>>>>> amount of data written by OSDs, which is why I am not convinced
> >>>>>>>> that what we're observing is expected behavior. But it's not easy
> >>>>>>>> to get a definitive answer from the Ceph community.
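> >>>>>>>>
> >>>>>>>> For scale: at the rates iotop shows for a single mon's rocksdb
> >>>>>>>> threads, roughly 500-750 MB per 5 minutes, that is roughly
> >>>>>>>> 145-215 GB per day. A rough way to cross-check the cumulative
> >>>>>>>> number (assuming a single mon per host and run as root) is the
> >>>>>>>> kernel's per-process I/O accounting:
> >>>>>>>>
> >>>>>>>> # cumulative bytes written by the mon since the process started
> >>>>>>>> pid=$(pidof ceph-mon)
> >>>>>>>> grep -E '^(write_bytes|cancelled_write_bytes)' /proc/$pid/io
> >>>>>>>>
> >>>>>>>> Dividing write_bytes by the daemon's uptime gives the average
> >>>>>>>> write rate.
> >>>>>>>>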
> >>>>>>>> /Z
> >>>>>>>>
> >>>>>>>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
> >>>>>>>>>
> >>>>>>>>> but their specs exceed Ceph hardware recommendations by a good
> >>>>>>>>> margin
> >>>>>>>>>
> >>>>>>>>> Please point me to such recommendations, if they're on
> >>>>>>>>> docs.ceph.com I'll get them updated.
> >>>>>>>>>
> >>>>>>>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Thank you, Anthony. As I explained to you earlier, the article
> >>>>>>>>> you sent is about RocksDB tuning for BlueStore OSDs, while the
> >>>>>>>>> issue at hand is not with OSDs but with the monitors and their
> >>>>>>>>> RocksDB store. Indeed, the drives are not enterprise-grade, but
> >>>>>>>>> their specs exceed the Ceph hardware recommendations by a good
> >>>>>>>>> margin. They are used as boot drives only and aren't supposed to
> >>>>>>>>> be written to continuously at high rates, which is unfortunately
> >>>>>>>>> what is happening. I am trying to determine why it is happening
> >>>>>>>>> and how the issue can be alleviated or resolved; unfortunately,
> >>>>>>>>> monitor RocksDB usage and tunables appear to be largely
> >>>>>>>>> undocumented.
> >>>>>>>>>
> >>>>>>>>> /Z
> >>>>>>>>>
> >>>>>>>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that
> >>>>>>>>>> with Reef you would experience fewer writes. Universal
> >>>>>>>>>> compaction might also help, but in the end this SSD is a client
> >>>>>>>>>> SKU and really not suited for enterprise use. If you had the
> >>>>>>>>>> 1TB SKU you'd get much longer life, or you could change the
> >>>>>>>>>> overprovisioning on the ones you have.
> >>>>>>>>>>
> >>>>>>>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I would very much appreciate it if someone with a better
> >>>>>>>>>> understanding of monitor internals and use of RocksDB could
> >>>>>>>>>> please chip in.
> >>>
> >>> --
> >>> _________________________________________________________
> >>> D i e t m a r  R i e d e r
> >>> Innsbruck Medical University
> >>> Biocenter - Institute of Bioinformatics
> >>> Innrain 80, 6020 Innsbruck
> >>> Phone: +43 512 9003 71402 | Mobile: +43 676 8716 72402
> >>> Email: dietmar.rieder@xxxxxxxxxxx
> >>> Web:   http://www.icbi.at
> >>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx