Many thanks for this, Eugen! I very much appreciate your and Mykola's efforts and insight!

Another thing I noticed was a reduction of the RocksDB store after the reduction of the total PG number by 30%, from 590-600 MB:

65M 3675511.sst
65M 3675512.sst
65M 3675513.sst
65M 3675514.sst
65M 3675515.sst
65M 3675516.sst
65M 3675517.sst
65M 3675518.sst
62M 3675519.sst

to about half of the original size:

-rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
-rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
-rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
-rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst

Then, when I restarted the monitors one by one before adding compression, the RocksDB store shrank even further. I am not sure why, or what exactly got automatically removed from the store:

-rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
-rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
-rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst

Then I enabled LZ4 and LZ4HC compression in our small production cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression. I specifically went for LZ4 and LZ4HC because of the balance between compression/decompression speed and impact on CPU usage. The compression doesn't seem to affect the cluster in any negative way; the 3 monitors with compression are operating normally.
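
For anyone who wants to reproduce this, here is a minimal sketch of how the two compression settings can be merged into an existing comma-separated RocksDB option string. It assumes the string in question is the monitor's mon_rocksdb_options setting, and the base value shown is only an illustrative assumption, so check what your own monitors actually use. The helper names parse_options and merge_rocksdb_options are hypothetical.

```python
def parse_options(option_string: str) -> dict[str, str]:
    """Split a comma-separated key=value option string into a dict."""
    options: dict[str, str] = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        options[key.strip()] = value.strip()
    return options


def merge_rocksdb_options(base: str, overrides: str) -> str:
    """Return base with every key from overrides replaced or appended."""
    merged = parse_options(base)
    merged.update(parse_options(overrides))
    return ",".join(f"{key}={value}" for key, value in merged.items())


# Assumed base value, for illustration only; not taken from this thread.
BASE = (
    "write_buffer_size=33554432,"
    "compression=kNoCompression,"
    "level_compaction_dynamic_level_bytes=true"
)

# The two settings enabled on 3 of the 5 monitors.
OVERRIDES = "compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression"

if __name__ == "__main__":
    print(merge_rocksdb_options(BASE, OVERRIDES))
```

The sketch only builds the string. How it is delivered to the monitors depends on the deployment and is not covered here.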

The effect of the compression on RocksDB store size and disk writes is quite noticeable.

Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB of writes over 5 minutes:

-rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
-rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
-rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst

# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
2471602 be/4 167 6.05 M 473.24 M 0.00 % 0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2471633 be/4 167 188.00 K 40.91 M 0.00 % 0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
2471603 be/4 167 16.00 K 24.16 M 0.00 % 0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]

Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of writes over 5 minutes:

-rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
-rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst

# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
2052031 be/4 167 1040.00 K 83.48 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2052062 be/4 167 0.00 B 40.79 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
2052032 be/4 167 16.00 K 4.68 M 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
2052052 be/4 167 44.00 K 0.00 B 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [msgr-worker-0]

I haven't noticed a major CPU impact. Unfortunately, I didn't specifically measure CPU time for the monitors, but overall the CPU impact of monitor store compression on our systems isn't noticeable. This may be different for larger clusters with larger RocksDB datasets. In that case, perhaps compression=kLZ4Compression could be enabled by default and bottommost_compression=kLZ4HCCompression left optional; in theory this should result in a lower compression ratio but much faster compression.

I hope this helps. My plan is to keep the monitors with the current settings, i.e. 3 with compression + 2 without compression, until the next minor release of Pacific to see whether the monitors with compressed RocksDB store can be upgraded without issues.

/Z

On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:

> Hi Zakhar,
>
> I took a closer look into what the MONs really do (again with Mykola's
> help) and why manual compaction is triggered so frequently.
> With debug_paxos=20 I noticed that paxosservice and paxos triggered manual
> compactions. So I played with these values:
>
> paxos_service_trim_max = 1000 (default 500)
> paxos_service_trim_min = 500 (default 250)
> paxos_trim_max = 1000 (default 500)
> paxos_trim_min = 500 (default 250)
>
> This reduced the amount of writes by a factor of 3 or 4; the iotop
> values are fluctuating a bit, of course. As Mykola suggested, I created
> a tracker issue [1] to increase the default values since they don't
> seem suitable for a production environment. Although I haven't
> tested that in production yet, I'll ask one of our customers to do that
> in their secondary cluster (for rbd mirroring), where they also suffer
> from large mon stores and heavy writes to the mon store. Your findings
> with the compaction were quite helpful as well, we'll test that as well.
> Igor mentioned that the default bluestore_rocksdb config for OSDs will
> enable compression because of positive test results. If we can confirm
> that compression works well for MONs too, compression could be enabled
> by default as well.
>
> Regards,
> Eugen
>
> [1] https://tracker.ceph.com/issues/63229
>
> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > With the help of community members, I managed to enable RocksDB compression
> > for a test monitor, and it seems to be working well.
> >
> > Monitor w/o compression writes about 750 MB to disk in 5 minutes:
> >
> > 4854 be/4 167 4.97 M 755.02 M 0.00 % 0.24 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >
> > Monitor with LZ4 compression writes about 1/4 of that over the same time
> > period:
> >
> > 2034728 be/4 167 172.00 K 199.27 M 0.00 % 0.06 % ceph-mon -n mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >
> > This is caused by the apparent difference in store.db sizes.
> >
> > Mon store.db w/o compression:
> >
> > # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
> > total 257196
> > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > drwx------ 3 167 167     4096 Aug 31 05:22 ..
> > -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
> > -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
> > -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
> > -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
> >
> > Mon store.db with compression:
> >
> > # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
> > total 91188
> > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > drwx------ 3 167 167     4096 Oct 16 13:35 ..
> > -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
> > -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
> >
> > There are no apparent downsides thus far. If everything works well, I will
> > try adding compression to other monitors.
> >
> > /Z
> >
> > On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >
> >> The issue persists, although to a lesser extent.
> >> Any comments from the Ceph team please?
> >>
> >> /Z
> >>
> >> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>
> >>> > Some of it is transferable to RocksDB on mons nonetheless.
> >>>
> >>> Please point me to relevant Ceph documentation, i.e. a description of how
> >>> various Ceph monitor and RocksDB tunables affect the operations of
> >>> monitors, I'll gladly look into it.
> >>>
> >>> > Please point me to such recommendations, if they're on docs.ceph.com I'll get them updated.
> >>>
> >>> These are the recommendations we used when we built our Pacific cluster:
> >>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
> >>>
> >>> Our drives are 4 times larger than recommended by this guide. The drives
> >>> are rated for < 0.5 DWPD, which is more than sufficient for boot drives and
> >>> storage of rarely modified files. It is not documented or suggested
> >>> anywhere that monitor processes write several hundred gigabytes of data per
> >>> day, exceeding the amount of data written by OSDs. Which is why I am not
> >>> convinced that what we're observing is expected behavior, but it's not easy
> >>> to get a definitive answer from the Ceph community.
> >>>
> >>> /Z
> >>>
> >>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx>
> >>> wrote:
> >>>
> >>>> Some of it is transferable to RocksDB on mons nonetheless.
> >>>>
> >>>> but their specs exceed Ceph hardware recommendations by a good margin
> >>>>
> >>>>
> >>>> Please point me to such recommendations, if they're on docs.ceph.com I'll
> >>>> get them updated.
> >>>>
> >>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>
> >>>> Thank you, Anthony. As I explained to you earlier, the article you had
> >>>> sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand is
> >>>> not with OSDs but rather monitors and their RocksDB store.
> >>>> Indeed, the drives are not enterprise-grade, but their specs exceed Ceph
> >>>> hardware recommendations by a good margin, they're being used as boot
> >>>> drives only and aren't supposed to be written to continuously at high
> >>>> rates - which is what unfortunately is happening. I am trying to determine
> >>>> why it is happening and how the issue can be alleviated or resolved,
> >>>> unfortunately monitor RocksDB usage and tunables appear to be not
> >>>> documented at all.
> >>>>
> >>>> /Z
> >>>>
> >>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with
> >>>>> Reef you would experience fewer writes. Universal compaction might also
> >>>>> help, but in the end this SSD is a client SKU and really not suited for
> >>>>> enterprise use. If you had the 1TB SKU you'd get much longer life, or you
> >>>>> could change the overprovisioning on the ones you have.
> >>>>>
> >>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>>
> >>>>> I would very much appreciate it if someone with a better understanding of
> >>>>> monitor internals and use of RocksDB could please chip in.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx