Frank,

The only changes in ceph.conf are the compression settings; most of the cluster configuration lives in the monitor database, so my ceph.conf is rather short:

---
[global]
fsid = xxx
mon_host = [list of mons]

[mon.yyy]
public network = a.b.c.d/e
mon_rocksdb_options = "write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true,bottommost_compression=kLZ4HCCompression"
---

Note that my bottommost_compression choice is LZ4HC, which compresses better than LZ4 at the expense of higher CPU usage. My nodes have plenty of CPU to spare, so I went for LZ4HC for better space savings and fewer writes. In general, I would recommend trying a faster, less aggressive compression first; LZ4 across the board is a good starting choice.

/Z

On Wed, 18 Oct 2023 at 12:02, Frank Schilder <frans@xxxxxx> wrote:

> Hi Zakhar,
>
> Since it's a bit beyond the scope of the basics, could you please post the complete ceph.conf config section for these changes, for reference?
>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> Sent: Wednesday, October 18, 2023 6:14 AM
> To: Eugen Block
> Cc: ceph-users@xxxxxxx
> Subject: Re: Ceph 16.2.x mon compactions, disk writes
>
> Many thanks for this, Eugen! I very much appreciate yours and Mykola's efforts and insight!
>
> Another thing I noticed was a reduction of the RocksDB store size after reducing the total PG count by 30%, from 590-600 MB:
>
> 65M 3675511.sst
> 65M 3675512.sst
> 65M 3675513.sst
> 65M 3675514.sst
> 65M 3675515.sst
> 65M 3675516.sst
> 65M 3675517.sst
> 65M 3675518.sst
> 62M 3675519.sst
>
> to about half of the original size:
>
> -rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
> -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
> -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
> -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
>
> Then, when I restarted the monitors one by one before adding compression, the RocksDB store shrank even further. I am not sure why, or what exactly got automatically removed from the store:
>
> -rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
> -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
> -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
>
> I then enabled LZ4 and LZ4HC compression in our small production cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression. I specifically went for LZ4 and LZ4HC because of the balance between compression/decompression speed and impact on CPU usage. The compression doesn't seem to affect the cluster in any negative way, and the 3 monitors with compression are operating normally.
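A quick way to double-check which options string a running monitor actually picked up (assuming cephadm-style daemon names such as mon.ceph03 and mon.ceph04, as in the listings below) is to query the daemon itself, for example:

# ceph daemon mon.ceph03 config get mon_rocksdb_options
# ceph config show mon.ceph03 | grep mon_rocksdb_options

The first command needs access to the mon's admin socket (e.g. from inside the mon container); the second goes through the mgr and also shows where the value came from.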
> The effect of the compression on RocksDB store size and disk writes is quite noticeable:
>
> Compression disabled, 155 MB store.db, ~125 MB of RocksDB sst files, and ~530 MB written over 5 minutes:
>
> -rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
> -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
> -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
>
> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> 155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
> 2471602 be/4 167    6.05 M   473.24 M  0.00 %  0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> 2471633 be/4 167  188.00 K    40.91 M  0.00 %  0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> 2471603 be/4 167   16.00 K    24.16 M  0.00 %  0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
>
> Compression enabled, 60 MB store.db, ~23 MB of RocksDB sst files, and ~130 MB written over 5 minutes:
>
> -rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
> -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
>
> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> 60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
> 2052031 be/4 167 1040.00 K    83.48 M  0.00 %  0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> 2052062 be/4 167     0.00 B    40.79 M  0.00 %  0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> 2052032 be/4 167   16.00 K      4.68 M  0.00 %  0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
> 2052052 be/4 167   44.00 K      0.00 B  0.00 %  0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
>
> I haven't noticed a major CPU impact. Unfortunately I didn't specifically measure CPU time for the monitors, but overall the CPU impact of monitor store compression on our systems isn't noticeable.
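Extrapolating naively from these two 5-minute iotop samples (a single sample each, so only a rough indication):

without compression: ~530 MB / 5 min  ≈ 6.4 GB/hour ≈ 150 GB/day
with compression:    ~130 MB / 5 min  ≈ 1.6 GB/hour ≈  37 GB/day

i.e. roughly a 4x reduction in mon store writes on this cluster.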
> The CPU impact may be different for larger clusters with larger RocksDB datasets. There, perhaps compression=kLZ4Compression could be enabled by default and bottommost_compression=kLZ4HCCompression left optional; in theory this should result in somewhat lower, but much faster, compression.
>
> I hope this helps. My plan is to keep the monitors with the current settings, i.e. 3 with compression and 2 without, until the next minor release of Pacific, to see whether the monitors with a compressed RocksDB store can be upgraded without issues.
>
> /Z
>
> On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi Zakhar,
> >
> > I took a closer look into what the MONs really do (again with Mykola's help) and why manual compaction is triggered so frequently. With debug_paxos=20 I noticed that paxosservice and paxos triggered manual compactions. So I played with these values:
> >
> > paxos_service_trim_max = 1000 (default 500)
> > paxos_service_trim_min = 500 (default 250)
> > paxos_trim_max = 1000 (default 500)
> > paxos_trim_min = 500 (default 250)
> >
> > This reduced the amount of writes by a factor of 3 or 4; the iotop values fluctuate a bit, of course. As Mykola suggested, I created a tracker issue [1] to increase the default values, since they don't seem suitable for a production environment. Although I haven't tested this in production yet, I'll ask one of our customers to try it in their secondary cluster (for rbd mirroring), where they also suffer from large mon stores and heavy writes to the mon store. Your findings on compaction were quite helpful as well; we'll test that too.
> > Igor mentioned that the default bluestore_rocksdb config for OSDs will enable compression because of positive test results. If we can confirm that compression works well for MONs too, compression could be enabled by default as well.
> >
> > Regards,
> > Eugen
> >
> > [1] https://tracker.ceph.com/issues/63229
> >
> > Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >
> > > With the help of community members, I managed to enable RocksDB compression for a test monitor, and it seems to be working well.
> > >
> > > A monitor without compression writes about 750 MB to disk in 5 minutes:
> > >
> > > 4854 be/4 167    4.97 M   755.02 M  0.00 %  0.24 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> > >
> > > A monitor with LZ4 compression writes about 1/4 of that over the same time period:
> > >
> > > 2034728 be/4 167  172.00 K   199.27 M  0.00 %  0.06 % ceph-mon -n mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> > >
> > > This is caused by the apparent difference in store.db sizes.
> > >
> > > Mon store.db without compression:
> > >
> > > # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
> > > total 257196
> > > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > > drwx------ 3 167 167     4096 Aug 31 05:22 ..
> > > -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
> > > -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
> > > -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
> > > -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
> > >
> > > Mon store.db with compression:
> > >
> > > # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
> > > total 91188
> > > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > > drwx------ 3 167 167     4096 Oct 16 13:35 ..
> > > -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
> > > -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
> > >
> > > There are no apparent downsides thus far. If everything works well, I will try adding compression to the other monitors.
> > >
> > > /Z
> > >
> > > On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >
> > >> The issue persists, although to a lesser extent. Any comments from the Ceph team, please?
> > >>
> > >> /Z
> > >>
> > >> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >>
> > >>> > Some of it is transferable to RocksDB on mons nonetheless.
> > >>>
> > >>> Please point me to the relevant Ceph documentation, i.e. a description of how the various Ceph monitor and RocksDB tunables affect the operation of monitors, and I'll gladly look into it.
> > >>>
> > >>> > Please point me to such recommendations, if they're on docs.ceph.com I'll get them updated.
> > >>>
> > >>> These are the recommendations we used when we built our Pacific cluster: https://docs.ceph.com/en/pacific/start/hardware-recommendations/
> > >>>
> > >>> Our drives are four times larger than recommended by this guide. The drives are rated for < 0.5 DWPD, which is more than sufficient for boot drives and storage of rarely modified files. It is not documented or suggested anywhere that monitor processes write several hundred gigabytes of data per day, exceeding the amount of data written by the OSDs. This is why I am not convinced that what we're observing is expected behavior, and it's not easy to get a definitive answer from the Ceph community.
> > >>>
> > >>> /Z
> > >>>
> > >>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> > >>>
> > >>>> Some of it is transferable to RocksDB on mons nonetheless.
> > >>>>
> > >>>> but their specs exceed Ceph hardware recommendations by a good margin
> > >>>>
> > >>>> Please point me to such recommendations, if they're on docs.ceph.com I'll get them updated.
> > >>>>
> > >>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >>>>
> > >>>> Thank you, Anthony. As I explained to you earlier, the article you had sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand is not with OSDs but rather with the monitors and their RocksDB store. Indeed, the drives are not enterprise-grade, but their specs exceed the Ceph hardware recommendations by a good margin, they're used as boot drives only, and they aren't supposed to be written to continuously at high rates - which is unfortunately what is happening. I am trying to determine why it is happening and how the issue can be alleviated or resolved; unfortunately, monitor RocksDB usage and tunables appear not to be documented at all.
> > >>>>
> > >>>> /Z
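For some rough context on the endurance concern above: a drive rated for 0.5 DWPD is rated for writing about half its capacity per day. The uncompressed monitor quoted earlier was writing on the order of 750 MB per 5 minutes, i.e. roughly 9 GB/hour or a bit over 200 GB/day, so a 0.5 DWPD device would need to be in the 400+ GB range just to absorb the mon store traffic within its rating - before counting any other writes or SSD-internal write amplification. This is only a back-of-envelope estimate from the iotop samples in this thread, not a measured endurance figure.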
> > >>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> > >>>>
> > >>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with Reef you would experience fewer writes. Universal compaction might also help, but in the end this SSD is a client SKU and really not suited for enterprise use. If you had the 1TB SKU you'd get much longer life, or you could change the overprovisioning on the ones you have.
> > >>>>>
> > >>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >>>>>
> > >>>>> I would very much appreciate it if someone with a better understanding of monitor internals and use of RocksDB could please chip in.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx