Frank,

The only changes in ceph.conf are the compression settings; most of the cluster configuration lives in the monitor database, so my ceph.conf is rather short:

---
[global]
fsid = xxx
mon_host = [list of mons]

[mon.yyy]
public network = a.b.c.d/e
mon_rocksdb_options = "write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true,bottommost_compression=kLZ4HCCompression"
---

Note that my bottommost_compression choice is LZ4HC, which compresses better than LZ4 at the expense of higher CPU usage. My nodes have plenty of CPU to spare, so I went for LZ4HC for better space savings and fewer writes. In general, I would recommend trying a faster, less aggressive compression first; LZ4 across the board is a good starting choice.

/Z

On Wed, 18 Oct 2023 at 12:02, Frank Schilder <frans@xxxxxx> wrote:

> Hi Zakhar,
>
> Since it's a bit beyond the scope of the basics, could you please post the complete ceph.conf config section for these changes, for reference?
>
> Thanks!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> Sent: Wednesday, October 18, 2023 6:14 AM
> To: Eugen Block
> Cc: ceph-users@xxxxxxx
> Subject: Re: Ceph 16.2.x mon compactions, disk writes
>
> Many thanks for this, Eugen! I very much appreciate yours and Mykola's efforts and insight!
>
> Another thing I noticed was a reduction of the RocksDB store size after reducing the total PG count by 30%, from 590-600 MB:
>
> 65M 3675511.sst
> 65M 3675512.sst
> 65M 3675513.sst
> 65M 3675514.sst
> 65M 3675515.sst
> 65M 3675516.sst
> 65M 3675517.sst
> 65M 3675518.sst
> 62M 3675519.sst
>
> to about half of the original size:
>
> -rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
> -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
> -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
> -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
>
> Then, when I restarted the monitors one by one before adding compression, the RocksDB store shrank even further. I am not sure why, or what exactly got automatically removed from the store:
>
> -rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
> -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
> -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
>
> I then enabled LZ4 and LZ4HC compression in our small production cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression. I specifically went for LZ4 and LZ4HC because of the balance between compression/decompression speed and impact on CPU usage. The compression doesn't seem to affect the cluster in any negative way, and the 3 monitors with compression are operating normally.
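A quick way to double-check which options string a running monitor actually picked up (assuming cephadm-style daemon names such as mon.ceph03 and mon.ceph04, as in the listings below) is to query the daemon itself, for example:

# ceph daemon mon.ceph03 config get mon_rocksdb_options
# ceph config show mon.ceph03 | grep mon_rocksdb_options

The first command needs access to the mon's admin socket (e.g. from inside the mon container); the second goes through the mgr and also shows where the value came from.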
> The effect of the compression on RocksDB store size and disk writes is quite noticeable:
>
> Compression disabled, 155 MB store.db, ~125 MB of RocksDB sst files, and ~530 MB written over 5 minutes:
>
> -rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
> -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
> -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
>
> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> 155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
> 2471602 be/4 167    6.05 M   473.24 M  0.00 %  0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> 2471633 be/4 167  188.00 K    40.91 M  0.00 %  0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> 2471603 be/4 167   16.00 K    24.16 M  0.00 %  0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
>
> Compression enabled, 60 MB store.db, ~23 MB of RocksDB sst files, and ~130 MB written over 5 minutes:
>
> -rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
> -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
>
> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> 60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
> 2052031 be/4 167 1040.00 K    83.48 M  0.00 %  0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> 2052062 be/4 167     0.00 B    40.79 M  0.00 %  0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> 2052032 be/4 167   16.00 K      4.68 M  0.00 %  0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
> 2052052 be/4 167   44.00 K      0.00 B  0.00 %  0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
>
> I haven't noticed a major CPU impact. Unfortunately I didn't specifically measure CPU time for the monitors, but overall the CPU impact of monitor store compression on our systems isn't noticeable.
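Extrapolating naively from these two 5-minute iotop samples (a single sample each, so only a rough indication):

without compression: ~530 MB / 5 min  ≈ 6.4 GB/hour ≈ 150 GB/day
with compression:    ~130 MB / 5 min  ≈ 1.6 GB/hour ≈  37 GB/day

i.e. roughly a 4x reduction in mon store writes on this cluster.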
> The CPU impact may be different for larger clusters with larger RocksDB datasets. There, perhaps compression=kLZ4Compression could be enabled by default and bottommost_compression=kLZ4HCCompression left optional; in theory this should result in somewhat lower, but much faster, compression.
>
> I hope this helps. My plan is to keep the monitors with the current settings, i.e. 3 with compression and 2 without, until the next minor release of Pacific, to see whether the monitors with a compressed RocksDB store can be upgraded without issues.
>
> /Z
>
> On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi Zakhar,
> >
> > I took a closer look into what the MONs really do (again with Mykola's help) and why manual compaction is triggered so frequently. With debug_paxos=20 I noticed that paxosservice and paxos triggered manual compactions. So I played with these values:
> >
> > paxos_service_trim_max = 1000 (default 500)
> > paxos_service_trim_min = 500 (default 250)
> > paxos_trim_max = 1000 (default 500)
> > paxos_trim_min = 500 (default 250)
> >
> > This reduced the amount of writes by a factor of 3 or 4; the iotop values fluctuate a bit, of course. As Mykola suggested, I created a tracker issue [1] to increase the default values, since they don't seem suitable for a production environment. Although I haven't tested this in production yet, I'll ask one of our customers to try it in their secondary cluster (for rbd mirroring), where they also suffer from large mon stores and heavy writes to the mon store. Your findings on compaction were quite helpful as well; we'll test that too.
> > Igor mentioned that the default bluestore_rocksdb config for OSDs will enable compression because of positive test results. If we can confirm that compression works well for MONs too, compression could be enabled by default as well.
> >
> > Regards,
> > Eugen
> >
> > [1] https://tracker.ceph.com/issues/63229
> >
> > Zitat von Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >
> > > With the help of community members, I managed to enable RocksDB compression for a test monitor, and it seems to be working well.
> > >
> > > A monitor without compression writes about 750 MB to disk in 5 minutes:
> > >
> > > 4854 be/4 167    4.97 M   755.02 M  0.00 %  0.24 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> > >
> > > A monitor with LZ4 compression writes about 1/4 of that over the same time period:
> > >
> > > 2034728 be/4 167  172.00 K   199.27 M  0.00 %  0.06 % ceph-mon -n mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> > >
> > > This is caused by the apparent difference in store.db sizes.
> > >
> > > Mon store.db without compression:
> > >
> > > # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
> > > total 257196
> > > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > > drwx------ 3 167 167     4096 Aug 31 05:22 ..
> > > -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
> > > -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
> > > -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
> > > -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
> > >
> > > Mon store.db with compression:
> > >
> > > # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
> > > total 91188
> > > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > > drwx------ 3 167 167     4096 Oct 16 13:35 ..
> > > -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
> > > -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
> > >
> > > There are no apparent downsides thus far. If everything works well, I will try adding compression to the other monitors.
> > >
> > > /Z
> > >
> > > On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >
> > >> The issue persists, although to a lesser extent. Any comments from the Ceph team, please?
> > >>
> > >> /Z
> > >>
> > >> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >>
> > >>> > Some of it is transferable to RocksDB on mons nonetheless.
> > >>>
> > >>> Please point me to the relevant Ceph documentation, i.e. a description of how the various Ceph monitor and RocksDB tunables affect the operation of monitors, and I'll gladly look into it.
> > >>>
> > >>> > Please point me to such recommendations, if they're on docs.ceph.com I'll get them updated.
> > >>>
> > >>> These are the recommendations we used when we built our Pacific cluster: https://docs.ceph.com/en/pacific/start/hardware-recommendations/
> > >>>
> > >>> Our drives are four times larger than recommended by this guide. The drives are rated for < 0.5 DWPD, which is more than sufficient for boot drives and storage of rarely modified files. It is not documented or suggested anywhere that monitor processes write several hundred gigabytes of data per day, exceeding the amount of data written by the OSDs. This is why I am not convinced that what we're observing is expected behavior, and it's not easy to get a definitive answer from the Ceph community.
> > >>>
> > >>> /Z
> > >>>
> > >>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> > >>>
> > >>>> Some of it is transferable to RocksDB on mons nonetheless.
> > >>>>
> > >>>> but their specs exceed Ceph hardware recommendations by a good margin
> > >>>>
> > >>>> Please point me to such recommendations, if they're on docs.ceph.com I'll get them updated.
> > >>>>
> > >>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >>>>
> > >>>> Thank you, Anthony. As I explained to you earlier, the article you had sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand is not with OSDs but rather with the monitors and their RocksDB store. Indeed, the drives are not enterprise-grade, but their specs exceed the Ceph hardware recommendations by a good margin, they're used as boot drives only, and they aren't supposed to be written to continuously at high rates - which is unfortunately what is happening. I am trying to determine why it is happening and how the issue can be alleviated or resolved; unfortunately, monitor RocksDB usage and tunables appear not to be documented at all.
> > >>>>
> > >>>> /Z
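For some rough context on the endurance concern above: a drive rated for 0.5 DWPD is rated for writing about half its capacity per day. The uncompressed monitor quoted earlier was writing on the order of 750 MB per 5 minutes, i.e. roughly 9 GB/hour or a bit over 200 GB/day, so a 0.5 DWPD device would need to be in the 400+ GB range just to absorb the mon store traffic within its rating - before counting any other writes or SSD-internal write amplification. This is only a back-of-envelope estimate from the iotop samples in this thread, not a measured endurance figure.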
> > >>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> > >>>>
> > >>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with Reef you would experience fewer writes. Universal compaction might also help, but in the end this SSD is a client SKU and really not suited for enterprise use. If you had the 1TB SKU you'd get much longer life, or you could change the overprovisioning on the ones you have.
> > >>>>>
> > >>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> > >>>>>
> > >>>>> I would very much appreciate it if someone with a better understanding of monitor internals and use of RocksDB could please chip in.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx