Re: [EXTERN] Re: Ceph 16.2.x mon compactions, disk writes

I remember that I found the part which said "if something goes wrong,
monitors will fail" rather discouraging :-)

/Z

On Tue, 16 Apr 2024 at 18:59, Eugen Block <eblock@xxxxxx> wrote:

> Sorry, I meant extra-entrypoint-arguments:
>
> https://www.spinics.net/lists/ceph-users/msg79251.html
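>
> For reference, a minimal sketch of what such a mon service spec could look
> like (untested here; it assumes a cephadm version that supports
> extra_entrypoint_args in specs, the hostnames are placeholders, and the
> RocksDB option string is only an example: keep whatever defaults your
> release uses and append the compression settings):
>
>   # mon.yaml, applied with: ceph orch apply -i mon.yaml
>   service_type: mon
>   placement:
>     hosts:
>       - ceph01
>       - ceph02
>       - ceph03
>   extra_entrypoint_args:
>     - "--mon-rocksdb-options=write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression"
>
> Since cephadm regenerates the daemon unit files from the service spec,
> arguments passed this way should survive redeployments and upgrades,
> unlike manual edits to unit.run.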
>
> Quoting Eugen Block <eblock@xxxxxx>:
>
> > You can use the extra container arguments I pointed out a few months
> > ago. Those work in my test clusters, although I haven’t enabled that
> > in production yet. But it shouldn’t make a difference if it’s a test
> > cluster or not. 😉
> >
> > Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >
> >> Hi,
> >>
> >>> Did you notice any downsides with your compression settings so far?
> >>
> >> None, at least on our systems, except that I haven't found a way to
> >> make the settings persist.
> >>
> >>> Do you have all mons now on compression?
> >>
> >> I have 3 out of 5 monitors with compression and 2 without it. The 2
> >> monitors with uncompressed RocksDB have much larger disks, which do not
> >> suffer from the writes as much as the other 3. I keep them uncompressed
> >> "just in case", i.e. for the unlikely event that the 3 monitors with
> >> compressed RocksDB fail or have issues specifically because of the
> >> compression. I have to say that this hasn't happened yet, and the
> >> precaution may be unnecessary.
> >>
> >>> Did release updates go through without issues?
> >>
> >> In our case, container updates overwrite the monitors' configurations
> >> and reset the RocksDB options, so each updated monitor runs without
> >> RocksDB compression until the setting is added back manually. Other
> >> than that, I have not encountered any compression-related issues during
> >> the updates.
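> >>
> >> (A quick way to spot a monitor that has silently lost the setting,
> >> assuming your RocksDB build writes OPTIONS files into the store
> >> directory, is something like:
> >>
> >>   grep -H compression /var/lib/ceph/<fsid>/mon.<host>/store.db/OPTIONS-*
> >>
> >> which should show kLZ4Compression/kLZ4HCCompression on a compressed
> >> monitor and kNoCompression otherwise. The path and file names above are
> >> placeholders; adjust them to your deployment.)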
> >>
> >>> Do you know if this also works with Reef (we see massive writes there
> >>> as well)?
> >>
> >> Unfortunately, I can't comment on Reef as we're still using Pacific.
> >>
> >> /Z
> >>
> >> On Tue, 16 Apr 2024 at 18:08, Dietmar Rieder <
> dietmar.rieder@xxxxxxxxxxx>
> >> wrote:
> >>
> >>> Hi Zakhar, hello List,
> >>>
> >>> I just wanted to follow up on this and ask a few questions:
> >>>
> >>> Did you notice any downsides with your compression settings so far?
> >>> Do you have all mons now on compression?
> >>> Did release updates go through without issues?
> >>> Do you know if this also works with Reef (we see massive writes there
> >>> as well)?
> >>>
> >>> Can you briefly tabulate the commands you used to persistently set the
> >>> compression options?
> >>>
> >>> Thanks so much,
> >>>
> >>>   Dietmar
> >>>
> >>>
> >>> On 10/18/23 06:14, Zakhar Kirpichenko wrote:
> >>>> Many thanks for this, Eugen! I very much appreciate your and Mykola's
> >>>> efforts and insight!
> >>>>
> >>>> Another thing I noticed was a reduction of the RocksDB store, after
> >>>> reducing the total PG count by 30%, from 590-600 MB:
> >>>>
> >>>> 65M     3675511.sst
> >>>> 65M     3675512.sst
> >>>> 65M     3675513.sst
> >>>> 65M     3675514.sst
> >>>> 65M     3675515.sst
> >>>> 65M     3675516.sst
> >>>> 65M     3675517.sst
> >>>> 65M     3675518.sst
> >>>> 62M     3675519.sst
> >>>>
> >>>> to about half of the original size:
> >>>>
> >>>> -rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
> >>>> -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
> >>>> -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
> >>>> -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
> >>>>
> >>>> Then, when I restarted the monitors one by one before adding
> >>>> compression, the RocksDB store shrank even further. I am not sure why,
> >>>> or what exactly got automatically removed from the store:
> >>>>
> >>>> -rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
> >>>> -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
> >>>> -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
> >>>>
> >>>> I then enabled LZ4 and LZ4HC compression in our small production
> >>>> cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors:
> >>>> compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
> >>>> I specifically went for LZ4 and LZ4HC because of the balance between
> >>>> compression/decompression speed and impact on CPU usage. The
> >>>> compression doesn't seem to affect the cluster in any negative way;
> >>>> the 3 monitors with compression are operating normally. The effect of
> >>>> the compression on RocksDB store size and disk writes is quite
> >>>> noticeable:
> >>>>
> >>>> Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB
> >>>> of writes over 5 minutes:
> >>>>
> >>>> -rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
> >>>> -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
> >>>> -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
> >>>>
> >>>> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> >>>> 155M    /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
> >>>> 2471602 be/4 167           6.05 M    473.24 M  0.00 %  0.16 %
> ceph-mon -n
> >>>> mon.ceph04 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>> 2471633 be/4 167         188.00 K     40.91 M  0.00 %  0.02 %
> ceph-mon -n
> >>>> mon.ceph04 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> >>>> 2471603 be/4 167          16.00 K     24.16 M  0.00 %  0.01 %
> ceph-mon -n
> >>>> mon.ceph04 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
> >>>>
> >>>> Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB
> >>>> of writes over 5 minutes:
> >>>>
> >>>> -rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
> >>>> -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
> >>>>
> >>>> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
> >>>> 60M     /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
> >>>> 2052031 be/4 167        1040.00 K     83.48 M  0.00 %  0.01 %
> ceph-mon -n
> >>>> mon.ceph03 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>> 2052062 be/4 167           0.00 B     40.79 M  0.00 %  0.01 %
> ceph-mon -n
> >>>> mon.ceph03 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [ms_dispatch]
> >>>> 2052032 be/4 167          16.00 K      4.68 M  0.00 %  0.00 %
> ceph-mon -n
> >>>> mon.ceph03 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
> >>>> 2052052 be/4 167          44.00 K      0.00 B  0.00 %  0.00 %
> ceph-mon -n
> >>>> mon.ceph03 -f --setuser ceph --setgroup ceph
> --default-log-to-file=false
> >>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>   --default-mon-cluster-log-to-file=false
> >>>> --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
> >>>>
> >>>> I haven't noticed a major CPU impact. Unfortunately I didn't
> >>>> specifically measure CPU time for the monitors, but overall the CPU
> >>>> impact of monitor store compression on our systems isn't noticeable.
> >>>> This may be different for larger clusters with larger RocksDB
> >>>> datasets; perhaps compression=kLZ4Compression could be enabled by
> >>>> default and bottommost_compression=kLZ4HCCompression left optional,
> >>>> which in theory should result in lower but much faster compression.
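> >>>>
> >>>> (If anyone wants to put a number on the CPU cost, sampling the
> >>>> ceph-mon process on a compressed and an uncompressed monitor with
> >>>> something like
> >>>>
> >>>>   pidstat -u -p "$(pgrep -ox ceph-mon)" 60 5
> >>>>
> >>>> should do it, assuming sysstat is installed on the host. I haven't
> >>>> collected such numbers myself, so treat this as a suggestion only.)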
> >>>>
> >>>> I hope this helps. My plan is to keep the monitors with the current
> >>>> settings, i.e. 3 with compression + 2 without compression, until the
> next
> >>>> minor release of Pacific to see whether the monitors with compressed
> >>>> RocksDB store can be upgraded without issues.
> >>>>
> >>>> /Z
> >>>>
> >>>>
> >>>> On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:
> >>>>
> >>>>> Hi Zakhar,
> >>>>>
> >>>>> I took a closer look into what the MONs really do (again with
> Mykola's
> >>>>> help) and why manual compaction is triggered so frequently. With
> >>>>> debug_paxos=20 I noticed that paxosservice and paxos triggered manual
> >>>>> compactions. So I played with these values:
> >>>>>
> >>>>> paxos_service_trim_max = 1000 (default 500)
> >>>>> paxos_service_trim_min = 500 (default 250)
> >>>>> paxos_trim_max = 1000 (default 500)
> >>>>> paxos_trim_min = 500 (default 250)
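> >>>>>
> >>>>> (For reference, the corresponding config commands would be along the
> >>>>> lines of:
> >>>>>
> >>>>>   ceph config set mon paxos_service_trim_max 1000
> >>>>>   ceph config set mon paxos_service_trim_min 500
> >>>>>   ceph config set mon paxos_trim_max 1000
> >>>>>   ceph config set mon paxos_trim_min 500
> >>>>>
> >>>>> Whether a mon restart is needed for them to take effect may depend on
> >>>>> the release, so restarting the mons one by one afterwards is the safe
> >>>>> option.)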
> >>>>>
> >>>>> This reduced the amount of writes by a factor of 3 or 4; the iotop
> >>>>> values fluctuate a bit, of course. As Mykola suggested, I created a
> >>>>> tracker issue [1] to increase the default values, since they don't
> >>>>> seem suitable for a production environment. Although I haven't
> >>>>> tested this in production yet, I'll ask one of our customers to do
> >>>>> that in their secondary cluster (for rbd mirroring), where they also
> >>>>> suffer from large mon stores and heavy writes to the mon store. Your
> >>>>> findings with the compaction were quite helpful; we'll test that as
> >>>>> well. Igor mentioned that the default bluestore_rocksdb config for
> >>>>> OSDs will enable compression because of positive test results. If we
> >>>>> can confirm that compression works well for MONs too, it could be
> >>>>> enabled by default as well.
> >>>>>
> >>>>> Regards,
> >>>>> Eugen
> >>>>>
> >>>>> https://tracker.ceph.com/issues/63229
> >>>>>
> >>>>> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>>>>
> >>>>>> With the help of community members, I managed to enable RocksDB
> >>>>> compression
> >>>>>> for a test monitor, and it seems to be working well.
> >>>>>>
> >>>>>> Monitor w/o compression writes about 750 MB to disk in 5 minutes:
> >>>>>>
> >>>>>>     4854 be/4 167           4.97 M    755.02 M  0.00 %  0.24 %
> >>> ceph-mon -n
> >>>>>> mon.ceph04 -f --setuser ceph --setgroup ceph
> >>> --default-log-to-file=false
> >>>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>>>   --default-mon-cluster-log-to-file=false
> >>>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>>>>
> >>>>>> Monitor with LZ4 compression writes about 1/4 of that over the same
> >>> time
> >>>>>> period:
> >>>>>>
> >>>>>> 2034728 be/4 167         172.00 K    199.27 M  0.00 %  0.06 %
> ceph-mon
> >>> -n
> >>>>>> mon.ceph05 -f --setuser ceph --setgroup ceph
> >>> --default-log-to-file=false
> >>>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >>>>>>   --default-mon-cluster-log-to-file=false
> >>>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >>>>>>
> >>>>>> This corresponds to the apparent difference in store.db sizes.
> >>>>>>
> >>>>>> Mon store.db w/o compression:
> >>>>>>
> >>>>>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
> >>>>>> total 257196
> >>>>>> drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> >>>>>> drwx------ 3 167 167     4096 Aug 31 05:22 ..
> >>>>>> -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
> >>>>>> -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
> >>>>>> -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
> >>>>>> -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
> >>>>>>
> >>>>>> Mon store.db with compression:
> >>>>>>
> >>>>>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
> >>>>>> total 91188
> >>>>>> drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> >>>>>> drwx------ 3 167 167     4096 Oct 16 13:35 ..
> >>>>>> -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
> >>>>>> -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
> >>>>>>
> >>>>>> There are no apparent downsides thus far. If everything works well,
> I
> >>>>> will
> >>>>>> try adding compression to other monitors.
> >>>>>>
> >>>>>> /Z
> >>>>>>
> >>>>>> On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> >>>>> wrote:
> >>>>>>
> >>>>>>> The issue persists, although to a lesser extent. Any comments from
> the
> >>>>>>> Ceph team please?
> >>>>>>>
> >>>>>>> /Z
> >>>>>>>
> >>>>>>> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx
> >
> >>>>> wrote:
> >>>>>>>
> >>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
> >>>>>>>>
> >>>>>>>> Please point me to relevant Ceph documentation, i.e. a
> >>>>>>>> description of how various Ceph monitor and RocksDB tunables
> >>>>>>>> affect the operation of monitors, and I'll gladly look into it.
> >>>>>>>>
> >>>>>>>>> Please point me to such recommendations, if they're on
> >>> docs.ceph.com
> >>>>> I'll
> >>>>>>>> get them updated.
> >>>>>>>>
> >>>>>>>> These are the recommendations we used when we built our Pacific
> >>>>>>>> cluster:
> >>>>>>>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
> >>>>>>>>
> >>>>>>>> Our drives are 4x larger than recommended by this guide. The
> >>>>>>>> drives are rated for < 0.5 DWPD, which is more than sufficient for
> >>>>>>>> boot drives and storage of rarely modified files. It is not
> >>>>>>>> documented or suggested anywhere that monitor processes write
> >>>>>>>> several hundred gigabytes of data per day, exceeding the amount of
> >>>>>>>> data written by OSDs. This is why I am not convinced that what
> >>>>>>>> we're observing is expected behavior, but it's not easy to get a
> >>>>>>>> definitive answer from the Ceph community.
> >>>>>>>>
> >>>>>>>> /Z
> >>>>>>>>
> >>>>>>>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <
> >>> anthony.datri@xxxxxxxxx>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
> >>>>>>>>>
> >>>>>>>>> but their specs exceed Ceph hardware recommendations by a good
> >>> margin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Please point me to such recommendations, if they're on
> >>> docs.ceph.com
> >>>>> I'll
> >>>>>>>>> get them updated.
> >>>>>>>>>
> >>>>>>>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Thank you, Anthony. As I explained to you earlier, the article
> >>>>>>>>> you had sent is about RocksDB tuning for Bluestore OSDs, while
> >>>>>>>>> the issue at hand is not with OSDs but rather with monitors and
> >>>>>>>>> their RocksDB store. Indeed, the drives are not enterprise-grade,
> >>>>>>>>> but their specs exceed the Ceph hardware recommendations by a
> >>>>>>>>> good margin; they're being used as boot drives only and aren't
> >>>>>>>>> supposed to be written to continuously at high rates, which is
> >>>>>>>>> unfortunately what is happening. I am trying to determine why it
> >>>>>>>>> is happening and how the issue can be alleviated or resolved;
> >>>>>>>>> unfortunately, monitor RocksDB usage and tunables appear not to
> >>>>>>>>> be documented at all.
> >>>>>>>>>
> >>>>>>>>> /Z
> >>>>>>>>>
> >>>>>>>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <
> >>> anthony.datri@xxxxxxxxx
> >>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> cf. Mark's article I sent you re RocksDB tuning.  I suspect that
> >>> with
> >>>>>>>>>> Reef you would experience fewer writes.  Universal compaction
> might
> >>>>> also
> >>>>>>>>>> help, but in the end this SSD is a client SKU and really not
> suited
> >>>>> for
> >>>>>>>>>> enterprise use.  If you had the 1TB SKU you'd get much longer
> >>>>>>>>>> life, or you
> >>>>>>>>>> could change the overprovisioning on the ones you have.
> >>>>>>>>>>
> >>>>>>>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx
> >
> >>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I would very much appreciate it if someone with a better
> >>>>> understanding
> >>>>>>>>>> of
> >>>>>>>>>> monitor internals and use of RocksDB could please chip in.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>>
> >>>
> >>> --
> >>> _________________________________________________________
> >>> D i e t m a r  R i e d e r
> >>> Innsbruck Medical University
> >>> Biocenter - Institute of Bioinformatics
> >>> Innrain 80, 6020 Innsbruck
> >>> Phone: +43 512 9003 71402 | Mobile: +43 676 8716 72402
> >>> Email: dietmar.rieder@xxxxxxxxxxx
> >>> Web:   http://www.icbi.at
> >>>
> >>>
> >>>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
