Sorry, I meant extra-entrypoint-arguments:
https://www.spinics.net/lists/ceph-users/msg79251.html
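
Roughly, that boils down to a mon service spec with extra_entrypoint_args,
something along these lines (a sketch only, not verified here; the RocksDB
option string is just the compression settings discussed below appended to
the defaults):

  service_type: mon
  service_name: mon
  placement:
    count: 5   # illustrative, adjust to your cluster
  extra_entrypoint_args:
    - "--mon-rocksdb-options=write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression,level_compaction_dynamic_level_bytes=true"

Applied with "ceph orch apply -i mon.yaml", the mons should be redeployed
with that argument, so the setting survives future redeployments.
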
Quoting Eugen Block <eblock@xxxxxx>:
> You can use the extra container arguments I pointed out a few months
> ago. Those work in my test clusters, although I haven’t enabled that
> in production yet. But it shouldn’t make a difference if it’s a test
> cluster or not. 😉
>
> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
>> Hi,
>>
>>> Did you notice any downsides with your compression settings so far?
>>
>> None, at least on our systems, except that I haven't found a way to
>> make the settings persist.
>>
>>> Do you have all mons now on compression?
>>
>> I have 3 out of 5 monitors with compression and 2 without it. The 2
>> monitors with uncompressed RocksDB have much larger disks which do not
>> suffer from writes as much as the other 3. I keep them uncompressed "just
>> in case", i.e. for the unlikely event that the 3 monitors with compressed
>> RocksDB fail or have any issues specifically because of the compression. I
>> have to say that this hasn't happened yet, and this precaution may be
>> unnecessary.
>>
>>> Did release updates go through without issues?
>>
>> In our case, container updates overwrite the monitors' configurations and
>> reset RocksDB options, thus each updated monitor runs with no RocksDB
>> compression until it is added back manually. Other than that, I have not
>> encountered any issues related to compression during the updates.
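>>
>> For reference, the manual re-add is essentially just putting
>> mon_rocksdb_options back into the monitor's configuration and restarting
>> it. Roughly (simplified; exact file locations depend on the deployment,
>> and the non-compression options are just the defaults as far as I know):
>>
>> [mon]
>>     mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression,level_compaction_dynamic_level_bytes=true
>>
>> followed by a restart of the monitor so RocksDB is reopened with the new
>> options (on a cephadm host something like "systemctl restart
>> ceph-<fsid>@mon.<hostname>.service").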
>>
>>> Do you know if this also works with Reef (we see massive writes there as
>>> well)?
>>
>> Unfortunately, I can't comment on Reef as we're still using Pacific.
>>
>> /Z
>>
>> On Tue, 16 Apr 2024 at 18:08, Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx>
>> wrote:
>>
>>> Hi Zakhar, hello List,
>>>
>>> I just wanted to follow up on this and ask a few questions:
>>>
>>> Did you notice any downsides with your compression settings so far?
>>> Do you have all mons now on compression?
>>> Did release updates go through without issues?
>>> Do you know if this also works with Reef (we see massive writes there as
>>> well)?
>>>
>>> Can you briefly tabulate the commands you used to persistently set the
>>> compression options?
>>>
>>> Thanks so much,
>>>
>>> Dietmar
>>>
>>>
>>> On 10/18/23 06:14, Zakhar Kirpichenko wrote:
>>>> Many thanks for this, Eugen! I very much appreciate yours and Mykola's
>>>> efforts and insight!
>>>>
>>>> Another thing I noticed was a reduction of the RocksDB store size after
>>>> reducing the total PG number by 30%, from 590-600 MB:
>>>>
>>>> 65M 3675511.sst
>>>> 65M 3675512.sst
>>>> 65M 3675513.sst
>>>> 65M 3675514.sst
>>>> 65M 3675515.sst
>>>> 65M 3675516.sst
>>>> 65M 3675517.sst
>>>> 65M 3675518.sst
>>>> 62M 3675519.sst
>>>>
>>>> to about half of the original size:
>>>>
>>>> -rw-r--r-- 1 167 167 7218886 Oct 13 16:16 3056869.log
>>>> -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
>>>> -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
>>>> -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
>>>>
>>>> Then when I restarted the monitors one by one before adding compression,
>>>> the RocksDB store shrank even further. I am not sure why and what exactly
>>>> got automatically removed from the store:
>>>>
>>>> -rw-r--r-- 1 167 167 841960 Oct 18 03:31 018779.log
>>>> -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
>>>> -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
>>>>
>>>> Then I enabled LZ4 and LZ4HC compression in our small production
>>>> cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors:
>>>> compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
>>>> I specifically went for LZ4 and LZ4HC because of the balance between
>>>> compression/decompression speed and impact on CPU usage. The compression
>>>> doesn't seem to affect the cluster in any negative way; the 3 monitors with
>>>> compression are operating normally. The effect of the compression on
>>>> RocksDB store size and disk writes is quite noticeable:
>>>>
>>>> Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB
>>>> writes over 5 minutes:
>>>>
>>>> -rw-r--r-- 1 167 167 4227337 Oct 18 03:58 3080868.log
>>>> -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
>>>> -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst
>>>>
>>>> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
>>>> 155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
>>>> 2471602 be/4 167 6.05 M 473.24 M 0.00 % 0.16 % ceph-mon -n
>>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>>> 2471633 be/4 167 188.00 K 40.91 M 0.00 % 0.02 % ceph-mon -n
>>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [ms_dispatch]
>>>> 2471603 be/4 167 16.00 K 24.16 M 0.00 % 0.01 % ceph-mon -n
>>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
>>>>
>>>> Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of
>>>> writes over 5 minutes:
>>>>
>>>> -rw-r--r-- 1 167 167 5766659 Oct 18 03:56 3723355.log
>>>> -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst
>>>>
>>>> # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
>>>> 60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
>>>> 2052031 be/4 167 1040.00 K 83.48 M 0.00 % 0.01 % ceph-mon -n
>>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>>> 2052062 be/4 167 0.00 B 40.79 M 0.00 % 0.01 % ceph-mon -n
>>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [ms_dispatch]
>>>> 2052032 be/4 167 16.00 K 4.68 M 0.00 % 0.00 % ceph-mon -n
>>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:high0]
>>>> 2052052 be/4 167 44.00 K 0.00 B 0.00 % 0.00 % ceph-mon -n
>>>> mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>> --default-mon-cluster-log-to-file=false
>>>> --default-mon-cluster-log-to-stderr=true [msgr-worker-0]
>>>>
>>>> I haven't noticed a major CPU impact. Unfortunately I didn't specifically
>>>> measure CPU time for the monitors, but overall the CPU impact of monitor
>>>> store compression on our systems isn't noticeable. This may be different
>>>> for larger clusters with larger RocksDB datasets, so perhaps
>>>> compression=kLZ4Compression could be enabled by default and
>>>> bottommost_compression=kLZ4HCCompression could be optional; in theory
>>>> this should result in lower but much faster compression.
>>>>
>>>> I hope this helps. My plan is to keep the monitors with the current
>>>> settings, i.e. 3 with compression + 2 without compression, until the next
>>>> minor release of Pacific to see whether the monitors with compressed
>>>> RocksDB store can be upgraded without issues.
>>>>
>>>> /Z
>>>>
>>>>
>>>> On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>>> Hi Zakhar,
>>>>>
>>>>> I took a closer look into what the MONs really do (again with Mykola's
>>>>> help) and why manual compaction is triggered so frequently. With
>>>>> debug_paxos=20 I noticed that paxosservice and paxos triggered manual
>>>>> compactions. So I played with these values:
>>>>> paxos_service_trim_max = 1000 (default 500)
>>>>> paxos_service_trim_min = 500 (default 250)
>>>>> paxos_trim_max = 1000 (default 500)
>>>>> paxos_trim_min = 500 (default 250)
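>>>>>
>>>>> For anyone who wants to try the same, these should be settable with the
>>>>> usual config commands, e.g.:
>>>>>
>>>>> ceph config set mon paxos_service_trim_max 1000
>>>>> ceph config set mon paxos_service_trim_min 500
>>>>> ceph config set mon paxos_trim_max 1000
>>>>> ceph config set mon paxos_trim_min 500
>>>>>
>>>>> (I haven't double-checked whether the mons pick these up at runtime or
>>>>> need a restart.)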
>>>>>
>>>>> This reduced the amount of writes by a factor of 3 or 4; the iotop
>>>>> values are fluctuating a bit, of course. As Mykola suggested, I created
>>>>> a tracker issue [1] to increase the default values since they don't
>>>>> seem suitable for a production environment. Although I haven't tested
>>>>> that in production yet, I'll ask one of our customers to do that in
>>>>> their secondary cluster (for rbd mirroring), where they also suffer
>>>>> from large mon stores and heavy writes to the mon store. Your findings
>>>>> with the compaction were quite helpful as well; we'll test that too.
>>>>> Igor mentioned that the default bluestore_rocksdb config for OSDs will
>>>>> enable compression because of positive test results. If we can confirm
>>>>> that compression works well for MONs too, compression could be enabled
>>>>> by default as well.
>>>>>
>>>>> Regards,
>>>>> Eugen
>>>>>
>>>>> [1] https://tracker.ceph.com/issues/63229
>>>>>
>>>>> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>>>>
>>>>>> With the help of community members, I managed to enable RocksDB
>>>>>> compression for a test monitor, and it seems to be working well.
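>>>>>>
>>>>>> A quick way to check what a monitor is actually running with (commands
>>>>>> and paths are from my setup, adjust the mon name/fsid as needed) is to
>>>>>> query the daemon and grep the options file RocksDB writes into the
>>>>>> store directory:
>>>>>>
>>>>>> ceph daemon mon.ceph05 config get mon_rocksdb_options
>>>>>> grep compression /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db/OPTIONS-*
>>>>>>
>>>>>> (the first command needs access to the mon's admin socket, i.e. inside
>>>>>> the mon container on a cephadm deployment).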
>>>>>>
>>>>>> Monitor w/o compression writes about 750 MB to disk in 5 minutes:
>>>>>>
>>>>>> 4854 be/4 167 4.97 M 755.02 M 0.00 % 0.24 % ceph-mon -n
>>>>>> mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>>>> --default-mon-cluster-log-to-file=false
>>>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>>>>>
>>>>>> Monitor with LZ4 compression writes about 1/4 of that over the same
>>>>>> time period:
>>>>>>
>>>>>> 2034728 be/4 167 172.00 K 199.27 M 0.00 % 0.06 % ceph-mon -n
>>>>>> mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
>>>>>> --default-log-to-stderr=true --default-log-stderr-prefix=debug
>>>>>> --default-mon-cluster-log-to-file=false
>>>>>> --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
>>>>>>
>>>>>> This is caused by the apparent difference in store.db sizes.
>>>>>>
>>>>>> Mon store.db w/o compression:
>>>>>>
>>>>>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
>>>>>> total 257196
>>>>>> drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
>>>>>> drwx------ 3 167 167 4096 Aug 31 05:22 ..
>>>>>> -rw-r--r-- 1 167 167 1517623 Oct 16 14:00 3073035.log
>>>>>> -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
>>>>>> -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
>>>>>> -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
>>>>>>
>>>>>> Mon store.db with compression:
>>>>>>
>>>>>> # ls -al /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
>>>>>> total 91188
>>>>>> drwxr-xr-x 2 167 167 4096 Oct 16 14:00 .
>>>>>> drwx------ 3 167 167 4096 Oct 16 13:35 ..
>>>>>> -rw-r--r-- 1 167 167 1760114 Oct 16 14:00 012693.log
>>>>>> -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
>>>>>>
>>>>>> There are no apparent downsides thus far. If everything works well, I
>>>>>> will try adding compression to other monitors.
>>>>>>
>>>>>> /Z
>>>>>>
>>>>>> On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>>>>>> wrote:
>>>>>>
>>>>>>> The issue persists, although to a lesser extent. Any comments from the
>>>>>>> Ceph team please?
>>>>>>>
>>>>>>> /Z
>>>>>>>
>>>>>>> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>>>>>>
>>>>>>>> Please point me to relevant Ceph documentation, i.e. a description of
>>>>>>>> how various Ceph monitor and RocksDB tunables affect the operations of
>>>>>>>> monitors, and I'll gladly look into it.
>>>>>>>>
>>>>>>>>> Please point me to such recommendations; if they're on docs.ceph.com,
>>>>>>>>> I'll get them updated.
>>>>>>>>
>>>>>>>> These are the recommendations we used when we built our Pacific
>>>>>>>> cluster:
>>>>>>>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
>>>>>>>>
>>>>>>>> Our drives are 4x larger than recommended by this guide. The drives
>>>>>>>> are rated for < 0.5 DWPD, which is more than sufficient for boot drives
>>>>>>>> and storage of rarely modified files. It is not documented or suggested
>>>>>>>> anywhere that monitor processes write several hundred gigabytes of data
>>>>>>>> per day, exceeding the amount of data written by OSDs, which is why I am
>>>>>>>> not convinced that what we're observing is expected behavior. But it's
>>>>>>>> not easy to get a definitive answer from the Ceph community.
>>>>>>>>
>>>>>>>> /Z
>>>>>>>>
>>>>>>>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Some of it is transferable to RocksDB on mons nonetheless.
>>>>>>>>>
>>>>>>>>> but their specs exceed Ceph hardware recommendations by a good margin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please point me to such recommendations; if they're on docs.ceph.com,
>>>>>>>>> I'll get them updated.
>>>>>>>>>
>>>>>>>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thank you, Anthony. As I explained to you earlier, the article you had
>>>>>>>>> sent is about RocksDB tuning for Bluestore OSDs, while the issue at hand
>>>>>>>>> is not with OSDs but rather with monitors and their RocksDB store. Indeed,
>>>>>>>>> the drives are not enterprise-grade, but their specs exceed the Ceph
>>>>>>>>> hardware recommendations by a good margin; they're being used as boot
>>>>>>>>> drives only and aren't supposed to be written to continuously at high
>>>>>>>>> rates, which is unfortunately what is happening. I am trying to determine
>>>>>>>>> why it is happening and how the issue can be alleviated or resolved;
>>>>>>>>> unfortunately, monitor RocksDB usage and tunables appear not to be
>>>>>>>>> documented at all.
>>>>>>>>>
>>>>>>>>> /Z
>>>>>>>>>
>>>>>>>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> cf. Mark's article I sent you re RocksDB tuning. I suspect that with
>>>>>>>>>> Reef you would experience fewer writes. Universal compaction might also
>>>>>>>>>> help, but in the end this SSD is a client SKU and really not suited for
>>>>>>>>>> enterprise use. If you had the 1TB SKU you'd get much longer life, or
>>>>>>>>>> you could change the overprovisioning on the ones you have.
>>>>>>>>>>
>>>>>>>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I would very much appreciate it if someone with a better understanding
>>>>>>>>>> of monitor internals and use of RocksDB could please chip in.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>
>>>
>>> --
>>> _________________________________________________________
>>> D i e t m a r R i e d e r
>>> Innsbruck Medical University
>>> Biocenter - Institute of Bioinformatics
>>> Innrain 80, 6020 Innsbruck
>>> Phone: +43 512 9003 71402 | Mobile: +43 676 8716 72402
>>> Email: dietmar.rieder@xxxxxxxxxxx
>>> Web: http://www.icbi.at
>>>
>>>
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx