Re: Ceph 16.2.x mon compactions, disk writes

Many thanks for this, Eugen! I very much appreciate your and Mykola's
efforts and insight!

Another thing I noticed was that the RocksDB store shrank after the total
number of PGs was reduced by 30%, from 590-600 MB:

65M     3675511.sst
65M     3675512.sst
65M     3675513.sst
65M     3675514.sst
65M     3675515.sst
65M     3675516.sst
65M     3675517.sst
65M     3675518.sst
62M     3675519.sst

to about half of the original size:

-rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
-rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
-rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
-rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst

Then, when I restarted the monitors one by one before adding compression,
the RocksDB store shrank even further. I am not sure why, or what exactly
was automatically removed from the store:

-rw-r--r-- 1 167 167   841960 Oct 18 03:31 018779.log
-rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst
-rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst
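
As a quick check of the two byte-exact listings above, the sketch below sums the .sst sizes before and after the restart. It is only arithmetic on the numbers quoted in this message; the variable names are illustrative, and the .log files are deliberately left out.

```python
# Sizes in bytes of the .sst files listed before and after the monitor restart.
before_restart = [67250650, 67367527, 63268486]
after_restart = [67290532, 53287626]

MIB = 1024 * 1024

before_total = sum(before_restart)
after_total = sum(after_restart)

# Fraction of the sst data that disappeared across the restart.
shrink = 1 - after_total / before_total

print(f"before restart: {before_total / MIB:.1f} MiB")
print(f"after restart:  {after_total / MIB:.1f} MiB")
print(f"reduction:      {shrink:.0%}")
```

By this count the sst files went from roughly 189 MiB to roughly 115 MiB, a reduction of about 39%.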

Then I enabled LZ4 and LZ4HC compression in our small production
cluster (6 nodes, 96 OSDs) on 3 out of 5
monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression.
I specifically went for LZ4 and LZ4HC because of the balance between
compression/decompression speed and impact on CPU usage. The compression
doesn't seem to affect the cluster in any negative way; the 3 monitors with
compression are operating normally. The effect of the compression on
RocksDB store size and disk writes is quite noticeable:

Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB
writes over 5 minutes:

-rw-r--r-- 1 167 167  4227337 Oct 18 03:58 3080868.log
-rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst
-rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst

# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
155M    /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/
2471602 be/4 167           6.05 M    473.24 M  0.00 %  0.16 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2471633 be/4 167         188.00 K     40.91 M  0.00 %  0.02 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [ms_dispatch]
2471603 be/4 167          16.00 K     24.16 M  0.00 %  0.01 % ceph-mon -n
mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:high0]

Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of
writes over 5 minutes:

-rw-r--r-- 1 167 167  5766659 Oct 18 03:56 3723355.log
-rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst

# du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon
60M     /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/
2052031 be/4 167        1040.00 K     83.48 M  0.00 %  0.01 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
2052062 be/4 167           0.00 B     40.79 M  0.00 %  0.01 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [ms_dispatch]
2052032 be/4 167          16.00 K      4.68 M  0.00 %  0.00 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:high0]
2052052 be/4 167          44.00 K      0.00 B  0.00 %  0.00 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [msgr-worker-0]
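
The "~530 MB" and "~130 MB" figures are the per-thread iotop write totals added up. A minimal sketch of that sum follows, assuming each iotop entry is reduced to its first line plus the thread name (the long ceph-mon command line in between is dropped). The regular expression, the unit table and the names written_mb and SAMPLES are illustrative and not part of any tool.

```python
import re

# iotop batch columns: TID, PRIO, USER, DISK READ <unit>, DISK WRITE <unit>, ...
ENTRY = re.compile(
    r"^\s*\d+\s+\S+\s+\S+\s+[\d.]+\s+[BKMG]\s+([\d.]+)\s+([BKMG])\s"
)

# Convert the unit letter printed by iotop into megabytes as iotop counts them.
UNIT_TO_MB = {"B": 1 / (1024 * 1024), "K": 1 / 1024, "M": 1.0, "G": 1024.0}

SAMPLES = {
    "mon.ceph04 (compression disabled)": [
        "2471602 be/4 167           6.05 M    473.24 M  0.00 %  0.16 % [rocksdb:low0]",
        "2471633 be/4 167         188.00 K     40.91 M  0.00 %  0.02 % [ms_dispatch]",
        "2471603 be/4 167          16.00 K     24.16 M  0.00 %  0.01 % [rocksdb:high0]",
    ],
    "mon.ceph03 (compression enabled)": [
        "2052031 be/4 167        1040.00 K     83.48 M  0.00 %  0.01 % [rocksdb:low0]",
        "2052062 be/4 167           0.00 B     40.79 M  0.00 %  0.01 % [ms_dispatch]",
        "2052032 be/4 167          16.00 K      4.68 M  0.00 %  0.00 % [rocksdb:high0]",
        "2052052 be/4 167          44.00 K      0.00 B  0.00 %  0.00 % [msgr-worker-0]",
    ],
}


def written_mb(line):
    """Return the DISK WRITE column of one iotop entry, in megabytes."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"not an iotop entry: {line!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_MB[unit]


totals = {
    name: sum(written_mb(line) for line in lines)
    for name, lines in SAMPLES.items()
}

for name, total in totals.items():
    print(f"{name}: {total:.2f} MB written in 5 minutes")

ratio = (
    totals["mon.ceph04 (compression disabled)"]
    / totals["mon.ceph03 (compression enabled)"]
)
print(f"write reduction factor: {ratio:.1f}x")
```

Summed this way, the uncompressed monitor wrote about 538 MB and the compressed one about 129 MB over the same 5 minutes, a factor of roughly 4.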

I haven't noticed a major CPU impact. Unfortunately I didn't specifically
measure CPU time for the monitors, but overall the CPU impact of monitor
store compression on our systems isn't noticeable. This may be different
for larger clusters with larger RocksDB datasets. In that case, perhaps
compression=kLZ4Compression could be enabled by default and
bottommost_compression=kLZ4HCCompression could be optional; in theory this
should result in a lower compression ratio but much faster compression.
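
To illustrate how the two settings combine into a single comma-separated RocksDB option string, here is a minimal sketch. The base string is an assumption and is not taken from this thread (it should resemble the stock mon_rocksdb_options value on Pacific, but check your own cluster), and the helper names parse_options and format_options are hypothetical. Only the two compression values come from this message.

```python
def parse_options(option_string):
    """Split a comma-separated 'key=value' string into a dict (order kept)."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def format_options(options):
    """Join a dict back into the comma-separated 'key=value' form."""
    return ",".join(f"{key}={value}" for key, value in options.items())


# ASSUMPTION: starting value, not confirmed in this thread.
base = (
    "write_buffer_size=33554432,"
    "compression=kNoCompression,"
    "level_compaction_dynamic_level_bytes=true"
)

# The two settings described above.
overrides = {
    "compression": "kLZ4Compression",
    "bottommost_compression": "kLZ4HCCompression",
}

merged = parse_options(base)
merged.update(overrides)  # replaces 'compression', appends the new key
result = format_options(merged)
print(result)
```

Leaving bottommost_compression out of overrides gives the LZ4-only variant mentioned above.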

I hope this helps. My plan is to keep the monitors with the current
settings, i.e. 3 with compression + 2 without compression, until the next
minor release of Pacific to see whether the monitors with compressed
RocksDB store can be upgraded without issues.

/Z


On Tue, 17 Oct 2023 at 23:45, Eugen Block <eblock@xxxxxx> wrote:

> Hi Zakhar,
>
> I took a closer look into what the MONs really do (again with Mykola's
> help) and why manual compaction is triggered so frequently. With
> debug_paxos=20 I noticed that paxosservice and paxos triggered manual
> compactions. So I played with these values:
>
> paxos_service_trim_max = 1000 (default 500)
> paxos_service_trim_min = 500 (default 250)
> paxos_trim_max = 1000 (default 500)
> paxos_trim_min = 500 (default 250)
>
> This reduced the amount of writes by a factor of 3 or 4, the iotop
> values are fluctuating a bit, of course. As Mykola suggested I created
> a tracker issue [1] to increase the default values since they don't
> seem suitable for a production environment. Although I haven't
> tested that in production yet, I'll ask one of our customers to do that
> in their secondary cluster (for rbd mirroring) where they also suffer
> from large mon stores and heavy writes to the mon store. Your findings
> with the compression were quite helpful as well; we'll test that too.
> Igor mentioned that the default bluestore_rocksdb config for OSDs will
> enable compression because of positive test results. If we can confirm
> that compression works well for MONs too, compression could be enabled
> by default as well.
>
> Regards,
> Eugen
>
> [1] https://tracker.ceph.com/issues/63229
>
> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > With the help of community members, I managed to enable RocksDB
> > compression for a test monitor, and it seems to be working well.
> >
> > Monitor w/o compression writes about 750 MB to disk in 5 minutes:
> >
> >    4854 be/4 167           4.97 M    755.02 M  0.00 %  0.24 % ceph-mon -n
> > mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false
> > --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >  --default-mon-cluster-log-to-file=false
> > --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >
> > Monitor with LZ4 compression writes about 1/4 of that over the same time
> > period:
> >
> > 2034728 be/4 167         172.00 K    199.27 M  0.00 %  0.06 % ceph-mon -n
> > mon.ceph05 -f --setuser ceph --setgroup ceph --default-log-to-file=false
> > --default-log-to-stderr=true --default-log-stderr-prefix=debug
> >  --default-mon-cluster-log-to-file=false
> > --default-mon-cluster-log-to-stderr=true [rocksdb:low0]
> >
> > This is caused by the apparent difference in store.db sizes.
> >
> > Mon store.db w/o compression:
> >
> > # ls -al
> > /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db
> > total 257196
> > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > drwx------ 3 167 167     4096 Aug 31 05:22 ..
> > -rw-r--r-- 1 167 167  1517623 Oct 16 14:00 3073035.log
> > -rw-r--r-- 1 167 167 67285944 Oct 16 14:00 3073037.sst
> > -rw-r--r-- 1 167 167 67402325 Oct 16 14:00 3073038.sst
> > -rw-r--r-- 1 167 167 62364991 Oct 16 14:00 3073039.sst
> >
> > Mon store.db with compression:
> >
> > # ls -al
> > /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph05/store.db
> > total 91188
> > drwxr-xr-x 2 167 167     4096 Oct 16 14:00 .
> > drwx------ 3 167 167     4096 Oct 16 13:35 ..
> > -rw-r--r-- 1 167 167  1760114 Oct 16 14:00 012693.log
> > -rw-r--r-- 1 167 167 52236087 Oct 16 14:00 012695.sst
> >
> > There are no apparent downsides thus far. If everything works well, I
> > will try adding compression to other monitors.
> >
> > /Z
> >
> > On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >
> >> The issue persists, although to a lesser extent. Any comments from the
> >> Ceph team please?
> >>
> >> /Z
> >>
> >> On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>
> >>> > Some of it is transferable to RocksDB on mons nonetheless.
> >>>
> >>> Please point me to relevant Ceph documentation, i.e. a description of
> >>> how various Ceph monitor and RocksDB tunables affect the operations of
> >>> monitors, and I'll gladly look into it.
> >>>
> >>> > Please point me to such recommendations, if they're on docs.ceph.com
> >>> > I'll get them updated.
> >>>
> >>> These are the recommendations we used when we built our Pacific cluster:
> >>> https://docs.ceph.com/en/pacific/start/hardware-recommendations/
> >>>
> >>> Our drives are 4 times larger than recommended by this guide. The
> >>> drives are rated for < 0.5 DWPD, which is more than sufficient for boot
> >>> drives and storage of rarely modified files. It is not documented or
> >>> suggested anywhere that monitor processes write several hundred
> >>> gigabytes of data per day, exceeding the amount of data written by
> >>> OSDs. This is why I am not convinced that what we're observing is
> >>> expected behavior, but it's not easy to get a definitive answer from
> >>> the Ceph community.
> >>>
> >>> /Z
> >>>
> >>> On Fri, 13 Oct 2023 at 20:35, Anthony D'Atri <anthony.datri@xxxxxxxxx>
> >>> wrote:
> >>>
> >>>> Some of it is transferable to RocksDB on mons nonetheless.
> >>>>
> >>>> > but their specs exceed Ceph hardware recommendations by a good margin
> >>>>
> >>>> Please point me to such recommendations, if they're on docs.ceph.com
> >>>> I'll get them updated.
> >>>>
> >>>> On Oct 13, 2023, at 13:34, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>
> >>>> Thank you, Anthony. As I explained to you earlier, the article you had
> >>>> sent is about RocksDB tuning for Bluestore OSDs, while the issue at
> >>>> hand is not with OSDs but rather monitors and their RocksDB store.
> >>>> Indeed, the drives are not enterprise-grade, but their specs exceed
> >>>> Ceph hardware recommendations by a good margin. They're being used as
> >>>> boot drives only and aren't supposed to be written to continuously at
> >>>> high rates - which is what unfortunately is happening. I am trying to
> >>>> determine why it is happening and how the issue can be alleviated or
> >>>> resolved; unfortunately, monitor RocksDB usage and tunables appear
> >>>> not to be documented at all.
> >>>>
> >>>> /Z
> >>>>
> >>>> On Fri, 13 Oct 2023 at 20:11, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> >>>>
> >>>>> cf. Mark's article I sent you re RocksDB tuning.  I suspect that with
> >>>>> Reef you would experience fewer writes.  Universal compaction might
> >>>>> also help, but in the end this SSD is a client SKU and really not
> >>>>> suited for enterprise use.  If you had the 1TB SKU you'd get much
> >>>>> longer life, or you could change the overprovisioning on the ones
> >>>>> you have.
> >>>>>
> >>>>> On Oct 13, 2023, at 12:30, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >>>>>
> >>>>> I would very much appreciate it if someone with a better
> >>>>> understanding of monitor internals and use of RocksDB could please
> >>>>> chip in.
> >>>>>
> >>>>>
> >>>>>
> >>>>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx


