Re: Ceph 16.2.x mon compactions, disk writes

An interesting find:

I reduced the number of PGs for some of the less utilized pools, which
brought the total number of PGs in the cluster down from 2400 to 1664. The
cluster is healthy and the only change was the 30% reduction in PGs, yet the
mons now have a much smaller store.db, log far fewer "manual compaction"
events and write significantly less data.
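
(For reference, the change itself was just a per-pool pg_num reduction;
the pool name and target count below are placeholders, something along
these lines:)

# ceph osd pool set <pool-name> pg_num 128
# ceph osd pool ls detail | grep pg_num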

The store.db is now less than half its previous size, ~260 MB in 3 .sst files
compared to 590-600 MB in 9 .sst files before the PG changes:

total 258012
drwxr-xr-x 2 167 167     4096 Oct 13 16:15 .
drwx------ 3 167 167     4096 Aug 31 05:22 ..
-rw-r--r-- 1 167 167  7218886 Oct 13 16:16 3056869.log
-rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst
-rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst
-rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst
-rw-r--r-- 1 167 167       17 Sep 18 11:53 CURRENT
-rw-r--r-- 1 167 167       37 Nov  3  2021 IDENTITY
-rw-r--r-- 1 167 167        0 Nov  3  2021 LOCK
-rw-r--r-- 1 167 167 27039408 Oct 13 16:15 MANIFEST-2785821
-rw-r--r-- 1 167 167     5287 Sep  1 04:39 OPTIONS-2710412
-rw-r--r-- 1 167 167     5287 Sep 18 11:53 OPTIONS-2785824
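
(If anyone wants to compare on their own cluster, a simple du against the mon
data directory is enough to track this; the path below is the one visible
inside our mon container, adjust it to wherever your mon data lives:)

# du -sh /var/lib/ceph/mon/ceph-*/store.db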

"Manual compaction" events now run half as often compared to before the
change.

Before the change, compaction events per hour:

# docker logs a4615a23b4c6 2>&1 | grep -i 2023-10-13T10 | grep -ci "manual compaction from"
88

After the change, compaction events per hour:

# docker logs a4615a23b4c6 2>&1 | grep -i 2023-10-13T15 | grep -ci "manual compaction from"
45
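
(To get per-hour counts for a whole day in one go, a variation of the same
grep can be used, roughly along these lines:)

# docker logs a4615a23b4c6 2>&1 | grep -i "manual compaction from" | grep -oE '2023-10-13T[0-9]{2}' | sort | uniq -c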

I ran several iotop measurements; the mons consistently write 550-750 MB to
disk every 5 minutes, compared to 1.5-2.5 GB every 5 minutes before the changes:

   4919 be/4 167           7.29 M    754.04 M  0.00 %  0.17 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
   4919 be/4 167           8.12 M    554.53 M  0.00 %  0.12 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
   4919 be/4 167         532.00 K    750.40 M  0.00 %  0.16 % ceph-mon -n
mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
--default-log-to-stderr=true --default-log-stderr-prefix=debug
 --default-mon-cluster-log-to-file=false
--default-mon-cluster-log-to-stderr=true [rocksdb:low0]
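
(The numbers above are accumulated per-thread write totals over 5-minute
windows, the same kind of snapshot as the accumulating iotop run Frank shows
further down the thread, e.g.:)

# iotop -ao -bn 2 -d 300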

This is a significant reduction in store.db size and the associated disk writes.

I would very much appreciate it if someone with a better understanding of
monitor internals and their use of RocksDB could please chip in.

/Z

On Wed, 11 Oct 2023 at 19:00, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:

> Thank you, Frank. This confirms that monitors indeed do this.
>
> Our boot drives in 3 systems are smaller 1 DWPD drives (RAID1 to protect
> against a random single drive failure), and over 3 years the mons have eaten
> through 60% of their endurance. Other systems have larger boot drives, and
> only 2% of their endurance was used up over 1.5 years.
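>
> (For anyone who wants to check their own boot drives, endurance consumption
> can be read from SMART data; attribute names vary by vendor, but something
> like the following usually works:)
>
> # smartctl -a /dev/sda | grep -iE 'wear|percent'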
>
> It would still be good to get an understanding of why monitors do this, and
> whether there is any way to reduce the amount of writes. Unfortunately, the
> Ceph documentation in this regard is severely lacking.
>
> I'm copying this to ceph-docs; perhaps someone will find it useful and
> adjust the hardware recommendations.
>
> /Z
>
> On Wed, 11 Oct 2023, 18:23 Frank Schilder, <frans@xxxxxx> wrote:
>
>> Oh wow! I never bothered looking, because on our hardware the wear is so
>> low:
>>
>> # iotop -ao -bn 2 -d 300
>> Total DISK READ :       0.00 B/s | Total DISK WRITE :       6.46 M/s
>> Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       6.47 M/s
>>     TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>>    2230 be/4 ceph          0.00 B   1818.71 M  0.00 %  0.46 % ceph-mon
>> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
>> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
>> [rocksdb:low0]
>>    2256 be/4 ceph          0.00 B     19.27 M  0.00 %  0.43 % ceph-mon
>> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
>> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
>> [safe_timer]
>>    2250 be/4 ceph          0.00 B     42.38 M  0.00 %  0.26 % ceph-mon
>> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
>> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
>> [fn_monstore]
>>    2231 be/4 ceph          0.00 B     58.36 M  0.00 %  0.01 % ceph-mon
>> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
>> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
>> [rocksdb:high0]
>>     644 be/3 root          0.00 B    576.00 K  0.00 %  0.00 %
>> [jbd2/sda3-8]
>>    2225 be/4 ceph          0.00 B    128.00 K  0.00 %  0.00 % ceph-mon
>> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
>> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [log]
>> 1637141 be/4 root          0.00 B      0.00 B  0.00 %  0.00 %
>> [kworker/u113:2-flush-8:0]
>> 1636453 be/4 root          0.00 B      0.00 B  0.00 %  0.00 %
>> [kworker/u112:0-ceph0]
>>    1560 be/4 root          0.00 B     20.00 K  0.00 %  0.00 % rsyslogd -n
>> [in:imjournal]
>>    1561 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % rsyslogd -n
>> [rs:main Q:Reg]
>>
>> 1.8 GB every 5 minutes, that's 518 GB per day. The 400G drives we have are
>> rated 10 DWPD, and with the 6-drive RAID10 config this gives plenty of
>> lifetime. I guess this write load will kill any low-grade SSD (typical boot
>> devices, even enterprise ones), especially if they are smaller drives and
>> the controller doesn't reallocate cells according to remaining write
>> endurance.
>>
>> I guess there was a reason for the recommendations by Dell. I always
>> thought that the recent recommendations for MON store storage in the ceph
>> docs are a "bit unrealistic", apparently in both size and performance
>> (including endurance). Well, I guess you need to look for write-intensive
>> drives with decent specs. If you do, also go for sufficient size. This will
>> absorb temporary usage peaks, which can be very large, and also provide
>> extra endurance with SSDs that have good controllers.
>>
>> I also think the recommendations in the ceph docs deserve a reality check.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>> Sent: Wednesday, October 11, 2023 4:30 PM
>> To: Eugen Block
>> Cc: Frank Schilder; ceph-users@xxxxxxx
>> Subject: Re:  Re: Ceph 16.2.x mon compactions, disk writes
>>
>> Eugen,
>>
>> Thanks for your response. May I ask what numbers you're referring to?
>>
>> I am not referring to monitor store.db sizes. I am specifically referring
>> to the writes monitors make to their store.db files by frequently rotating
>> and replacing them with new versions during compactions. The size of the
>> store.db remains more or less the same.
>>
>> This is a 300s iotop snippet, sorted by aggregated disk writes:
>>
>> Total DISK READ:        35.56 M/s | Total DISK WRITE:        23.89 M/s
>> Current DISK READ:      35.64 M/s | Current DISK WRITE:      24.09 M/s
>>     TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
>>    4919 be/4 167          16.75 M      2.24 G  0.00 %  1.34 % ceph-mon -n
>> mon.ceph03 -f --setuser ceph --setgr~lt-mon-cluster-log-to-stderr=true
>> [rocksdb:low0]
>>   15122 be/4 167           0.00 B    652.91 M  0.00 %  0.27 % ceph-osd -n
>> osd.31 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17073 be/4 167           0.00 B    651.86 M  0.00 %  0.27 % ceph-osd -n
>> osd.32 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17268 be/4 167           0.00 B    490.86 M  0.00 %  0.18 % ceph-osd -n
>> osd.25 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   18032 be/4 167           0.00 B    463.57 M  0.00 %  0.17 % ceph-osd -n
>> osd.26 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   16855 be/4 167           0.00 B    402.86 M  0.00 %  0.15 % ceph-osd -n
>> osd.22 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17406 be/4 167           0.00 B    387.03 M  0.00 %  0.14 % ceph-osd -n
>> osd.27 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17932 be/4 167           0.00 B    375.42 M  0.00 %  0.13 % ceph-osd -n
>> osd.29 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   18017 be/4 167           0.00 B    359.38 M  0.00 %  0.13 % ceph-osd -n
>> osd.28 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17420 be/4 167           0.00 B    332.83 M  0.00 %  0.12 % ceph-osd -n
>> osd.23 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17975 be/4 167           0.00 B    312.06 M  0.00 %  0.11 % ceph-osd -n
>> osd.30 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>   17273 be/4 167           0.00 B    303.49 M  0.00 %  0.11 % ceph-osd -n
>> osd.24 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
>> [bstore_kv_sync]
>>
>> Not the best example, because sometimes the mon writes even more
>> intensively, but it is very apparent that thread 4919 of the monitor
>> process is the top disk writer in the system.
>>
>> This is the mon thread producing lots of writes:
>>
>>    4919 167       20   0 2031116   1.1g  10652 S   0.0   0.3 288:48.65
>> rocksdb:low0
>>
>> Then with a combination of lsof and sysdig I determine that the writes
>> are being made to /var/lib/ceph/mon/ceph-ceph03/store.db/*.sst, i.e. the
>> mon's rocksdb store:
>>
>> ceph-mon 4838      167  200r      REG             253,11 67319253
>> 14812899 /var/lib/ceph/mon/ceph-ceph03/store.db/3677146.sst
>> ceph-mon 4838      167  203r      REG             253,11 67228736
>> 14813270 /var/lib/ceph/mon/ceph-ceph03/store.db/3677147.sst
>> ceph-mon 4838      167  205r      REG             253,11 67243212
>> 14813275 /var/lib/ceph/mon/ceph-ceph03/store.db/3677148.sst
>> ceph-mon 4838      167  208r      REG             253,11 67247953
>> 14813316 /var/lib/ceph/mon/ceph-ceph03/store.db/3677149.sst
>> ceph-mon 4838      167  220r      REG             253,11 67261659
>> 14813332 /var/lib/ceph/mon/ceph-ceph03/store.db/3677150.sst
>> ceph-mon 4838      167  221r      REG             253,11 67242500
>> 14813345 /var/lib/ceph/mon/ceph-ceph03/store.db/3677151.sst
>> ceph-mon 4838      167  224r      REG             253,11 67264969
>> 14813348 /var/lib/ceph/mon/ceph-ceph03/store.db/3677152.sst
>> ceph-mon 4838      167  228r      REG             253,11 64346933
>> 14813381 /var/lib/ceph/mon/ceph-ceph03/store.db/3677153.sst
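>>
>> (For illustration, the open-files side of this is just lsof against the mon
>> PID, something along the lines of:)
>>
>> # lsof -p 4838 | grep store.db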
>>
>> By matching iotop and sysdig write records to the mon's log entries, I see
>> that the writes happen during "manual compaction" events - whatever those
>> are, since there is no documentation on them whatsoever. Each time, around
>> 0.56 GB is written to disk to a new set of *.sst files, which is the total
>> size of the store.db. It looks like the monitor periodically just reads its
>> store.db and writes it out to a new set of files, as the file name "numbers"
>> increase with each write:
>>
>> ceph-mon 4838      167  175r      REG             253,11 67220863
>> 14812310 /var/lib/ceph/mon/ceph-ceph03/store.db/3677167.sst
>> ceph-mon 4838      167  200r      REG             253,11 67358627
>> 14812899 /var/lib/ceph/mon/ceph-ceph03/store.db/3677168.sst
>> ceph-mon 4838      167  203r      REG             253,11 67277978
>> 14813270 /var/lib/ceph/mon/ceph-ceph03/store.db/3677169.sst
>> ceph-mon 4838      167  205r      REG             253,11 67256312
>> 14813275 /var/lib/ceph/mon/ceph-ceph03/store.db/3677170.sst
>> ceph-mon 4838      167  208r      REG             253,11 67226761
>> 14813316 /var/lib/ceph/mon/ceph-ceph03/store.db/3677171.sst
>> ceph-mon 4838      167  220r      REG             253,11 67258798
>> 14813332 /var/lib/ceph/mon/ceph-ceph03/store.db/3677172.sst
>> ceph-mon 4838      167  221r      REG             253,11 67224665
>> 14813345 /var/lib/ceph/mon/ceph-ceph03/store.db/3677173.sst
>> ceph-mon 4838      167  224r      REG             253,11 67224123
>> 14813348 /var/lib/ceph/mon/ceph-ceph03/store.db/3677174.sst
>> ceph-mon 4838      167  228r      REG             253,11 62195349
>> 14813381 /var/lib/ceph/mon/ceph-ceph03/store.db/3677175.sst
>>
>> I hope this clears up the situation.
>>
>> Do you observe this behavior in your clusters? Can you please check
>> whether your mons do something similar and whether store.db/*.sst files
>> change often?
>>
>> /Z
>>
>> On Wed, 11 Oct 2023 at 16:22, Eugen Block <eblock@xxxxxx> wrote:
>> That all looks normal to me, to be honest. Can you show some details
>> how you calculate the "hundreds of GB per day"? I see similar stats as
>> Frank on different clusters with different client IO.
>>
>> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>>
>> > Sure, nothing unusual there:
>> >
>> > -------
>> >
>> >   cluster:
>> >     id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
>> >     health: HEALTH_OK
>> >
>> >   services:
>> >     mon: 5 daemons, quorum ceph01,ceph03,ceph04,ceph05,ceph02 (age 2w)
>> >     mgr: ceph01.vankui(active, since 12d), standbys: ceph02.shsinf
>> >     osd: 96 osds: 96 up (since 2w), 95 in (since 3w)
>> >
>> >   data:
>> >     pools:   10 pools, 2400 pgs
>> >     objects: 6.23M objects, 16 TiB
>> >     usage:   61 TiB used, 716 TiB / 777 TiB avail
>> >     pgs:     2396 active+clean
>> >              3    active+clean+scrubbing+deep
>> >              1    active+clean+scrubbing
>> >
>> >   io:
>> >     client:   2.7 GiB/s rd, 27 MiB/s wr, 46.95k op/s rd, 2.17k op/s wr
>> >
>> > -------
>> >
>> > Please disregard the big read number; a customer is running a
>> > read-intensive job. Mon store writes keep happening when the cluster is
>> > much quieter, so I think the intensive reads have no effect on the mons.
>> >
>> > Mgr:
>> >
>> >     "always_on_modules": [
>> >         "balancer",
>> >         "crash",
>> >         "devicehealth",
>> >         "orchestrator",
>> >         "pg_autoscaler",
>> >         "progress",
>> >         "rbd_support",
>> >         "status",
>> >         "telemetry",
>> >         "volumes"
>> >     ],
>> >     "enabled_modules": [
>> >         "cephadm",
>> >         "dashboard",
>> >         "iostat",
>> >         "prometheus",
>> >         "restful"
>> >     ],
>> >
>> > -------
>> >
>> > /Z
>> >
>> >
>> > On Wed, 11 Oct 2023 at 14:50, Eugen Block <eblock@xxxxxx> wrote:
>> >
>> >> Can you add some more details as requested by Frank? Which mgr modules
>> >> are enabled? What's the current 'ceph -s' output?
>> >>
>> >> > Is autoscaler running and doing stuff?
>> >> > Is balancer running and doing stuff?
>> >> > Is backfill going on?
>> >> > Is recovery going on?
>> >> > Is your ceph version affected by the "excessive logging to MON
>> >> > store" issue that was present starting with pacific but should have
>> >> > been addressed
>> >>
>> >>
>> >> > Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>> >>
>> >> > We don't use CephFS at all and don't have RBD snapshots apart from
>> >> > some cloning for Openstack images.
>> >> >
>> >> > The size of mon stores isn't an issue, it's < 600 MB. But it gets
>> >> > overwritten often causing lots of disk writes, and that is an issue
>> >> > for us.
>> >> >
>> >> > /Z
>> >> >
>> >> > On Wed, 11 Oct 2023 at 14:37, Eugen Block <eblock@xxxxxx> wrote:
>> >> >
>> >> >> Do you use many snapshots (rbd or cephfs)? That can cause heavy
>> >> >> monitor usage; we've seen large mon stores on customer clusters with
>> >> >> rbd mirroring on a snapshot basis. In a healthy cluster they have mon
>> >> >> stores of around 2GB in size.
>> >> >>
>> >> >> >> @Eugen: Was there not an option to limit logging to the MON store?
>> >> >>
>> >> >> I don't recall at the moment, worth checking though.
>> >> >>
>> >> >> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>> >> >>
>> >> >> > Thank you, Frank.
>> >> >> >
>> >> >> > The cluster is healthy, operating normally, nothing unusual is
>> going
>> >> on.
>> >> >> We
>> >> >> > observe lots of writes by mon processes into mon rocksdb stores,
>> >> >> > specifically:
>> >> >> >
>> >> >> > /var/lib/ceph/mon/ceph-cephXX/store.db:
>> >> >> > 65M     3675511.sst
>> >> >> > 65M     3675512.sst
>> >> >> > 65M     3675513.sst
>> >> >> > 65M     3675514.sst
>> >> >> > 65M     3675515.sst
>> >> >> > 65M     3675516.sst
>> >> >> > 65M     3675517.sst
>> >> >> > 65M     3675518.sst
>> >> >> > 62M     3675519.sst
>> >> >> >
>> >> >> > The size of the files is not huge, but monitors rotate and write
>> >> >> > out these files often, sometimes several times per minute, resulting
>> >> >> > in lots of data written to disk. The writes coincide with "manual
>> >> >> > compaction" events logged by the monitors, for example:
>> >> >> >
>> >> >> > debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1676] [default] [JOB 70854]
>> Compacting
>> >> 1@5
>> >> >> +
>> >> >> > 9@6 files to L6, score -1.00
>> >> >> > debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700  4 rocksdb:
>> >> EVENT_LOG_v1
>> >> >> > {"time_micros": 1697022610487624, "job": 70854, "event":
>> >> >> > "compaction_started", "compaction_reason": "ManualCompaction",
>> >> >> "files_L5":
>> >> >> > [3675543], "files_L6": [3675533, 3675534, 3675535, 3675536,
>> 3675537,
>> >> >> > 3675538, 3675539, 3675540, 3675541], "score": -1,
>> "input_data_size":
>> >> >> > 601117031}
>> >> >> > debug 2023-10-11T11:10:10.619+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675544: 2015 keys, 67287115 bytes
>> >> >> > debug 2023-10-11T11:10:10.763+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675545: 24343 keys, 67336225 bytes
>> >> >> > debug 2023-10-11T11:10:10.899+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675546: 1196 keys, 67225813 bytes
>> >> >> > debug 2023-10-11T11:10:11.035+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675547: 1049 keys, 67252678 bytes
>> >> >> > debug 2023-10-11T11:10:11.167+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675548: 1081 keys, 67216638 bytes
>> >> >> > debug 2023-10-11T11:10:11.303+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675549: 1196 keys, 67245376 bytes
>> >> >> > debug 2023-10-11T11:10:12.023+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675550: 1195 keys, 67246813 bytes
>> >> >> > debug 2023-10-11T11:10:13.059+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675551: 1205 keys, 67223302 bytes
>> >> >> > debug 2023-10-11T11:10:13.903+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1349] [default] [JOB 70854]
>> Generated
>> >> table
>> >> >> > #3675552: 1312 keys, 56416011 bytes
>> >> >> > debug 2023-10-11T11:10:13.911+0000 7f48a3a9b700  4 rocksdb:
>> >> >> > [compaction/compaction_job.cc:1415] [default] [JOB 70854]
>> Compacted
>> >> 1@5
>> >> >> +
>> >> >> > 9@6 files to L6 => 594449971 bytes
>> >> >> > debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700  4 rocksdb:
>> (Original
>> >> Log
>> >> >> > Time 2023/10/11-11:10:13.920991)
>> [compaction/compaction_job.cc:760]
>> >> >> > [default] compacted to: base level 5 level multiplier 10.00 max
>> bytes
>> >> >> base
>> >> >> > 268435456 files[0 0 0 0 0 0 9] max score 0.00, MB/sec: 175.8 rd,
>> 173.9
>> >> >> wr,
>> >> >> > level 6, files in(1, 9) out(9) MB in(0.3, 572.9) out(566.9),
>> >> >> > read-write-amplify(3434.6) write-amplify(1707.7) OK, records in:
>> >> 35108,
>> >> >> > records dropped: 516 output_compression: NoCompression
>> >> >> > debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700  4 rocksdb:
>> (Original
>> >> Log
>> >> >> > Time 2023/10/11-11:10:13.921010) EVENT_LOG_v1 {"time_micros":
>> >> >> > 1697022613921002, "job": 70854, "event": "compaction_finished",
>> >> >> > "compaction_time_micros": 3418822, "compaction_time_cpu_micros":
>> >> 785454,
>> >> >> > "output_level": 6, "num_output_files": 9, "total_output_size":
>> >> 594449971,
>> >> >> > "num_input_records": 35108, "num_output_records": 34592,
>> >> >> > "num_subcompactions": 1, "output_compression": "NoCompression",
>> >> >> > "num_single_delete_mismatches": 0,
>> "num_single_delete_fallthrough": 0,
>> >> >> > "lsm_state": [0, 0, 0, 0, 0, 0, 9]}
>> >> >> >
>> >> >> > The log even mentions the huge write amplification. I wonder
>> >> >> > whether this is normal and what can be done about it.
>> >> >> >
>> >> >> > /Z
>> >> >> >
>> >> >> > On Wed, 11 Oct 2023 at 13:55, Frank Schilder <frans@xxxxxx> wrote:
>> >> >> >
>> >> >> >> I need to ask here: where exactly do you observe the hundreds of
>> >> >> >> GB written per day? Are the mon logs huge? Is it the mon store? Is
>> >> >> >> your cluster unhealthy?
>> >> >> >>
>> >> >> >> We have an octopus cluster with 1282 OSDs, 1650 ceph fs clients
>> >> >> >> and about 800 librbd clients. Per week our mon logs are about 70M,
>> >> >> >> the cluster logs about 120M, the audit logs about 70M, and I see
>> >> >> >> between 100-200Kb/s of writes to the mon store. That's in the low
>> >> >> >> single-digit GB range per day. Hundreds of GB per day sounds
>> >> >> >> completely over the top on a healthy cluster, unless you have MGR
>> >> >> >> modules changing the OSD/cluster map continuously.
>> >> >> >>
>> >> >> >> Is autoscaler running and doing stuff?
>> >> >> >> Is balancer running and doing stuff?
>> >> >> >> Is backfill going on?
>> >> >> >> Is recovery going on?
>> >> >> >> Is your ceph version affected by the "excessive logging to MON
>> >> >> >> store" issue that was present starting with pacific but should
>> >> >> >> have been addressed by now?
>> >> >> >>
>> >> >> >> @Eugen: Was there not an option to limit logging to the MON
>> store?
>> >> >> >>
>> >> >> >> For information to readers, we followed old recommendations from
>> >> >> >> a Dell white paper for building a ceph cluster and have a 1TB
>> >> >> >> RAID10 array on 6x write-intensive SSDs for the MON stores. After
>> >> >> >> 5 years we are below 10% wear. The average size of the MON store
>> >> >> >> for a healthy cluster is 500M-1G, but we have seen it balloon to
>> >> >> >> 100+GB in degraded conditions.
>> >> >> >>
>> >> >> >> Best regards,
>> >> >> >> =================
>> >> >> >> Frank Schilder
>> >> >> >> AIT Risø Campus
>> >> >> >> Bygning 109, rum S14
>> >> >> >>
>> >> >> >> ________________________________________
>> >> >> >> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
>> >> >> >> Sent: Wednesday, October 11, 2023 12:00 PM
>> >> >> >> To: Eugen Block
>> >> >> >> Cc: ceph-users@xxxxxxx
>> >> >> >> Subject: Re: Ceph 16.2.x mon compactions, disk writes
>> >> >> >>
>> >> >> >> Thank you, Eugen.
>> >> >> >>
>> >> >> >> I'm specifically interested in finding out whether the huge
>> >> >> >> amount of data written by the monitors is expected. It is eating
>> >> >> >> through the endurance of our system drives, which were not specced
>> >> >> >> for high DWPD/TBW, as this is not a documented requirement, and the
>> >> >> >> monitors produce hundreds of gigabytes of writes per day. I am
>> >> >> >> looking for ways to reduce the amount of writes, if possible.
>> >> >> >>
>> >> >> >> /Z
>> >> >> >>
>> >> >> >> On Wed, 11 Oct 2023 at 12:41, Eugen Block <eblock@xxxxxx> wrote:
>> >> >> >>
>> >> >> >> > Hi,
>> >> >> >> >
>> >> >> >> > What you report is the expected behaviour, at least I see the
>> >> >> >> > same on all clusters. I can't answer why the compaction is
>> >> >> >> > required that often, but you can control the log level of the
>> >> >> >> > rocksdb output:
>> >> >> >> >
>> >> >> >> > ceph config set mon debug_rocksdb 1/5 (default is 4/5)
>> >> >> >> >
>> >> >> >> > This reduces the log entries, and you wouldn't see the manual
>> >> >> >> > compaction logs anymore. There are a couple more rocksdb options,
>> >> >> >> > but I probably wouldn't change too much, and only if you know
>> >> >> >> > what you're doing. Maybe Igor can comment on whether some other
>> >> >> >> > tuning makes sense here.
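>> >> >> >> >
>> >> >> >> > (For reference, the currently active level can be checked before
>> >> >> >> > changing it with:
>> >> >> >> >
>> >> >> >> > ceph config get mon debug_rocksdb )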
>> >> >> >> >
>> >> >> >> > Regards,
>> >> >> >> > Eugen
>> >> >> >> >
>> >> >> >> > Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>> >> >> >> >
>> >> >> >> > > Any input from anyone, please?
>> >> >> >> > >
>> >> >> >> > > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko
>> >> >> >> > > <zakhar@xxxxxxxxx> wrote:
>> >> >> >> > >
>> >> >> >> > >> Any input from anyone, please?
>> >> >> >> > >>
>> >> >> >> > >> It's another thing that seems to be rather poorly documented:
>> >> >> >> > >> it's unclear what to expect, what 'normal' behavior should be,
>> >> >> >> > >> and what can be done about the huge amount of writes by
>> >> >> >> > >> monitors.
>> >> >> >> > >>
>> >> >> >> > >> /Z
>> >> >> >> > >>
>> >> >> >> > >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko
>> >> >> >> > >> <zakhar@xxxxxxxxx> wrote:
>> >> >> >> > >>
>> >> >> >> > >>> Hi,
>> >> >> >> > >>>
>> >> >> >> > >>> Monitors in our 16.2.14 cluster appear to quite often run
>> >> >> >> > >>> "manual compaction" tasks:
>> >> >> >> > >>>
>> >> >> >> > >>> debug 2023-10-09T09:30:53.888+0000 7f48a329a700  4 rocksdb:
>> >> >> >> > EVENT_LOG_v1
>> >> >> >> > >>> {"time_micros": 1696843853892760, "job": 64225, "event":
>> >> >> >> > "flush_started",
>> >> >> >> > >>> "num_memtables": 1, "num_entries": 715, "num_deletes": 251,
>> >> >> >> > >>> "total_data_size": 3870352, "memory_usage": 3886744,
>> >> >> "flush_reason":
>> >> >> >> > >>> "Manual Compaction"}
>> >> >> >> > >>> debug 2023-10-09T09:30:53.904+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb:
>> >> >> (Original
>> >> >> >> > Log
>> >> >> >> > >>> Time 2023/10/09-09:30:53.910204)
>> >> >> >> > [db_impl/db_impl_compaction_flush.cc:2516]
>> >> >> >> > >>> [default] Manual compaction from level-0 to level-5 from
>> >> 'paxos ..
>> >> >> >> > 'paxos;
>> >> >> >> > >>> will stop at (end)
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb:
>> >> >> (Original
>> >> >> >> > Log
>> >> >> >> > >>> Time 2023/10/09-09:30:53.911004)
>> >> >> >> > [db_impl/db_impl_compaction_flush.cc:2516]
>> >> >> >> > >>> [default] Manual compaction from level-5 to level-6 from
>> >> 'paxos ..
>> >> >> >> > 'paxos;
>> >> >> >> > >>> will stop at (end)
>> >> >> >> > >>> debug 2023-10-09T09:32:08.956+0000 7f48a329a700  4 rocksdb:
>> >> >> >> > EVENT_LOG_v1
>> >> >> >> > >>> {"time_micros": 1696843928961390, "job": 64228, "event":
>> >> >> >> > "flush_started",
>> >> >> >> > >>> "num_memtables": 1, "num_entries": 1580, "num_deletes":
>> 502,
>> >> >> >> > >>> "total_data_size": 8404605, "memory_usage": 8465840,
>> >> >> "flush_reason":
>> >> >> >> > >>> "Manual Compaction"}
>> >> >> >> > >>> debug 2023-10-09T09:32:08.972+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb:
>> >> >> (Original
>> >> >> >> > Log
>> >> >> >> > >>> Time 2023/10/09-09:32:08.977739)
>> >> >> >> > [db_impl/db_impl_compaction_flush.cc:2516]
>> >> >> >> > >>> [default] Manual compaction from level-0 to level-5 from
>> 'logm
>> >> ..
>> >> >> >> > 'logm;
>> >> >> >> > >>> will stop at (end)
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb:
>> >> >> (Original
>> >> >> >> > Log
>> >> >> >> > >>> Time 2023/10/09-09:32:08.978512)
>> >> >> >> > [db_impl/db_impl_compaction_flush.cc:2516]
>> >> >> >> > >>> [default] Manual compaction from level-5 to level-6 from
>> 'logm
>> >> ..
>> >> >> >> > 'logm;
>> >> >> >> > >>> will stop at (end)
>> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.028+0000 7f48a329a700  4 rocksdb:
>> >> >> >> > EVENT_LOG_v1
>> >> >> >> > >>> {"time_micros": 1696844009033151, "job": 64231, "event":
>> >> >> >> > "flush_started",
>> >> >> >> > >>> "num_memtables": 1, "num_entries": 1430, "num_deletes":
>> 251,
>> >> >> >> > >>> "total_data_size": 8975535, "memory_usage": 9035920,
>> >> >> "flush_reason":
>> >> >> >> > >>> "Manual Compaction"}
>> >> >> >> > >>> debug 2023-10-09T09:33:29.044+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb:
>> >> >> (Original
>> >> >> >> > Log
>> >> >> >> > >>> Time 2023/10/09-09:33:29.049585)
>> >> >> >> > [db_impl/db_impl_compaction_flush.cc:2516]
>> >> >> >> > >>> [default] Manual compaction from level-0 to level-5 from
>> >> 'paxos ..
>> >> >> >> > 'paxos;
>> >> >> >> > >>> will stop at (end)
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb:
>> >> >> >> > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual
>> >> >> >> compaction
>> >> >> >> > >>> starting
>> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb:
>> >> >> (Original
>> >> >> >> > Log
>> >> >> >> > >>> Time 2023/10/09-09:33:29.050355)
>> >> >> >> > [db_impl/db_impl_compaction_flush.cc:2516]
>> >> >> >> > >>> [default] Manual compaction from level-5 to level-6 from
>> >> 'paxos ..
>> >> >> >> > 'paxos;
>> >> >> >> > >>> will stop at (end)
>> >> >> >> > >>>
>> >> >> >> > >>> I have removed a lot of interim log messages to save space.
>> >> >> >> > >>>
>> >> >> >> > >>> During each compaction the monitor process writes
>> >> >> >> > >>> approximately 500-600 MB of data to disk over a short period
>> >> >> >> > >>> of time. These writes add up to tens of gigabytes per hour
>> >> >> >> > >>> and hundreds of gigabytes per day.
>> >> >> >> > >>>
>> >> >> >> > >>> Monitor rocksdb and compaction options are default:
>> >> >> >> > >>>
>> >> >> >> > >>>     "mon_compact_on_bootstrap": "false",
>> >> >> >> > >>>     "mon_compact_on_start": "false",
>> >> >> >> > >>>     "mon_compact_on_trim": "true",
>> >> >> >> > >>>     "mon_rocksdb_options":
>> >> >> >> > >>> "write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true",
>> >> >> >> > >>>
>> >> >> >> > >>> Is this expected behavior? Is this something I can adjust
>> >> >> >> > >>> in order to extend the system storage life?
>> >> >> >> > >>>
>> >> >> >> > >>> Best regards,
>> >> >> >> > >>> Zakhar
>> >> >> >> > >>>
>> >> >> >> > >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



