Re: Ceph 16.2.x mon compactions, disk writes

Zakhar Kirpichenko <zakhar@xxxxxxxxx> · Wed, 11 Oct 2023 19:00:26 +0300

Thank you, Frank. This confirms that monitors indeed do this.

Our boot drives in 3 systems are smaller 1 DWPD drives (RAID1 to protect
against a random single drive failure), and over 3 years the mons have eaten
through 60% of their endurance. Other systems have larger boot drives, and
2% of their endurance was used up over 1.5 years.
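
A rough sketch of what those wear figures imply, assuming wear keeps
accumulating linearly (that is my simplification; real SSD wear depends on the
controller and workload). The function name is made up for illustration:

```python
def project_wear(percent_used: float, years_in_service: float) -> tuple[float, float]:
    """Return (percent of endurance used per year, years until 100% is reached).

    Assumes a constant wear rate, which is only an approximation.
    """
    rate = percent_used / years_in_service
    remaining_years = (100.0 - percent_used) / rate
    return rate, remaining_years


# Wear figures from the paragraph above.
for label, used, years in [
    ("small 1 DWPD boot drives", 60.0, 3.0),
    ("larger boot drives", 2.0, 1.5),
]:
    rate, left = project_wear(used, years)
    print(f"{label}: {rate:.1f}%/year, roughly {left:.1f} years left")
```

Under that assumption the small drives run out of rated endurance well within
a normal hardware lifecycle, while the larger drives do not.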

It would still be good to understand why monitors do this, and
whether there is any way to reduce the amount of writes. Unfortunately,
the Ceph documentation is severely lacking in this regard.
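
For anyone who wants to quantify this on their own cluster, here is a minimal
Python sketch that sums "total_output_size" over the "compaction_finished"
events (the EVENT_LOG_v1 lines quoted further down in this thread). It assumes
each event sits on a single, unwrapped log line; the function name is mine and
not part of Ceph, and the sample lines are abbreviated from the quoted log:

```python
import json

MARKER = "EVENT_LOG_v1 "


def compaction_bytes(log_lines):
    """Sum total_output_size over all compaction_finished events."""
    total = 0
    for line in log_lines:
        pos = line.find(MARKER)
        if pos < 0:
            continue  # not a rocksdb event line
        try:
            event = json.loads(line[pos + len(MARKER):])
        except json.JSONDecodeError:
            continue  # wrapped or truncated line, skip it
        if event.get("event") == "compaction_finished":
            total += event.get("total_output_size", 0)
    return total


# Abbreviated sample: only the fields needed here are kept.
sample = [
    'debug ... rocksdb: EVENT_LOG_v1 {"job": 70854, "event": '
    '"compaction_started", "input_data_size": 601117031}',
    'debug ... rocksdb: (Original Log Time ...) EVENT_LOG_v1 {"job": 70854, '
    '"event": "compaction_finished", "total_output_size": 594449971}',
]
print(f"{compaction_bytes(sample) / 1e6:.1f} MB written by compactions")
```

Feeding it a day of mon log lines gives the compaction output for that day,
which can then be compared against what iotop reports.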

I'm copying this to ceph-docs; perhaps someone will find it useful and
adjust the hardware recommendations.
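
As input for such recommendations, a small sketch of how an accumulated iotop
sample translates into daily writes and an implied DWPD figure. It treats
iotop's "M" as decimal megabytes and ignores RAID layout and SSD-internal
write amplification, both of which are simplifications on my part; the helper
names are hypothetical:

```python
SECONDS_PER_DAY = 86_400


def daily_writes_gb(sample_mb: float, sample_seconds: float) -> float:
    """Extrapolate an accumulated iotop sample (in MB) to GB written per day."""
    return sample_mb * (SECONDS_PER_DAY / sample_seconds) / 1000


def required_dwpd(daily_gb: float, drive_gb: float) -> float:
    """Full-drive writes per day implied by a daily write volume."""
    return daily_gb / drive_gb


# 1818.71 M over 300 s: the rocksdb:low0 thread in Frank's iotop output below.
mon_gb = daily_writes_gb(1818.71, 300)
print(f"{mon_gb:.0f} GB/day from the mon compaction thread")
print(f"{required_dwpd(mon_gb, 400):.2f} DWPD if this landed on one 400 GB drive")
```

Frank's 518GB figure comes from rounding the sample to 1.8GB first; the
unrounded sample gives a slightly higher number.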

/Z

On Wed, 11 Oct 2023, 18:23 Frank Schilder, <frans@xxxxxx> wrote:

> Oh wow! I never bothered looking, because on our hardware the wear is so
> low:
>
> # iotop -ao -bn 2 -d 300
> Total DISK READ :       0.00 B/s | Total DISK WRITE :       6.46 M/s
> Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       6.47 M/s
>     TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>    2230 be/4 ceph          0.00 B   1818.71 M  0.00 %  0.46 % ceph-mon
> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
> [rocksdb:low0]
>    2256 be/4 ceph          0.00 B     19.27 M  0.00 %  0.43 % ceph-mon
> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
> [safe_timer]
>    2250 be/4 ceph          0.00 B     42.38 M  0.00 %  0.26 % ceph-mon
> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
> [fn_monstore]
>    2231 be/4 ceph          0.00 B     58.36 M  0.00 %  0.01 % ceph-mon
> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65
> [rocksdb:high0]
>     644 be/3 root          0.00 B    576.00 K  0.00 %  0.00 % [jbd2/sda3-8]
>    2225 be/4 ceph          0.00 B    128.00 K  0.00 %  0.00 % ceph-mon
> --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01
> --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [log]
> 1637141 be/4 root          0.00 B      0.00 B  0.00 %  0.00 %
> [kworker/u113:2-flush-8:0]
> 1636453 be/4 root          0.00 B      0.00 B  0.00 %  0.00 %
> [kworker/u112:0-ceph0]
>    1560 be/4 root          0.00 B     20.00 K  0.00 %  0.00 % rsyslogd -n
> [in:imjournal]
>    1561 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % rsyslogd -n
> [rs:main Q:Reg]
>
> 1.8GB every 5 minutes, that's 518GB per day. The 400G drives we have are
> rated 10 DWPD, and with the 6-drive RAID10 config this gives plenty of
> lifetime. I guess this write load will kill any low-grade SSD (typical
> boot devices, even enterprise ones), especially if it's a smaller drive and
> the controller doesn't reallocate cells according to remaining write
> endurance.
>
> I guess there was a reason for the recommendations by Dell. I always
> thought that the recent recommendations for MON store storage in the ceph
> docs are a "bit unrealistic", apparently both in size and in performance
> (including endurance). Well, I guess you need to look for write-intensive
> drives with decent specs. If you do, also go for sufficient size. This will
> absorb temporary usage peaks, which can be very large, and also provide extra
> endurance on SSDs with good controllers.
>
> I also think the recommendations in the ceph docs deserve a reality check.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> Sent: Wednesday, October 11, 2023 4:30 PM
> To: Eugen Block
> Cc: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re:  Re: Ceph 16.2.x mon compactions, disk writes
>
> Eugen,
>
> Thanks for your response. May I ask what numbers you're referring to?
>
> I am not referring to monitor store.db sizes. I am specifically referring
> to the writes monitors make to their store.db files by frequently rotating
> them and replacing them with new versions during compactions. The size of
> the store.db remains more or less the same.
>
> This is a 300s iotop snippet, sorted by aggregated disk writes:
>
> Total DISK READ:        35.56 M/s | Total DISK WRITE:        23.89 M/s
> Current DISK READ:      35.64 M/s | Current DISK WRITE:      24.09 M/s
>     TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
>    4919 be/4 167          16.75 M      2.24 G  0.00 %  1.34 % ceph-mon -n
> mon.ceph03 -f --setuser ceph --setgr~lt-mon-cluster-log-to-stderr=true
> [rocksdb:low0]
>   15122 be/4 167           0.00 B    652.91 M  0.00 %  0.27 % ceph-osd -n
> osd.31 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17073 be/4 167           0.00 B    651.86 M  0.00 %  0.27 % ceph-osd -n
> osd.32 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17268 be/4 167           0.00 B    490.86 M  0.00 %  0.18 % ceph-osd -n
> osd.25 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   18032 be/4 167           0.00 B    463.57 M  0.00 %  0.17 % ceph-osd -n
> osd.26 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   16855 be/4 167           0.00 B    402.86 M  0.00 %  0.15 % ceph-osd -n
> osd.22 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17406 be/4 167           0.00 B    387.03 M  0.00 %  0.14 % ceph-osd -n
> osd.27 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17932 be/4 167           0.00 B    375.42 M  0.00 %  0.13 % ceph-osd -n
> osd.29 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   18017 be/4 167           0.00 B    359.38 M  0.00 %  0.13 % ceph-osd -n
> osd.28 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17420 be/4 167           0.00 B    332.83 M  0.00 %  0.12 % ceph-osd -n
> osd.23 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17975 be/4 167           0.00 B    312.06 M  0.00 %  0.11 % ceph-osd -n
> osd.30 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>   17273 be/4 167           0.00 B    303.49 M  0.00 %  0.11 % ceph-osd -n
> osd.24 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug
> [bstore_kv_sync]
>
> This is not a great example, because the mon sometimes writes more
> intensively, but it is very apparent that thread 4919 of the monitor process
> is the top disk writer in the system.
>
> This is the mon thread producing lots of writes:
>
>    4919 167       20   0 2031116   1.1g  10652 S   0.0   0.3 288:48.65
> rocksdb:low0
>
> Then, with a combination of lsof and sysdig, I determined that the writes are
> being made to /var/lib/ceph/mon/ceph-ceph03/store.db/*.sst, i.e. the mon's
> rocksdb store:
>
> ceph-mon 4838      167  200r      REG             253,11 67319253 14812899
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677146.sst
> ceph-mon 4838      167  203r      REG             253,11 67228736 14813270
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677147.sst
> ceph-mon 4838      167  205r      REG             253,11 67243212 14813275
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677148.sst
> ceph-mon 4838      167  208r      REG             253,11 67247953 14813316
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677149.sst
> ceph-mon 4838      167  220r      REG             253,11 67261659 14813332
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677150.sst
> ceph-mon 4838      167  221r      REG             253,11 67242500 14813345
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677151.sst
> ceph-mon 4838      167  224r      REG             253,11 67264969 14813348
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677152.sst
> ceph-mon 4838      167  228r      REG             253,11 64346933 14813381
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677153.sst
>
> By matching iotop and sysdig write records to the mon's log entries, I see
> that the writes happen during "manual compaction" events (whatever they
> are, because there's no documentation on this whatsoever). Each time,
> around 0.56GB is written to disk as a new set of *.sst files, which
> is the total size of the store.db. It looks like from time to time the
> monitor just reads its store.db and writes it out to a new set of files, as
> the numbers in the file names increase with each write:
>
> ceph-mon 4838      167  175r      REG             253,11 67220863 14812310
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677167.sst
> ceph-mon 4838      167  200r      REG             253,11 67358627 14812899
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677168.sst
> ceph-mon 4838      167  203r      REG             253,11 67277978 14813270
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677169.sst
> ceph-mon 4838      167  205r      REG             253,11 67256312 14813275
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677170.sst
> ceph-mon 4838      167  208r      REG             253,11 67226761 14813316
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677171.sst
> ceph-mon 4838      167  220r      REG             253,11 67258798 14813332
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677172.sst
> ceph-mon 4838      167  221r      REG             253,11 67224665 14813345
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677173.sst
> ceph-mon 4838      167  224r      REG             253,11 67224123 14813348
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677174.sst
> ceph-mon 4838      167  228r      REG             253,11 62195349 14813381
> /var/lib/ceph/mon/ceph-ceph03/store.db/3677175.sst
>
> I hope this clears up the situation.
>
> Do you observe this behavior in your clusters? Can you please check
> whether your mons do something similar and store.db/*.sst change often?
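
A minimal sketch of one way to run that check without sysdig: take two
snapshots of the *.sst file names in store.db and see how many were replaced.
The helper names are hypothetical; the demo sets below are simply the file
numbers from the two lsof listings above:

```python
import os


def sst_snapshot(store_dir):
    """Return the set of *.sst file names currently present in store_dir."""
    return {name for name in os.listdir(store_dir) if name.endswith(".sst")}


def replaced_fraction(before, after):
    """Fraction of the files in `before` that no longer exist in `after`."""
    if not before:
        return 0.0
    return len(before - after) / len(before)


# File numbers from the first and second lsof listing, respectively.
before = {f"{n}.sst" for n in range(3677146, 3677154)}
after = {f"{n}.sst" for n in range(3677167, 3677176)}
print(f"{replaced_fraction(before, after):.0%} of the earlier files were replaced")
```

On a live mon one would call sst_snapshot() on the store.db directory (here
/var/lib/ceph/mon/ceph-ceph03/store.db) twice, a few minutes apart, and
compare the results.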
>
> /Z
>
> On Wed, 11 Oct 2023 at 16:22, Eugen Block <eblock@xxxxxx> wrote:
> That all looks normal to me, to be honest. Can you show some details
> how you calculate the "hundreds of GB per day"? I see similar stats as
> Frank on different clusters with different client IO.
>
> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Sure, nothing unusual there:
> >
> > -------
> >
> >   cluster:
> >     id:     3f50555a-ae2a-11eb-a2fc-ffde44714d86
> >     health: HEALTH_OK
> >
> >   services:
> >     mon: 5 daemons, quorum ceph01,ceph03,ceph04,ceph05,ceph02 (age 2w)
> >     mgr: ceph01.vankui(active, since 12d), standbys: ceph02.shsinf
> >     osd: 96 osds: 96 up (since 2w), 95 in (since 3w)
> >
> >   data:
> >     pools:   10 pools, 2400 pgs
> >     objects: 6.23M objects, 16 TiB
> >     usage:   61 TiB used, 716 TiB / 777 TiB avail
> >     pgs:     2396 active+clean
> >              3    active+clean+scrubbing+deep
> >              1    active+clean+scrubbing
> >
> >   io:
> >     client:   2.7 GiB/s rd, 27 MiB/s wr, 46.95k op/s rd, 2.17k op/s wr
> >
> > -------
> >
> > Please disregard the big read number; a customer is running a
> > read-intensive job. Mon store writes keep happening when the cluster is
> > much quieter, thus I think that intensive reads have no effect on the
> > mons.
> >
> > Mgr:
> >
> >     "always_on_modules": [
> >         "balancer",
> >         "crash",
> >         "devicehealth",
> >         "orchestrator",
> >         "pg_autoscaler",
> >         "progress",
> >         "rbd_support",
> >         "status",
> >         "telemetry",
> >         "volumes"
> >     ],
> >     "enabled_modules": [
> >         "cephadm",
> >         "dashboard",
> >         "iostat",
> >         "prometheus",
> >         "restful"
> >     ],
> >
> > -------
> >
> > /Z
> >
> >
> > On Wed, 11 Oct 2023 at 14:50, Eugen Block <eblock@xxxxxx> wrote:
> >
> >> Can you add some more details as requested by Frank? Which mgr modules
> >> are enabled? What's the current 'ceph -s' output?
> >>
> >> > Is autoscaler running and doing stuff?
> >> > Is balancer running and doing stuff?
> >> > Is backfill going on?
> >> > Is recovery going on?
> >> > Is your ceph version affected by the "excessive logging to MON
> >> > store" issue that was present starting with pacific but should have
> >> > been addressed
> >>
> >>
> >> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >>
> >> > We don't use CephFS at all and don't have RBD snapshots apart from some
> >> > cloning for OpenStack images.
> >> >
> >> > The size of mon stores isn't an issue, it's < 600 MB. But it gets
> >> > overwritten often, causing lots of disk writes, and that is an issue for
> >> > us.
> >> >
> >> > /Z
> >> >
> >> > On Wed, 11 Oct 2023 at 14:37, Eugen Block <eblock@xxxxxx> wrote:
> >> >
> >> >> Do you use many snapshots (rbd or cephfs)? That can cause heavy
> >> >> monitor usage; we've seen large mon stores on customer clusters with
> >> >> snapshot-based rbd mirroring. In a healthy cluster they have mon
> >> >> stores of around 2GB in size.
> >> >>
> >> >> >> @Eugen: Was there not an option to limit logging to the MON store?
> >> >>
> >> >> I don't recall at the moment, worth checking though.
> >> >>
> >> >> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >> >>
> >> >> > Thank you, Frank.
> >> >> >
> >> >> > The cluster is healthy, operating normally, nothing unusual is going
> >> >> > on. We observe lots of writes by mon processes into mon rocksdb stores,
> >> >> > specifically:
> >> >> >
> >> >> > /var/lib/ceph/mon/ceph-cephXX/store.db:
> >> >> > 65M     3675511.sst
> >> >> > 65M     3675512.sst
> >> >> > 65M     3675513.sst
> >> >> > 65M     3675514.sst
> >> >> > 65M     3675515.sst
> >> >> > 65M     3675516.sst
> >> >> > 65M     3675517.sst
> >> >> > 65M     3675518.sst
> >> >> > 62M     3675519.sst
> >> >> >
> >> >> > The size of the files is not huge, but monitors rotate and write out
> >> >> > these files often, sometimes several times per minute, resulting in lots
> >> >> > of data written to disk. The writes coincide with "manual compaction"
> >> >> > events logged by the monitors, for example:
> >> >> >
> >> >> > debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1676] [default] [JOB 70854] Compacting 1@5 + 9@6 files to L6, score -1.00
> >> >> > debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1697022610487624, "job": 70854, "event": "compaction_started", "compaction_reason": "ManualCompaction", "files_L5": [3675543], "files_L6": [3675533, 3675534, 3675535, 3675536, 3675537, 3675538, 3675539, 3675540, 3675541], "score": -1, "input_data_size": 601117031}
> >> >> > debug 2023-10-11T11:10:10.619+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675544: 2015 keys, 67287115 bytes
> >> >> > debug 2023-10-11T11:10:10.763+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675545: 24343 keys, 67336225 bytes
> >> >> > debug 2023-10-11T11:10:10.899+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675546: 1196 keys, 67225813 bytes
> >> >> > debug 2023-10-11T11:10:11.035+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675547: 1049 keys, 67252678 bytes
> >> >> > debug 2023-10-11T11:10:11.167+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675548: 1081 keys, 67216638 bytes
> >> >> > debug 2023-10-11T11:10:11.303+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675549: 1196 keys, 67245376 bytes
> >> >> > debug 2023-10-11T11:10:12.023+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675550: 1195 keys, 67246813 bytes
> >> >> > debug 2023-10-11T11:10:13.059+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675551: 1205 keys, 67223302 bytes
> >> >> > debug 2023-10-11T11:10:13.903+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675552: 1312 keys, 56416011 bytes
> >> >> > debug 2023-10-11T11:10:13.911+0000 7f48a3a9b700  4 rocksdb: [compaction/compaction_job.cc:1415] [default] [JOB 70854] Compacted 1@5 + 9@6 files to L6 => 594449971 bytes
> >> >> > debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/11-11:10:13.920991) [compaction/compaction_job.cc:760] [default] compacted to: base level 5 level multiplier 10.00 max bytes base 268435456 files[0 0 0 0 0 0 9] max score 0.00, MB/sec: 175.8 rd, 173.9 wr, level 6, files in(1, 9) out(9) MB in(0.3, 572.9) out(566.9), read-write-amplify(3434.6) write-amplify(1707.7) OK, records in: 35108, records dropped: 516 output_compression: NoCompression
> >> >> > debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/11-11:10:13.921010) EVENT_LOG_v1 {"time_micros": 1697022613921002, "job": 70854, "event": "compaction_finished", "compaction_time_micros": 3418822, "compaction_time_cpu_micros": 785454, "output_level": 6, "num_output_files": 9, "total_output_size": 594449971, "num_input_records": 35108, "num_output_records": 34592, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 9]}
> >> >> >
> >> >> > The log even mentions the huge write amplification. I wonder whether
> >> >> > this is normal and what can be done about it.
> >> >> >
> >> >> > /Z
> >> >> >
> >> >> > On Wed, 11 Oct 2023 at 13:55, Frank Schilder <frans@xxxxxx> wrote:
> >> >> >
> >> >> >> I need to ask here: where exactly do you observe the hundreds of GB
> >> >> >> written per day? Are the mon logs huge? Is it the mon store? Is your
> >> >> >> cluster unhealthy?
> >> >> >>
> >> >> >> We have an octopus cluster with 1282 OSDs, 1650 ceph fs clients and
> >> >> >> about 800 librbd clients. Per week our mon logs are about 70M, the
> >> >> >> cluster logs about 120M, the audit logs about 70M, and I see between
> >> >> >> 100-200Kb/s writes to the mon store. That's in the lower-digit GB range
> >> >> >> per day. Hundreds of GB per day sound completely over the top on a
> >> >> >> healthy cluster, unless you have MGR modules changing the OSD/cluster
> >> >> >> map continuously.
> >> >> >>
> >> >> >> Is autoscaler running and doing stuff?
> >> >> >> Is balancer running and doing stuff?
> >> >> >> Is backfill going on?
> >> >> >> Is recovery going on?
> >> >> >> Is your ceph version affected by the "excessive logging to MON store"
> >> >> >> issue that was present starting with pacific but should have been
> >> >> >> addressed by now?
> >> >> >>
> >> >> >> @Eugen: Was there not an option to limit logging to the MON store?
> >> >> >>
> >> >> >> For information to readers, we followed old recommendations from a Dell
> >> >> >> white paper for building a ceph cluster and have a 1TB RAID10 array on
> >> >> >> 6x write-intensive SSDs for the MON stores. After 5 years we are below
> >> >> >> 10% wear. Average size of the MON store for a healthy cluster is
> >> >> >> 500M-1G, but we have seen this ballooning to 100+GB in degraded
> >> >> >> conditions.
> >> >> >>
> >> >> >> Best regards,
> >> >> >> =================
> >> >> >> Frank Schilder
> >> >> >> AIT Risø Campus
> >> >> >> Bygning 109, rum S14
> >> >> >>
> >> >> >> ________________________________________
> >> >> >> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> >> >> >> Sent: Wednesday, October 11, 2023 12:00 PM
> >> >> >> To: Eugen Block
> >> >> >> Cc: ceph-users@xxxxxxx
> >> >> >> Subject:  Re: Ceph 16.2.x mon compactions, disk writes
> >> >> >>
> >> >> >> Thank you, Eugen.
> >> >> >>
> >> >> >> I'm interested specifically to find out whether the huge amount of data
> >> >> >> written by monitors is expected. It is eating through the endurance of
> >> >> >> our system drives, which were not specced for high DWPD/TBW, as this is
> >> >> >> not a documented requirement, and monitors produce hundreds of gigabytes
> >> >> >> of writes per day. I am looking for ways to reduce the amount of writes,
> >> >> >> if possible.
> >> >> >>
> >> >> >> /Z
> >> >> >>
> >> >> >> On Wed, 11 Oct 2023 at 12:41, Eugen Block <eblock@xxxxxx> wrote:
> >> >> >>
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > what you report is the expected behaviour, at least I see the same on
> >> >> >> > all clusters. I can't answer why the compaction is required that
> >> >> >> > often, but you can control the log level of the rocksdb output:
> >> >> >> >
> >> >> >> > ceph config set mon debug_rocksdb 1/5 (default is 4/5)
> >> >> >> >
> >> >> >> > This reduces the log entries and you wouldn't see the manual
> >> >> >> > compaction logs anymore. There are a couple more rocksdb options but I
> >> >> >> > probably wouldn't change too much, only if you know what you're doing.
> >> >> >> > Maybe Igor can comment if some other tuning makes sense here.
> >> >> >> >
> >> >> >> > Regards,
> >> >> >> > Eugen
> >> >> >> >
> >> >> >> > Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >> >> >> >
> >> >> >> > > Any input from anyone, please?
> >> >> >> > >
> >> >> >> > > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >> >> >> > >
> >> >> >> > >> Any input from anyone, please?
> >> >> >> > >>
> >> >> >> > >> It's another thing that seems to be rather poorly documented: it's
> >> >> >> > >> unclear what to expect, what 'normal' behavior should be, and what can
> >> >> >> > >> be done about the huge amount of writes by monitors.
> >> >> >> > >>
> >> >> >> > >> /Z
> >> >> >> > >>
> >> >> >> > >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >> >> >> > >>
> >> >> >> > >>> Hi,
> >> >> >> > >>>
> >> >> >> > >>> Monitors in our 16.2.14 cluster appear to quite often run "manual
> >> >> >> > >>> compaction" tasks:
> >> >> >> > >>>
> >> >> >> > >>> debug 2023-10-09T09:30:53.888+0000 7f48a329a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843853892760, "job": 64225, "event": "flush_started", "num_memtables": 1, "num_entries": 715, "num_deletes": 251, "total_data_size": 3870352, "memory_usage": 3886744, "flush_reason": "Manual Compaction"}
> >> >> >> > >>> debug 2023-10-09T09:30:53.904+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:30:53.910204) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'paxos .. 'paxos; will stop at (end)
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:30:53.911004) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'paxos .. 'paxos; will stop at (end)
> >> >> >> > >>> debug 2023-10-09T09:32:08.956+0000 7f48a329a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843928961390, "job": 64228, "event": "flush_started", "num_memtables": 1, "num_entries": 1580, "num_deletes": 502, "total_data_size": 8404605, "memory_usage": 8465840, "flush_reason": "Manual Compaction"}
> >> >> >> > >>> debug 2023-10-09T09:32:08.972+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:32:08.977739) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'logm .. 'logm; will stop at (end)
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:32:08.978512) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'logm .. 'logm; will stop at (end)
> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.028+0000 7f48a329a700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696844009033151, "job": 64231, "event": "flush_started", "num_memtables": 1, "num_entries": 1430, "num_deletes": 251, "total_data_size": 8975535, "memory_usage": 9035920, "flush_reason": "Manual Compaction"}
> >> >> >> > >>> debug 2023-10-09T09:33:29.044+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:33:29.049585) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'paxos .. 'paxos; will stop at (end)
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700  4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> >> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700  4 rocksdb: (Original Log Time 2023/10/09-09:33:29.050355) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'paxos .. 'paxos; will stop at (end)
> >> >> >> > >>>
> >> >> >> > >>> I have removed a lot of interim log messages to save space.
> >> >> >> > >>>
> >> >> >> > >>> During each compaction the monitor process writes approximately
> >> >> >> > >>> 500-600 MB of data to disk over a short period of time. These writes
> >> >> >> > >>> add up to tens of gigabytes per hour and hundreds of gigabytes per day.
> >> >> >> > >>>
> >> >> >> > >>> Monitor rocksdb and compaction options are default:
> >> >> >> > >>>
> >> >> >> > >>>     "mon_compact_on_bootstrap": "false",
> >> >> >> > >>>     "mon_compact_on_start": "false",
> >> >> >> > >>>     "mon_compact_on_trim": "true",
> >> >> >> > >>>     "mon_rocksdb_options": "write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true",
> >> >> >> > >>>
> >> >> >> > >>> Is this expected behavior? Is this something I can adjust in order to
> >> >> >> > >>> extend the system storage life?
> >> >> >> > >>>
> >> >> >> > >>> Best regards,
> >> >> >> > >>> Zakhar
> >> >> >> > >>>
> >> >> >> > >>
> >> >> >> > > _______________________________________________
> >> >> >> > > ceph-users mailing list -- ceph-users@xxxxxxx
> >> >> >> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx