We don't use CephFS at all and don't have RBD snapshots apart from some cloning for OpenStack images. The size of the mon stores isn't an issue, it's < 600 MB, but the stores get overwritten often, causing lots of disk writes, and that is an issue for us.

/Z

On Wed, 11 Oct 2023 at 14:37, Eugen Block <eblock@xxxxxx> wrote:

> Do you use many snapshots (rbd or cephfs)? That can cause heavy
> monitor usage; we've seen large mon stores on customer clusters with
> rbd mirroring on a snapshot basis. In a healthy cluster they have mon
> stores of around 2 GB in size.
>
> >> @Eugen: Was there not an option to limit logging to the MON store?
>
> I don't recall at the moment, worth checking though.
>
> Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
>
> > Thank you, Frank.
> >
> > The cluster is healthy, operating normally, nothing unusual is going
> > on. We observe lots of writes by mon processes into the mon rocksdb
> > stores, specifically:
> >
> > /var/lib/ceph/mon/ceph-cephXX/store.db:
> > 65M 3675511.sst
> > 65M 3675512.sst
> > 65M 3675513.sst
> > 65M 3675514.sst
> > 65M 3675515.sst
> > 65M 3675516.sst
> > 65M 3675517.sst
> > 65M 3675518.sst
> > 62M 3675519.sst
> >
> > The size of the files is not huge, but the monitors rotate and write
> > out these files often, sometimes several times per minute, resulting
> > in lots of data written to disk. The writes coincide with "manual
> > compaction" events logged by the monitors, for example:
> >
> > debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1676] [default] [JOB 70854] Compacting 1@5 + 9@6 files to L6, score -1.00
> > debug 2023-10-11T11:10:10.483+0000 7f48a3a9b700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1697022610487624, "job": 70854, "event": "compaction_started", "compaction_reason": "ManualCompaction", "files_L5": [3675543], "files_L6": [3675533, 3675534, 3675535, 3675536, 3675537, 3675538, 3675539, 3675540, 3675541], "score": -1, "input_data_size": 601117031}
> > debug 2023-10-11T11:10:10.619+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675544: 2015 keys, 67287115 bytes
> > debug 2023-10-11T11:10:10.763+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675545: 24343 keys, 67336225 bytes
> > debug 2023-10-11T11:10:10.899+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675546: 1196 keys, 67225813 bytes
> > debug 2023-10-11T11:10:11.035+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675547: 1049 keys, 67252678 bytes
> > debug 2023-10-11T11:10:11.167+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675548: 1081 keys, 67216638 bytes
> > debug 2023-10-11T11:10:11.303+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675549: 1196 keys, 67245376 bytes
> > debug 2023-10-11T11:10:12.023+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675550: 1195 keys, 67246813 bytes
> > debug 2023-10-11T11:10:13.059+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675551: 1205 keys, 67223302 bytes
> > debug 2023-10-11T11:10:13.903+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1349] [default] [JOB 70854] Generated table #3675552: 1312 keys, 56416011 bytes
> > debug 2023-10-11T11:10:13.911+0000 7f48a3a9b700 4 rocksdb: [compaction/compaction_job.cc:1415] [default] [JOB 70854] Compacted 1@5 + 9@6 files to L6 => 594449971 bytes
> > debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/11-11:10:13.920991) [compaction/compaction_job.cc:760] [default] compacted to: base level 5 level multiplier 10.00 max bytes base 268435456 files[0 0 0 0 0 0 9] max score 0.00, MB/sec: 175.8 rd, 173.9 wr, level 6, files in(1, 9) out(9) MB in(0.3, 572.9) out(566.9), read-write-amplify(3434.6) write-amplify(1707.7) OK, records in: 35108, records dropped: 516 output_compression: NoCompression
> > debug 2023-10-11T11:10:13.915+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/11-11:10:13.921010) EVENT_LOG_v1 {"time_micros": 1697022613921002, "job": 70854, "event": "compaction_finished", "compaction_time_micros": 3418822, "compaction_time_cpu_micros": 785454, "output_level": 6, "num_output_files": 9, "total_output_size": 594449971, "num_input_records": 35108, "num_output_records": 34592, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 9]}
> >
> > The log itself points out the huge write multiplication:
> > write-amplify(1707.7) means roughly 567 MB were rewritten for about
> > 0.33 MB of new input. I wonder whether this is normal and what can be
> > done about it.
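> >
> > In case anyone wants to reproduce the measurement, this is roughly how
> > we quantify the writes independently of the rocksdb log (a sketch only;
> > it assumes a non-containerized mon whose process name matches
> > "ceph-mon", and reading /proc/<pid>/io requires root):
> >
> > # Sample the mon's cumulative block-layer write counter twice, 60 s apart.
> > MON_PID=$(pgrep -o -f ceph-mon)
> > before=$(awk '/^write_bytes/ {print $2}' /proc/"$MON_PID"/io)
> > sleep 60
> > after=$(awk '/^write_bytes/ {print $2}' /proc/"$MON_PID"/io)
> > # Extrapolate the one-minute delta to GiB per day (1440 minutes/day).
> > echo "$(( (after - before) * 1440 / 1024 / 1024 / 1024 )) GiB/day"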
> >
> > /Z
> >
> > On Wed, 11 Oct 2023 at 13:55, Frank Schilder <frans@xxxxxx> wrote:
> >
> >> I need to ask here: where exactly do you observe the hundreds of GB
> >> written per day? Are the mon logs huge? Is it the mon store? Is your
> >> cluster unhealthy?
> >>
> >> We have an octopus cluster with 1282 OSDs, 1650 ceph fs clients and
> >> about 800 librbd clients. Per week our mon logs are about 70M, the
> >> cluster logs about 120M and the audit logs about 70M, and I see
> >> between 100-200 Kb/s of writes to the mon store. That's in the low
> >> single-digit GB range per day. Hundreds of GB per day sounds
> >> completely over the top on a healthy cluster, unless you have MGR
> >> modules changing the OSD/cluster map continuously.
> >>
> >> Is the autoscaler running and doing stuff?
> >> Is the balancer running and doing stuff?
> >> Is backfill going on?
> >> Is recovery going on?
> >> Is your ceph version affected by the "excessive logging to MON store"
> >> issue that was present starting with pacific but should have been
> >> addressed by now?
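> >>
> >> (For the first four questions, the standard ceph CLI gives a quick
> >> answer; run from any node with an admin keyring:
> >>
> >> ceph osd pool autoscale-status   # per-pool autoscaler activity
> >> ceph balancer status             # balancer mode and whether it is active
> >> ceph -s                          # shows any ongoing backfill/recovery
> >>
> >> For the last one, check the release notes of your exact version.)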
> >>
> >> @Eugen: Was there not an option to limit logging to the MON store?
> >>
> >> For the information of readers: we followed old recommendations from
> >> a Dell white paper for building a ceph cluster and have a 1 TB RAID10
> >> array on 6x write-intensive SSDs for the MON stores. After 5 years we
> >> are below 10% wear. The average size of the MON store on a healthy
> >> cluster is 500M-1G, but we have seen it balloon to 100+GB in degraded
> >> conditions.
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: Zakhar Kirpichenko <zakhar@xxxxxxxxx>
> >> Sent: Wednesday, October 11, 2023 12:00 PM
> >> To: Eugen Block
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: Ceph 16.2.x mon compactions, disk writes
> >>
> >> Thank you, Eugen.
> >>
> >> I'm interested specifically in finding out whether the huge amount of
> >> data written by monitors is expected. It is eating through the
> >> endurance of our system drives, which were not specced for high
> >> DWPD/TBW, as this is not a documented requirement, and the monitors
> >> produce hundreds of gigabytes of writes per day. I am looking for
> >> ways to reduce the amount of writes, if possible.
> >>
> >> /Z
> >>
> >> On Wed, 11 Oct 2023 at 12:41, Eugen Block <eblock@xxxxxx> wrote:
> >>
> >> > Hi,
> >> >
> >> > what you report is the expected behaviour, at least I see the same
> >> > on all clusters. I can't answer why compaction is required that
> >> > often, but you can control the log level of the rocksdb output:
> >> >
> >> > ceph config set mon debug_rocksdb 1/5 (default is 4/5)
> >> >
> >> > This reduces the log entries, and you wouldn't see the manual
> >> > compaction logs anymore. There are a couple more rocksdb options,
> >> > but I probably wouldn't change too much, only if you know what
> >> > you're doing. Maybe Igor can comment on whether some other tuning
> >> > makes sense here.
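> >> >
> >> > You can confirm the new level is active with:
> >> >
> >> > ceph config get mon debug_rocksdb
> >> >
> >> > which should then report 1/5.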
> >> >
> >> > Regards,
> >> > Eugen
> >> >
> >> > Quoting Zakhar Kirpichenko <zakhar@xxxxxxxxx>:
> >> >
> >> > > Any input from anyone, please?
> >> > >
> >> > > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >> > >
> >> > >> Any input from anyone, please?
> >> > >>
> >> > >> It's another thing that seems to be rather poorly documented:
> >> > >> it's unclear what to expect, what 'normal' behavior should be,
> >> > >> and what can be done about the huge amount of writes by monitors.
> >> > >>
> >> > >> /Z
> >> > >>
> >> > >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
> >> > >>
> >> > >>> Hi,
> >> > >>>
> >> > >>> Monitors in our 16.2.14 cluster appear to quite often run
> >> > >>> "manual compaction" tasks:
> >> > >>>
> >> > >>> debug 2023-10-09T09:30:53.888+0000 7f48a329a700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843853892760, "job": 64225, "event": "flush_started", "num_memtables": 1, "num_entries": 715, "num_deletes": 251, "total_data_size": 3870352, "memory_usage": 3886744, "flush_reason": "Manual Compaction"}
> >> > >>> debug 2023-10-09T09:30:53.904+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/09-09:30:53.910204) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'paxos .. 'paxos; will stop at (end)
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:30:53.908+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/09-09:30:53.911004) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'paxos .. 'paxos; will stop at (end)
> >> > >>> debug 2023-10-09T09:32:08.956+0000 7f48a329a700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843928961390, "job": 64228, "event": "flush_started", "num_memtables": 1, "num_entries": 1580, "num_deletes": 502, "total_data_size": 8404605, "memory_usage": 8465840, "flush_reason": "Manual Compaction"}
> >> > >>> debug 2023-10-09T09:32:08.972+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/09-09:32:08.977739) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'logm .. 'logm; will stop at (end)
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:08.976+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/09-09:32:08.978512) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'logm .. 'logm; will stop at (end)
> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:32:12.764+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.028+0000 7f48a329a700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696844009033151, "job": 64231, "event": "flush_started", "num_memtables": 1, "num_entries": 1430, "num_deletes": 251, "total_data_size": 8975535, "memory_usage": 9035920, "flush_reason": "Manual Compaction"}
> >> > >>> debug 2023-10-09T09:33:29.044+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/09-09:33:29.049585) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-0 to level-5 from 'paxos .. 'paxos; will stop at (end)
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f4899286700 4 rocksdb: [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction starting
> >> > >>> debug 2023-10-09T09:33:29.048+0000 7f48a3a9b700 4 rocksdb: (Original Log Time 2023/10/09-09:33:29.050355) [db_impl/db_impl_compaction_flush.cc:2516] [default] Manual compaction from level-5 to level-6 from 'paxos .. 'paxos; will stop at (end)
> >> > >>>
> >> > >>> I have removed a lot of interim log messages to save space.
> >> > >>>
> >> > >>> During each compaction the monitor process writes approximately
> >> > >>> 500-600 MB of data to disk over a short period of time. These
> >> > >>> writes add up to tens of gigabytes per hour and hundreds of
> >> > >>> gigabytes per day.
> >> > >>>
> >> > >>> Monitor rocksdb and compaction options are default:
> >> > >>>
> >> > >>> "mon_compact_on_bootstrap": "false",
> >> > >>> "mon_compact_on_start": "false",
> >> > >>> "mon_compact_on_trim": "true",
> >> > >>> "mon_rocksdb_options": "write_buffer_size=33554432,compression=kNoCompression,level_compaction_dynamic_level_bytes=true",
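> >> > >>>
> >> > >>> If any of this is safe to adjust, the obvious candidate to me
> >> > >>> would be compression=kNoCompression, since compressed SSTs
> >> > >>> should mean fewer bytes written per compaction. As an untested
> >> > >>> sketch only (kLZ4Compression is a standard rocksdb value, the
> >> > >>> mons would need a restart to pick it up, and I have not measured
> >> > >>> the effect):
> >> > >>>
> >> > >>> [mon]
> >> > >>> # Defaults as above, but with LZ4 block compression enabled.
> >> > >>> mon_rocksdb_options = write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true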
> >> > >>>
> >> > >>> Is this expected behavior? Is this something I can adjust in
> >> > >>> order to extend the system storage life?
> >> > >>>
> >> > >>> Best regards,
> >> > >>> Zakhar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx